NLP Research Paper
Article
SMS Scam Detection Application Based on Optical Character
Recognition for Image Data Using Unsupervised and Deep
Semi-Supervised Learning
Anjali Shinde, Essa Q. Shahra *, Shadi Basurra, Faisal Saeed, Abdulrahman A. AlSewari and Waheb A. Jabbar *
Abstract: The growing problem of unsolicited text messages (smishing) and data irregularities
necessitates stronger spam detection solutions. This paper explores the development of a sophisticated
model designed to identify smishing messages by understanding the complex relationships among
words, images, and context-specific factors, areas that remain underexplored in existing research. To
address this, we merge a UCI spam dataset of regular text messages with real-world spam data,
leveraging OCR technology for comprehensive analysis. The study employs a combination of traditional
machine learning models, including K-means, Non-Negative Matrix Factorization, and Gaussian
Mixture Models, along with feature extraction techniques such as TF_IDF and PCA. Additionally, deep
learning models like RNN-Flatten, LSTM, and Bi-LSTM are utilized. The selection of these models is
driven by their complementary strengths in capturing both the linear and non-linear relationships
inherent in smishing messages. Machine learning models are chosen for their efficiency in handling
structured text data, while deep learning models are selected for their superior ability to capture
sequential dependencies and contextual nuances. The performance of these models is rigorously
evaluated using metrics like accuracy, precision, recall, and F1 score, enabling a comparative analysis
between the machine learning and deep learning approaches. Notably, the K-means feature extraction
with vectorizer achieved 91.01% accuracy, and the RNN-Flatten model reached 94.13% accuracy,
emerging as the top performer. The rationale behind highlighting these models is their potential to
significantly improve smishing detection rates. For instance, the high accuracy of the RNN-Flatten
model suggests its applicability in real-time spam detection systems, but its computational complexity
might limit scalability in large-scale deployments. Similarly, while K-means with vectorizer excels in
accuracy, it may struggle with the dynamic and evolving nature of smishing attacks, necessitating
continual retraining.
Citation: Shinde, A.; Shahra, E.Q.; Basurra, S.; Saeed, F.; AlSewari, A.A.; Jabbar, W.A. SMS Scam
Detection Application Based on Optical Character Recognition for Image Data Using Unsupervised and
Deep Semi-Supervised Learning. Sensors 2024, 24, 6084. https://doi.org/10.3390/s24186084
Keywords: unsupervised machine learning; deep learning semi supervised; feature extraction;
smishing message
Academic Editor: Adrian Barbu
Received: 8 August 2024
Revised: 11 September 2024
Accepted: 16 September 2024
Published: 20 September 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
Smishing, a portmanteau of SMS and phishing, represents a rapidly escalating mobile
security threat [1], where attackers use text messages to deceive users by including email
IDs, website links, or phone numbers to extract sensitive information or lure victims with
fraudulent offers [2]. Unlike traditional email phishing, smishing leverages the ubiquity and
immediacy of SMS, making it a particularly effective and dangerous attack vector [3]. The
low cost of bulk SMS packages further incentivizes attackers, enabling them to launch
large-scale smishing campaigns with minimal investment [4]. The urgency of addressing this
threat is underscored by recent scams related to COVID-19, insurance, food deliveries, and
government programs, which have resulted in significant financial losses, as reported by
UK Finance [5]. The detection of SMS spam primarily involves a binary classification task,
where messages are labeled as either ’spam’ or ’ham’ [6]. However, the rapidly evolving
tactics employed by smishers complicate this task, necessitating continuous updates to
detection methods [7]. While various techniques have been proposed for classifying SMS
messages, including supervised learning models [8], these approaches often struggle to
keep pace with the dynamic nature of smishing [9]. The need for constant retraining and
the reliance on labeled data are significant drawbacks, particularly in real-world scenarios
where new and evolving threats emerge regularly.
Existing literature on SMS spam detection has predominantly focused on supervised
learning approaches, where labeled datasets are used to train models to differentiate between
spam and legitimate messages [10]. Techniques such as Support Vector Machines (SVM), Naive
Bayes [11], and Random Forests [12] have been extensively explored and have shown promise
in detecting smishing attacks. For instance, Abayomi-Alli et al. [13] provided a comprehensive
review of machine learning models used for SMS spam detection, highlighting their strengths
and limitations. However, these models often require large, labeled datasets, which are not
always available, and they may struggle to adapt to new types of smishing messages that
differ from those seen during training. Recent research has begun to explore unsupervised and
semi-supervised learning techniques as alternatives to traditional supervised methods [14].
These approaches require little or no labeled data, making them well-suited for detecting new and
previously unseen smishing attacks [15,16]. Unsupervised learning, in particular, offers the
advantage of identifying anomalies in data without the need for extensive labeling, which is
both time-consuming and costly [17]. Clustering algorithms, such as K-means and Gaussian
Mixture Models [18], have shown potential in this area by grouping similar messages
together and flagging outliers as potential spam. Rokach and Maimon [19] discussed the
potential of clustering techniques in detecting unknown patterns in data, which is crucial
for adapting to the ever-changing landscape of smishing.
Semi-supervised learning, which leverages a small amount of labeled data alongside a
larger pool of unlabeled data, has also gained traction in recent years [20]. This approach
strikes a balance between the robustness of supervised learning and the flexibility of unsu-
pervised methods, making it particularly effective in scenarios where labeled data is scarce.
Mansoor et al. [21] emphasized the need for semi-supervised methods in spam detection,
noting their ability to improve model performance while reducing the dependency on
labeled data. Despite these advancements, there remains a significant gap in the litera-
ture regarding the application of unsupervised and semi-supervised learning techniques
specifically for smishing detection. While some studies have explored these methods in
the context of general spam detection, few have focused on the unique challenges posed
by smishing, such as the integration of contextual information and the detection of highly
targeted attacks.
This paper aims to address these gaps by developing an AI model that leverages
unsupervised and deep semi-supervised learning to detect and classify SMS messages into
’ham’ and ’spam’ categories. The rationale for choosing these methods lies in their ability to
adapt to the evolving nature of smishing attacks without the need for constant retraining
or large labeled datasets. Unsupervised learning techniques, such as clustering, allow the
model to identify novel smishing patterns, while semi-supervised approaches enable the
incorporation of limited labeled data to refine the model’s accuracy.
The proposed approach not only aligns with but also advances the current state of
research by focusing on the practical application of these methods in real-world scenarios.
By addressing the limitations of traditional supervised models, our work offers a more
adaptable and scalable solution to the problem of smishing detection. The structure of this
paper is as follows: Section 2 presents the most recent related work, Section 3 explains the
feature extraction, Section 4 explains the methodology applied, Section 5 elaborates on the
results from all AI models, Section 6 shows the real-time detection and finally, Section 7
concludes the work and outlines the new directions for future research.
Sensors 2024, 24, 6084 3 of 19
2. Related Works
The advent of smartphones has transformed communication, and the detection of SMS
spam has emerged as a critical area of research. Researchers have turned to machine learn-
ing and data mining to develop effective spam filtering methods, essential for bolstering
text message security and usability. Deep learning and regular expressions were employed
in [22] to detect smishing messages within the UCI dataset [23]. Various preprocessing
techniques, such as stemming, stop-word removal, and punctuation removal using regex, were
applied for feature extraction. The classifiers included traditional methods like multinomial
NB, SVM, and RF, as well as LSTM variants for deep learning. Notably, the stacked
Bi-LSTM achieved the highest accuracy, surpassing others with scores of 98.8% and 99.09%.
In [24], SMS spam detection was examined using two datasets, with different classifiers
and preprocessing techniques tested. The CNN classifier achieved the highest accuracy,
reaching 99.19% and 98.25% accuracy for the two datasets. Authors in [25] conducted a
UCI dataset study using a one-class SVM for SMS spam detection, a novel approach that
outperformed traditional methods. It served as an anomaly detector, even without labeled
SMS data, achieving an overall accuracy of 98%, with 100% SMS spam detection and a 3%
false positive rate. In [26], a machine learning approach for SMS spam detection focused on
feature extraction and evaluation, leveraging features derived from spam and legitimate
message characteristics to train an averaged neural network model. This method delivered
outstanding results on the UCI dataset, boasting an accuracy of 98.8% and an F-measure of
99.29%. Weka and RapidMiner were applied for spam detection on the UCI dataset in [27].
Weka SVM achieved 99.3% accuracy in just 1.54 s, while K-Means excelled in clustering
with a 2.7 s runtime. RapidMiner SVM achieved 96.64% accuracy in 21 s, with K-Means
taking 37.0 s for results. In [28], an intention-based SMS spam filtering method was developed,
emphasizing dynamic keyword semantics. The model, which utilized 13 predefined
intention labels, contextual embeddings, and supervised learning classifiers, achieved
98.07% accuracy, along with 0.97 precision and recall. The study by [29]
introduced Text Augmentation for Model Improvement (TAMS) for addressing imbalanced
textual data classification. TAMS employed text augmentation by replacing words with
synonyms to create semantically similar messages, significantly enhancing classification
accuracy. The bidirectional LSTM (Bi-LSTM) classifier achieved a high accuracy of 93.34%
and an impressive F1-score of 94.18%. In [30], the UCI spam dataset was evaluated using
three algorithms: back propagation neural network, naive Bayes, and decision tree. Prepro-
cessing steps, including converting text to lowercase, removing punctuation and unique
strings, stemming, and tokenization, were applied. The neural network identified the
top seven smishing SMS features, achieving a final accuracy of 97.40%. The study by [31]
introduced the discrete-hidden Markov model for efficient spam detection, achieving an
impressive 95.9% accuracy. This model, unlike deep learning methods, is not language-
specific and performs well on both Chinese and English datasets. In [32], the Gini Index
metric was employed to investigate the ANN-SCG method for content-based spam SMS
filtering. Experiments showcased ANN-SCG’s ability to effectively filter over a hundred
spam SMS attributes swiftly, reducing memory usage. The research, which utilized datasets
like DIT SMS Spam, Spam Messages Collection, and British English SMS, revealed the
method’s high efficacy, achieving 99.1% accuracy in spam message filtering using just one
hundred features. In [33], a combination of machine learning (NB, LR, RF, GB, SGD) and
deep learning (CNN, LSTM) methods were introduced for spam filtering using UCI spam
datasets. The CNN achieved a remarkable 99.44% accuracy, though the study was focused
exclusively on English text messages.
Summarizing the reviewed papers, as presented in Table 1, the majority of the research
focuses on supervised and deep learning algorithms, with limited exploration of unsu-
pervised learning models. Consequently, our study aims to investigate the performance
of unsupervised and deep semi-supervised models and their applicability in real-world
scenarios, addressing the challenge of obtaining labeled data for training.
3. Feature Generation
In the realm of machine learning and artificial intelligence projects, the initial steps are
pivotal for the success of model implementations [34]. The process of Optical Character
Recognition (OCR) begins with capturing visual data using cameras or scanners, which act
as sensing devices to detect and digitize the textual content embedded within images [35].
These sensors convert the physical properties of light and color into digital signals that
represent the visual patterns of the text [36]. Once the data is captured, OCR technology pro-
cesses the image, identifying and converting the detected text into machine-readable format.
This foundational step is crucial, as it enables the system to transform raw image data into
usable text for further analysis, such as spam detection in our proposed application. These
early stages encompass data cleaning, preprocessing [37], and feature engineering [38].
Data cleaning involves the removal of extraneous elements, such as stop words, numbers,
and spaces, from the raw data. Preprocessing is the phase where data is transformed and
standardized to make it suitable for modeling. Finally, feature engineering entails creating
essential features and evaluating their impact on model outcomes. These steps collectively
lay the foundation for robust and accurate machine learning models. As illustrated in
Figure 1, our research places a strong emphasis on these initial steps, recognizing them as
the cornerstone in determining the ultimate success of each model implementation.
Figure 1. A hierarchical framework for feature generation in the context of the proposed SMS fraud
detection system.
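The data cleaning step described above (removing stop words, numbers, and extra spaces) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `clean_sms` and the tiny `STOP_WORDS` set are assumptions introduced here, and a real pipeline would likely use a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "to", "of", "on", "is", "are", "you", "and"}


def clean_sms(text: str) -> str:
    """Lowercase, drop digits and punctuation, remove stop words, collapse spaces."""
    text = text.lower()
    # Replace everything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no argument also collapses runs of whitespace.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_sms("Congratulations on winning the prize of 2000!"))
```

Stripping digits follows the cleaning rule stated in the text; whether that is desirable for messages whose phone numbers are themselves a signal is a design choice this sketch does not settle.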
3.1. Dataset
In contrast to previous studies that have primarily relied on the UCI SMS spam
dataset, our research took a novel approach by collecting real user-reported spam messages
captured as screenshot images using scanners. These scanned images, sourced from online
platforms, were stored locally and provided a diverse and realistic dataset for our analysis.
The spam messages varied in content and origin, offering a rich resource for evaluating
our models. For classification, we organized the SMS spam into three distinct categories:
those containing an email ID, messages with a website link, and messages that included a
phone number, as visually represented in the dataset in Figure 2. To transform these image-
based messages into machine-readable text, we harnessed OCR technology, leveraging the
Python-Tesseract library. OCR is a sophisticated tool that detects and converts text within
images into a format that computers can readily interpret. Notably, we used an open-source
OCR engine maintained by Google [39]. Subsequently, we merged this new dataset of
1500 SMS messages with the UCI SMS spam dataset, which contained 5574 messages,
culminating in a consolidated dataset of 7074 messages, as detailed in Table 2. This unified
dataset contains a collection of English SMS text messages with varying sentence lengths,
including both text and numerical content. Each record in the dataset is accompanied by a
label, where ‘1’ designates ‘ham’ (non-spam) and ‘0’ indicates ‘spam’. This dataset served as
the foundation for our model development and evaluation. For our unsupervised learning
experiments, we fed the data without labels into the models. In the case of semi-supervised
learning, we incorporated 10% of labeled data from the entire corpus of 5000 messages
sourced from the UCI spam dataset, along with 90% of the newly collected SMS data from
real-time users.
Figure 2. Illustrative example of a simulated SMS containing an email address, hyperlink, and
contact number.
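As a rough sketch of the OCR step and the three spam categories described in this section, the snippet below reads text from a screenshot with Python-Tesseract and flags whether a message contains an email ID, a website link, or a phone number. The function names and the regular expressions are assumptions made for illustration; the source does not state how the categories were detected. The OCR helper needs the Tesseract binary plus the `pytesseract` and `Pillow` packages, which are imported lazily so the pattern helper can be used without them.

```python
import re


def extract_text_from_screenshot(image_path: str) -> str:
    """Run Tesseract OCR on one screenshot and return the raw text."""
    # Lazy imports: only needed when OCR is actually performed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


# Hypothetical, simplified patterns for the three categories (assumptions).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\b\d{10,13}\b")


def spam_indicators(text: str) -> dict:
    """Report which of the three category markers appear in the message."""
    return {
        "email": bool(EMAIL_RE.search(text)),
        "link": bool(URL_RE.search(text)),
        "phone": bool(PHONE_RE.search(text)),
    }


sample = "To stop receiving messages, type stop www.morereplayport.co.uk Customer Support 0987617205546"
print(spam_indicators(sample))
```

In practice the OCR output would be passed through `spam_indicators` (or a trained model) after cleaning, since Tesseract output often contains stray line breaks and misread characters.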
Text | Label
Hi, How are you. When are you planning to meet me | 1
Congratulations on winning the prize of 2000. To stop receiving messages, type stop www.morereplayport.co.uk, accessed on 8 August 2024 Customer Support 0987617205546 | 0
Good Morning. Can we discuss this issue after sometime instaed of now | 1
Service announcement from BRP. You have received a BRP card. Please call 07046744435 right away to schedule delivery between 8 a.m. and 8 p.m. | 0
$$W_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) \tag{1}$$
corpus that include a specific term. For a corpus consisting of “Text1” and “Text2” with a
total of two documents, the document frequency of “part” is assessed. For a comprehensive
understanding of TF_IDF and its applications, please refer to Tables 3 and 4.
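A small worked example may help make Equation (1) concrete. The sketch below computes W = tf × log(N / df) for a two-document corpus named "Text1" and "Text2", as in the text. The actual contents of the two documents are not given in the source, so the sentences used here are hypothetical, and the use of a raw term count for tf is also an assumption.

```python
import math

# Hypothetical two-document corpus (contents are assumptions for illustration).
corpus = {
    "Text1": "this is the first part of the message".split(),
    "Text2": "this is the second message".split(),
}


def tf_idf(term: str, doc_name: str) -> float:
    """W_ij = tf_ij * log(N / df_i), following Equation (1)."""
    tf = corpus[doc_name].count(term)  # raw count of the term in the document
    df = sum(1 for words in corpus.values() if term in words)  # documents containing it
    if df == 0:
        return 0.0  # term never occurs, so it carries no weight
    return tf * math.log(len(corpus) / df)


print(tf_idf("part", "Text1"))     # appears in one of two documents: 1 * log(2)
print(tf_idf("message", "Text1"))  # appears in both documents: log(1) = 0
```

Library implementations can differ from this plain form. For example, scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each document vector by default, so its numbers will not match the values above exactly.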
Stage | Value
Before tokenizer | me also da i fel yesterday night wait til day night dear
After tokenizer | [29, 253, 319, 3, 384, 354, 200, 215, 355, 78, 200, 102]
Encoded_train | [216, 1085, 1086, 123, 1, 1633, 320, 1634, 3, 79, 385, 2, 90, 85, 3, 40, 47]
Padded_train_pre | [0 0 0 0 0 0 0 0 216 1085 1086 123 1 1633 320 1634 3 79 385 2 90 85 3 40 47]
Padded_train_post | [216 1085 1086 123 1 1633 320 1634 3 79 385 2 90 85 3 40 47 0 0 0 0 0 0 0 0]
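The pre- and post-padded rows above can be reproduced with a few lines of plain Python. The helper below is a simplified stand-in written for this illustration, not the library routine the authors used; Keras offers the same behaviour through `pad_sequences` with `padding='pre'` or `padding='post'` and a `maxlen` argument. The target length of 25 is inferred from the padded rows.

```python
def pad_sequence(seq, maxlen, padding="pre", value=0):
    """Pad (or truncate) one integer sequence to exactly maxlen entries."""
    seq = list(seq)
    if len(seq) > maxlen:
        # Simplification: truncate on the same side that would be padded.
        seq = seq[len(seq) - maxlen:] if padding == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


encoded_train = [216, 1085, 1086, 123, 1, 1633, 320, 1634, 3, 79, 385, 2, 90, 85, 3, 40, 47]

padded_pre = pad_sequence(encoded_train, maxlen=25, padding="pre")
padded_post = pad_sequence(encoded_train, maxlen=25, padding="post")
print(padded_pre)   # eight zeros, then the 17 encoded tokens
print(padded_post)  # the 17 encoded tokens, then eight zeros
```

Pre-padding keeps the real tokens at the end of the sequence, which is often preferred for recurrent models because the final time steps then carry content rather than filler.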
tized precision over recall, excelling at detecting legitimate messages, but struggling with
spam identification.
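The trade-off between precision and recall mentioned here can be illustrated with a short metric calculation. The sketch below treats spam (label 0, following the dataset's labelling convention) as the positive class; that choice, the function name, and the toy label vectors are assumptions made for this example and are not results from the study.

```python
def binary_metrics(y_true, y_pred, positive=0):
    """Accuracy, precision, recall and F1 for one chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 4 spam (0) and 6 ham (1); the model misses half of the spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(binary_metrics(y_true, y_pred))
```

In this toy case every message flagged as spam really is spam (precision 1.0), yet only half of the spam is caught (recall 0.5), which is the pattern of favouring precision over recall.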
Runs | K-Means Vectorizer | K-Means Transformer | NMF | PCA | Gaussian_Matrix
1 | 90.51% | 88.24% | 69.81% | 71.87% | 88.31%
2 | 91.88% | 88.24% | 71.94% | 68.18% | 91.47%
3 | 92.30% | 88.24% | 71.66% | 68.18% | 88.03%
4 | 91.06% | 88.24% | 66.78% | 71.87% | 82.39%
5 | 90.51% | 88.24% | 71.94% | 71.87% | 87.14%
6 | 91.27% | 88.24% | 71.73% | 71.87% | 86.59%
7 | 91.06% | 88.24% | 71.32% | 71.87% | 86.31%
8 | 92.43% | 88.24% | 71.87% | 71.87% | 92.37%
9 | 89.96% | 88.24% | 66.71% | 71.87% | 91.33%
10 | 92.57% | 88.24% | 72.01% | 71.87% | 87.28%
11 | 90.92% | 88.24% | 71.80% | 71.87% | 86.67%
12 | 89.13% | 88.24% | 72.01% | 71.87% | 85.35%
13 | 91.06% | 88.24% | 72.01% | 71.87% | 90.30%
14 | 90.44% | 88.24% | 69.81% | 71.87% | 91.47%
15 | 90.30% | 88.24% | 72.01% | 71.87% | 92.37%
16 | 90.92% | 88.24% | 72.01% | 71.87% | 92.50%
17 | 89.41% | 88.24% | 71.80% | 71.87% | 88.10%
18 | 91.54% | 88.24% | 71.94% | 71.87% | 89.61%
19 | 90.78% | 88.24% | 72.01% | 71.87% | 92.37%
20 | 92.23% | 88.24% | 71.80% | 71.87% | 90.92%
Accuracy | 91.01% | 88.24% | 71.15% | 71.50% | 89.04%
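The final row of the table appears to be the mean over the 20 runs. As a quick check under that assumption, the sketch below averages the K-Means Vectorizer column using the values listed above; the variable name is introduced here for illustration.

```python
from statistics import mean

# Per-run accuracy (%) of the K-Means Vectorizer column, copied from the table.
kmeans_vectorizer_runs = [
    90.51, 91.88, 92.30, 91.06, 90.51, 91.27, 91.06, 92.43, 89.96, 92.57,
    90.92, 89.13, 91.06, 90.44, 90.30, 90.92, 89.41, 91.54, 90.78, 92.23,
]

average = mean(kmeans_vectorizer_runs)
print(f"runs = {len(kmeans_vectorizer_runs)}")
print(f"mean = {average:.2f}%")  # mean = 91.01%
print(f"range = {min(kmeans_vectorizer_runs):.2f}% to {max(kmeans_vectorizer_runs):.2f}%")
```

The spread between the lowest and highest run (89.13% to 92.57%) shows how much the result depends on initialization, which is why reporting an average over repeated runs is more informative than a single run.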
Tokenizer and pad sequence techniques, which transform text data into a format suitable
for neural network processing.
Post-experimentation, we saved the model and feature extraction technique using the
Pickle library, ensuring ease of sharing and reusability. We then evaluated the model’s
prediction performance on unseen data, where the RNN-Flatten algorithm led with the
highest accuracy of 91%, surpassing Bi-LSTM (86%) and LSTM (84%).
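Saving and reloading artifacts with Pickle, as described above, might look like the following sketch. The dictionary stands in for a fitted feature extractor and model; its keys, values, and the file name are hypothetical, since the source does not list what was stored.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins for a fitted vectorizer and a trained classifier.
artifacts = {
    "vocabulary": {"prize": 0, "call": 1, "free": 2},
    "model_name": "rnn_flatten",
    "max_len": 8,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sms_artifacts.pkl"
    # Serialize to disk so the same objects can be reused at prediction time.
    with path.open("wb") as fh:
        pickle.dump(artifacts, fh)
    # Load them back, as a deployed application would on start-up.
    with path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored == artifacts)
```

Saving the feature extractor together with the model matters because predictions are only valid when new messages are encoded with the same vocabulary used in training. Pickle files should only be loaded from trusted sources, as unpickling can execute arbitrary code.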
During model training, we found that a message_length = 8 for sequence padding
and word embedding produced the best results. Additionally, we observed that the RNN
model with a Flatten layer consistently outperformed LSTM and Bi-LSTM
across all 20 iterations, as depicted in Figure 6. These results
highlight the efficacy of the RNN-Flatten model in spam SMS detection and showcase the
importance of architecture choice in deep semi-supervised learning for text classification.
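To show what flattening does in a model of this kind, the sketch below traces output shapes and parameter counts through one plausible layer stack: Embedding, a simple recurrent layer that returns its full output sequence, Flatten, and a single-unit dense output. The stack itself and the sizes used (vocabulary of 5000, embedding dimension 16, 32 recurrent units) are assumptions, as the text only specifies message_length = 8.

```python
def rnn_flatten_summary(vocab_size, embed_dim, rnn_units, message_length=8):
    """Shapes and parameter counts for Embedding -> SimpleRNN -> Flatten -> Dense(1)."""
    # Flatten turns the (time steps, units) output into one long feature vector.
    flat_features = message_length * rnn_units
    return {
        "embedding": {"shape": (message_length, embed_dim),
                      "params": vocab_size * embed_dim},
        # Simple RNN: input weights + recurrent weights + bias, per unit.
        "simple_rnn": {"shape": (message_length, rnn_units),
                       "params": rnn_units * (embed_dim + rnn_units + 1)},
        "flatten": {"shape": (flat_features,), "params": 0},
        "dense": {"shape": (1,), "params": flat_features + 1},
    }


for layer, info in rnn_flatten_summary(5000, 16, 32).items():
    print(layer, info)
```

With these assumed sizes, the flattened vector has 8 × 32 = 256 features, so the output layer sees the recurrent state at every position in the message rather than only the last one.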
5. Discussion
In our experiments, we evaluated multiple models for spam detection, utilizing preprocessing
techniques, word embeddings, and a combination of machine learning and
deep learning models. Departing from prior studies that primarily used the UCI spam
dataset, we incorporated real-world, user-reported spam messages, extracted from images
using Optical Character Recognition (OCR) technology. This provided a more diverse
and realistic dataset for our analysis. The results indicated that the RNN-Flatten model
outperformed the others, achieving 94.13% accuracy, compared to 91.01% with the
K-means model, as shown in Figure 7. This disparity underscores the
relative strengths and limitations of each model in handling diverse and complex data.
learning architecture, with its multiple layers and non-linear activation functions, allows
the RNN-Flatten model to effectively interpret and classify messages based on intricate
features, including context and sequential information. Additionally, the model’s capacity
to handle and learn from the sequential nature of SMS content provides it with a significant
advantage over simpler clustering algorithms like K-means.
K-means Model: The K-means model, an unsupervised learning technique, clusters
data based on similarity measures. It demonstrated robust performance, reaching
91.01% accuracy using entirely unlabelled data. However, although it identifies clusters
well, it lacks the depth of feature learning that deep learning models like
RNN-Flatten offer. K-means is limited by its reliance on a predefined number of clusters and
may struggle with the complexity of nuanced text data, which affects its classification
performance in comparison to models that can learn complex patterns.
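Because K-means assigns arbitrary cluster IDs, reporting an accuracy figure requires mapping clusters to the ’ham’ and ’spam’ labels afterwards. The source does not say how this mapping was done; the sketch below shows one common convention for the two-cluster case, which is to score both possible mappings and keep the better one. The function name and toy data are hypothetical.

```python
def clustering_accuracy(labels, clusters):
    """Accuracy of a 2-cluster assignment under the better id-to-label mapping.

    Both inputs are sequences of 0/1 values of equal length.
    """
    direct = sum(1 for y, c in zip(labels, clusters) if y == c)
    flipped = len(labels) - direct  # matches if cluster ids are swapped
    return max(direct, flipped) / len(labels)


# Toy data: 1 = ham, 0 = spam; the cluster ids happen to be inverted.
labels = [1, 1, 1, 0, 0, 1, 0, 1]
clusters = [0, 0, 0, 1, 1, 0, 0, 0]
print(clustering_accuracy(labels, clusters))
```

The labels are used only for scoring here, not for fitting, so the clustering itself remains unsupervised.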
1. First, each SMS image is individually classified, allowing the system to handle messages
one at a time. The user can then select from a variety of models to analyze the
image, choosing the model most suitable for their
needs, as shown in Figures 9 and 10, respectively.
2. Once the SMS is submitted, the system initiates preprocessing to prepare the data
for analysis. Following preprocessing, the selected model’s feature extraction techniques
and classifier are applied to the SMS, enabling the system to assess
its content.
3. Finally, the application displays the result, indicating whether the message is classified
as spam or ham, with the accuracy given by the selected model as shown in figure.
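The three steps above can be sketched as a small dispatch function: the user picks a model by name, the message is preprocessed, and the chosen classifier returns a label. Everything named here is hypothetical. In particular, the two entries in `MODELS` are keyword rules standing in for trained models, and the input is assumed to be text already extracted from the image by OCR.

```python
from typing import Callable, Dict


def preprocess(text: str) -> str:
    """Lowercase and collapse whitespace before classification."""
    return " ".join(text.lower().split())


# Hypothetical stand-in classifiers (simple rules, not trained models).
def keyword_model(text: str) -> str:
    return "spam" if any(w in text for w in ("prize", "winner", "www.")) else "ham"


def always_ham_model(text: str) -> str:
    return "ham"


MODELS: Dict[str, Callable[[str], str]] = {
    "keyword_demo": keyword_model,
    "always_ham_demo": always_ham_model,
}


def classify_sms(text: str, model_name: str) -> str:
    """Steps 1-3: select a model, preprocess the message, return the label."""
    if model_name not in MODELS:
        raise ValueError(f"unknown model: {model_name}")
    return MODELS[model_name](preprocess(text))


print(classify_sms("Congratulations on winning the PRIZE of 2000", "keyword_demo"))
print(classify_sms("Hi, How are you", "keyword_demo"))
```

Keeping the models in a registry keyed by name mirrors the model-selection step in the application and makes it straightforward to add another classifier without changing the dispatch code.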
Author Contributions: A.S., E.Q.S. and S.B. conceived the presented idea. A.S., E.Q.S., S.B., F.S.,
A.A.A. and W.A.J. developed the theory and performed the computation. A.S. planned and carried
out the simulations. A.S., E.Q.S., S.B., F.S., A.A.A. and W.A.J. verified the analytical
method. A.S. wrote the draft of the manuscript with input from all authors. A.S., E.Q.S., S.B., F.S.,
A.A.A. and W.A.J. revised and edited the manuscript. E.Q.S. and S.B. supervised the project. All
authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Available upon request.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Samad, S.R.A.; Ganesan, P.; Rajasekaran, J.; Radhakrishnan, M.; Ammaippan, H.; Ramamurthy, V. SmishGuard: Leveraging
Machine Learning and Natural Language Processing for Smishing Detection. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 11. [CrossRef]
2. Njuguna, D.N.; Kamau, J.; Kaburu, D. A Review of Smishing Attaks Mitigation Strategies. Int. J. Comput. Inf. Technol. 2022, 11,
9–13. [CrossRef]
3. Haber, M.J.; Chappell, B.; Hills, C. Attack Vectors. In Cloud Attack Vectors: Building Effective Cyber-Defense Strategies to Protect
Cloud Resources; Apress: Berkeley, CA, USA, 2022; pp. 117–219.
4. Vosen, D.J. An Exploration of Cyberpsychology Strategies Addressing Unintentional Insider Threats Through Undergraduate
Education: A Qualitative Study. Ph.D. Thesis, Colorado Technical University, Springs, CO, USA, 2021.
5. McLennan, M. The Global Risks Report 2022, 17th ed.; World Economic Forum: Cologny, Switzerland, 2022.
6. Julis, M.R.; Alagesan, S. Spam Detection in SMS Using Machine Learning through Textmining. Int. J. Sci. Technol. Res. 2020, 9, 2.
7. Barrera, D.; Naranjo, V.; Fuertes, W.; Macas, M. Literature Review of SMS Phishing Attacks: Lessons, Addresses, and Future
Challenges. In Proceedings of the International Conference on Advanced Research in Technologies, Information, Innovation and
Sustainability, Madrid, Spain, 18–20 October 2023; Springer: Cham, Switzerland, 2024; pp. 191–204.
8. Tiwari, A. Supervised Learning: From Theory to Applications. In Artificial Intelligence and Machine Learning for EDGE Computing;
Academic Press: Cambridge, MA, USA, 2022; pp. 23–32.
9. Al-Qahtani, A.F.; Cresci, S. The COVID-19 Scamdemic: A Survey of Phishing Attacks and Their Countermeasures during
COVID-19. IET Inf. Secur. 2022, 16, 324–345. [CrossRef]
10. Akinyelu, A.A. Advances in Spam Detection for Email Spam, Web Spam, Social Network Spam, and Review Spam: ML-Based
and Nature-Inspired-Based Techniques. J. Comput. Secur. 2021, 29, 473–529. [CrossRef]
11. Wickramasinghe, I.; Kalutarage, H. Naive Bayes: Applications, Variations and Vulnerabilities: A Review of Literature with Code
Snippets for Implementation. Soft Comput. 2021, 25, 2277–2293. [CrossRef]
12. Genuer, R.; Poggi, J.-M. Random Forests; Springer: Cham, Switzerland, 2020.
13. Abayomi, A.O.; Misra, S.; Abayomi, A.A.; Odusami, M. A review of soft techniques for SMS spam classification: methods,
approaches and applications. J. Eng. Appl. Artif. Intell. 2019, 86, 197–212. [CrossRef]
14. Taha, K. Semi-Supervised and Un-Supervised Clustering: A Review and Experimental Evaluation. Inf. Syst. 2023, 114, 102178.
[CrossRef]
15. Kumarasiri, W.L.T.T.N.; Siriwardhana, M.K.J.C.; Suraweera, S.A.D.S.L.; Senarathne, A.N.; Harshanath, S.M.B. Cybersmish: A
Proactive Approach for Smishing Detection and Prevention Using Machine Learning. In Proceedings of the 2023 7th International
Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Kirtipur, Nepal, 11–13 October 2023; pp. 210–217.
16. Shahra, E.Q.; Basurra, S.; Wu, W. Real-Time Multi-Class Classification of Water Quality Using MLP and Ensemble Learning. In
Proceedings of the International Congress on Information and Communication Technology; Springer: Singapore, 2024; pp. 481–491.
17. Usmani, U.A.; Happonen, A.; Watada, J. A Review of Unsupervised Machine Learning Frameworks for Anomaly Detection in
Industrial Applications. In Intelligent Computing; Springer: Cham, Switzerland, 2022; pp. 158–189.
18. Patel, E.; Kushwaha, D.S. Clustering Cloud Workloads: K-Means vs Gaussian Mixture Model. Procedia Comput. Sci. 2020, 171,
158–167. [CrossRef]
19. Rokach, L.; Maimon, O. Clustering Methods. In Data Mining and Knowledge Discovery Handbook; Springer: Boston, MA, USA, 2005;
pp. 321–352.
20. Slijepcevic, I.V.; Scaife, A.M.M.; Walmsley, M.; Bowles, M.; Wong, O.I.; Shabala, S.S.; Tang, H. Radio Galaxy Zoo: Using
Semi-Supervised Learning to Leverage Large Unlabelled Data Sets for Radio Galaxy Classification Under Data Set Shift. Mon.
Not. R. Astron. Soc. 2022, 514, 2599–2613. [CrossRef]
21. Mansoor, R.A.Z.A.; Jayasinghe, N.D.; Muslam, M.M.A. A Comprehensive Review on Email Spam Classification Using Machine
Learning Algorithms. In Proceedings of the 2021 International Conference on Information Networking (ICOIN), Jeju Island,
Republic of Korea, 13–16 January 2021; pp. 327–332.
22. Sharaff, A.; Pathak, V.; Paul, S.S. Deep Learning-Based Smishing Message Identification Using Regular Expression Feature
Generation. Expert Syst. 2022, 40, e13153. [CrossRef]
23. Shahra, E.Q.; Wu, W.; Basurra, S.; Rizou, S. Deep Learning for Water Quality Classification in Water Distribution Networks. In
Proceedings of the International Conference on Engineering Applications of Neural Networks, Crete, Greece, 25–27 June 2021;
pp. 153–164.
24. Gupta, M.; Bakliwal, A.; Agarwal, S.; Mehndiratta, P. A Comparative Study of Spam SMS Detection Using Machine Learning
Classifiers. In Proceedings of the 2018 Eleventh International Conference on Contemporary Computing (IC3), Noida, India, 2–4
August 2018; pp. 1–7.
25. Yerima, S.Y.; Bashar, A. Semi-Supervised Novelty Detection with One Class SVM for SMS Spam Detection. In Proceedings of the
2022 29th International Conference on Systems, Signals and Image Processing (IWSSIP), Sofia, Bulgaria, 1–3 June 2022; pp. 1–4.
26. Sheikhi, S.; Kheirabadi, M.T.; Bazzazi, A. An Effective Model for SMS Spam Detection Using Content-Based Features and
Averaged Neural Network. Int. J. Eng. 2020, 33, 221–228.
27. Zainal, K.; Sulaiman, N.F.; Jali, M.Z. An Analysis of Various Algorithms for Text Spam Classification and Clustering Using
RapidMiner and Weka. Int. J. Comput. Sci. Inf. Secur. 2015, 13, 66.
28. Oswald, C.; Simon, S.E.; Bhattacharya, A. SpotSpam: Intention Analysis Driven SMS Spam Detection Using BERT Embeddings.
ACM Trans. Web (TWEB) 2022, 16, 1–27. [CrossRef]
29. Jouban, M.Q.; Farou, Z. TAMS: Text Augmentation Using Most Similar Synonyms for SMS Spam Filtering. 2022. Available online:
https://ceur-ws.org/Vol-3226/paper4.pdf (accessed on 8 August 2024).
30. Mishra, S.; Soni, D. Implementation of ‘Smishing Detector’: An Efficient Model for Smishing Detection Using Neural Network.
SN Comput. Sci. 2022, 3, 1–13. [CrossRef]
31. Zhang, B.; Zhao, G.; Feng, Y.; Zhang, X.; Jiang, W.; Dai, J.; Gao, J. Behavior Analysis Based SMS Spammer Detection in Mobile
Communication Networks. In Proceedings of the 2016 IEEE First International Conference on Data Science in Cyberspace (DSC),
Changsha, China, 13–16 June 2016; pp. 538–543.
32. Waheeb, W.; Ghazali, R.; Deris, M.M. Content-Based SMS Spam Filtering Based on the Scaled Conjugate Gradient Backpropagation
Algorithm. In Proceedings of the 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD),
Zhangjiajie, China, 15–17 August 2015; pp. 675–680.
33. Roy, P.K.; Singh, J.P.; Banerjee, S. Deep Learning to Filter SMS Spam. Future Gener. Comput. Syst. 2020, 102, 524–533. Available
online: https://www.sciencedirect.com/science/article/pii/S0167739X19306879 (accessed on 8 August 2024). [CrossRef]
34. Shahra, E.Q.; Wu, W.; Basurra, S.; Aneiba, A. Intelligent Edge-Cloud Framework for Water Quality Monitoring in Water
Distribution System. Water 2024, 16, 196. [CrossRef]
35. Nair, A.R.; Tripathy, V.D.; Lalitha Priya, R.; Kashimani, M.; Janthalur, G.A.N.; Ansari, N.J.; Jurcic, I. A Smarter Way to Collect and
Store Data: AI and OCR Solutions for Industry 4.0 Systems. In Topics in Artificial Intelligence Applied to Industry 4.0; Wiley Telecom:
Hoboken, NJ, USA, 2024; pp. 271–288.
36. Manovich, L. Computer vision, human senses, and language of art. AI SOCIETY 2021, 36, 1145–1152. [CrossRef]
37. Tabassum, A.; Patil, R.R. A survey on text pre-processing & feature extraction techniques in natural language processing. Int. Res.
J. Eng. Technol. (IRJET) 2020, 7, 4864–4867.
38. Dong, G.; Liu, H. Feature Engineering for Machine Learning and Data Analytics; CRC Press: Boca Raton, FL, USA, 2018.
39. Patel, C.; Patel, A.; Patel, D. Optical character recognition by open source OCR tool tesseract: A case study. Int. J. Comput. Appl.
2012, 55, 50–56. [CrossRef]
40. Guyon, I.; Elisseeff, A. An introduction to feature extraction. In Feature Extraction: Foundations and Applications; Springer:
Berlin/Heidelberg, Germany, 2006; pp. 1–25.
41. Karamizadeh, S.; Abdullah, S.M.; Manaf, A.A.; Zamani, M.; Hooman, A. An overview of principal component analysis. J. Signal
Inf. Process. 2020, 4. [CrossRef]
42. Imani, M.; Montazer, G.A. Email Spam Detection Using Linear Discriminant Analysis Based on Clustering. CSI J. Comput. Sci.
Eng. 2017, 15, 22–30.
43. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng.
Sci. 2016, 374, 20150202. [CrossRef] [PubMed]
44. Wang, X.S.; Ryoo, J.H.J.; Bendle, N.; Kopalle, P.K. The role of machine learning analytics and metrics in retailing research. J. Retail.
2021, 97, 658–675. [CrossRef]
45. Ouali, Y.; Hudelot, C.; Tami, M. An overview of deep semi-supervised learning. arXiv 2020, arXiv:2006.05278.
46. Van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.