
Detecting Phishing in Text Messages

This research paper discusses the detection of phishing in text messages, specifically SMS phishing or smishing, which is a growing cybersecurity threat. It evaluates various machine learning models, including SVM and XGBoost, using feature extraction techniques like GloVe and PCA, to improve detection accuracy. The findings indicate that XGBoost with GloVe + PCA achieves the highest accuracy, highlighting the effectiveness of combining word embeddings with dimensionality reduction for phishing detection.


Volume-3, Issue-3, March 2025, International Journal of Modern Science and Research Technology

ISSN NO-2584-2706

Detecting Phishing in Text Messages


Vanshika Sharma, Srishti Verma, Aman Singh, Dr. Tanu Gupta

Abstract:

Data security issues are now widespread. Attackers have become highly skilled at using their knowledge to break into other people's systems and steal information. Phishing is one of the methods used to acquire such information. It is a cybercrime carried out through emails, telephone calls, and text messages that targets personally identifiable information, banking details, credit card details, and passwords. Phishing is mainly a form of online identity theft: the phisher uses social engineering to steal the victim's personal data and account details. This paper gives an overview of phishing attacks, the types of phishing through which attacks are performed, and approaches to their detection and prevention.

Introduction:

Phishing is the act of attempting to acquire information such as usernames, passwords, and credit card details by masquerading as a trustworthy entity in an electronic communication. Communications purporting to be from popular social websites, auction sites, online payment processors, or IT administrators are commonly used to lure the unsuspecting public. Phishing emails may contain links to websites that are infected with malware.

Phishing is an example of social engineering and is mainly carried out by email. In email phishing, the attacker sends the user a link by email, for example one asking for bank details or other personal information. The user follows the link and fills in the requested details, and the attacker then obtains all of the information the user entered. This is how a typical phishing attack works.

Related work:

SMS phishing, also known as smishing, is a deceptive practice that tricks individuals into revealing sensitive information through fraudulent SMS messages. Attackers use various techniques such as impersonation, fake promotions, malicious links, and urgent requests to manipulate victims into clicking phishing links or sharing confidential data. Smishing is a growing cybersecurity threat, targeting financial institutions, businesses, and individuals worldwide.

Several studies have explored machine learning approaches for detecting phishing SMS. Researchers have employed classification models such as Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), and XGBoost to distinguish phishing SMS from legitimate messages. Feature extraction techniques such as TF-IDF, Word2Vec, and GloVe embeddings have been widely used to enhance model performance.

Gupta et al. (2020) demonstrated that Random Forest achieved higher accuracy than SVM when using TF-IDF features. Similarly, Sharma et al. (2021) compared Logistic Regression and XGBoost, showing that GloVe-based features improved classification accuracy. However, short text length and lack of contextual information in SMS remain major challenges in phishing detection. Additionally, researchers have experimented with dimensionality reduction techniques such as PCA to optimize feature representation and improve classification efficiency.

Despite these advancements, challenges such as evasive phishing techniques, multilingual SMS phishing, and adversarial attacks require further research.
IJMSRT24MAR044 www.ijmsrt.com 335
DOI: https://doi.org/10.5281/zenodo.15113088
Our study builds upon existing work by evaluating SVM and XGBoost classifiers using GloVe and GloVe + PCA feature extraction to improve SMS phishing detection.

Proposed methodology:

In this research, we use the Python libraries nltk, numpy, pandas, scipy, gensim, scikit-learn, and spacy for machine learning model development. They provide tools for data preparation, such as word tokenization and word embedding. Word tokenization converts text input into sequential data in which each word is represented by its index value. Word embedding then maps each element of the sequence to a multi-dimensional vector. After data preparation, we train models based on the SVM, Logistic Regression, Random Forest, and XGBoost algorithms. We then evaluate the models and compare their performance with one another. The workflow of the framework is as follows.

Figure: Workflow of the framework. Dataset → Extracting the Data → Data Cleaning → Feature Engineering → Model Building (using Naïve Bayes technique) → Model → Prediction (Ham / Spam)

• Dataset Collection: Gather a dataset containing SMS messages labeled as phishing (spam) and legitimate (ham).
• Extracting the Data: Load and preprocess the dataset to make it suitable for further processing.
• Data Cleaning: Convert text to lowercase. Remove special characters, numbers, and unnecessary symbols. Remove stopwords and apply tokenization.
• Feature Engineering: Convert text data into numerical format using feature extraction techniques. Use TF-IDF, GloVe embeddings, or PCA for dimensionality reduction.
• Model Building: Train machine learning models for classification. Use algorithms such as Support Vector Machine (SVM) and XGBoost.
• Model Evaluation: Assess the model performance using evaluation metrics such as accuracy, precision, recall, and F1-score.
• Prediction: Deploy the trained model to classify incoming SMS messages as ham (legitimate) or spam (phishing).
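The data cleaning and feature engineering steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the stopword list and the three-dimensional embedding table are hypothetical stand-ins for the nltk/spacy stopword lists and the pretrained GloVe vectors, and averaging the word vectors into one message vector is an assumed pooling choice that the paper does not specify.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would more
# likely use the list shipped with nltk or spacy.
STOPWORDS = {"a", "an", "the", "to", "you", "your", "is", "are", "and", "of", "for", "have"}

# Toy 3-dimensional vectors standing in for pretrained GloVe embeddings
# (assumption: real GloVe vectors are much larger, e.g. 50 to 300 dimensions).
TOY_EMBEDDINGS = {
    "won": [0.9, 0.1, 0.0],
    "cash": [0.8, 0.2, 0.1],
    "click": [0.7, 0.0, 0.3],
    "meeting": [0.0, 0.9, 0.2],
    "tomorrow": [0.1, 0.8, 0.1],
}


def clean_and_tokenize(text):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes digits, punctuation, symbols
    return [tok for tok in text.split() if tok not in STOPWORDS]


def message_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


tokens = clean_and_tokenize("You have WON $1000 cash!!! Click now")
vector = message_vector(tokens, TOY_EMBEDDINGS)
print(tokens)
print(vector)
```

Tokens without an embedding (here "now") are simply skipped when averaging, which is one common way to handle out-of-vocabulary words in short messages.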

Datasets

In this experiment, we use an SMS spam dataset published as mohitgupta-1O1/Kaggle-SMS-Spam-Collection-Dataset. This dataset consists of approximately 5,574 records. It contains SMS text messaging conversations in the English language, which include text and numbers in sentences of different lengths. All records in this dataset are already labeled. The spam messages are labelled as 1 (747 records) and the normal messages are labelled as 0 (4,825 records). An example of the dataset is illustrated in the accompanying figure.
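The class counts quoted above imply a noticeably imbalanced dataset, which matters when reading accuracy figures. The short calculation below uses only the two counts stated in the text. Note that they sum to 5,572, slightly below the approximate total of 5,574. The "always ham" baseline is a hypothetical reference point and not a model from the paper.

```python
# Class counts as reported in the text above.
spam_count = 747
ham_count = 4825

total = spam_count + ham_count
spam_share = spam_count / total
# Accuracy of a trivial classifier that labels every message as ham (0).
majority_baseline_accuracy = ham_count / total

print(f"total labelled records: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Because a classifier that never flags spam would already be right on roughly 87% of messages, precision, recall, and F1-score on the spam class are more informative than accuracy alone.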


Models and Algorithms Used:

1. Support Vector Machine (SVM):
SVM is a powerful supervised learning algorithm that works by finding the optimal hyperplane to separate different classes. It is effective for high-dimensional text data. It uses the kernel trick to map data that is not linearly separable into a higher-dimensional space.
• Advantage: Works well with small to medium-sized datasets and handles text classification efficiently.
• Limitation: Computationally expensive for large datasets.

2. Naïve Bayes (NB):
Naïve Bayes is a probabilistic classifier based on Bayes' theorem with an assumption of independence between features. It is commonly used for spam detection due to its simplicity and efficiency. It uses term frequency and conditional probabilities to classify SMS messages.
• Advantage: Fast and performs well even with small datasets.
• Limitation: Assumption of feature independence may not always hold in real-world text data.

3. Random Forest (RF):
Random Forest is an ensemble learning method that combines multiple decision trees to improve classification accuracy. It works by aggregating predictions from multiple trees to reduce overfitting. It is suitable for handling non-linear relationships in data.
• Advantage: Provides high accuracy and robustness to noisy data.
• Limitation: Can be computationally expensive and slow for very large datasets.

4. Logistic Regression (LR):
Logistic Regression is a linear model used for binary classification tasks. It computes the probability of an SMS being phishing or legitimate using the sigmoid function. It works well when the classes are linearly separable.
• Advantage: Simple, interpretable, and effective for text classification.
• Limitation: May not perform well on complex, non-linear relationships.

5. XGBoost (Extreme Gradient Boosting):
XGBoost is a boosting algorithm that improves classification performance by training weak models iteratively. It uses gradient boosting to minimize errors and enhance model accuracy. It handles missing data and large-scale datasets efficiently.
• Advantage: Highly efficient, scalable, and outperforms traditional models in many text classification tasks.
• Limitation: Requires careful hyperparameter tuning to avoid overfitting.
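To make the Naïve Bayes description above concrete, here is a minimal word-count sketch of how term frequencies and conditional probabilities combine into a decision. It is an illustration under stated assumptions and not the implementation used in the paper: the four training messages are invented, add-one (Laplace) smoothing is an assumed choice, and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter


def train_nb(messages, labels):
    """Collect per-class word counts, class counts, and the vocabulary."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict_nb(text, word_counts, class_counts, vocab):
    """Return the class (0 = ham, 1 = spam) with the higher log-posterior."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Conditional word probability with add-one smoothing (assumption).
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy training data, not taken from the paper's dataset.
messages = [
    "win cash prize now",
    "claim free prize now",
    "see you at lunch",
    "are you free for lunch",
]
labels = [1, 1, 0, 0]

model = train_nb(messages, labels)
print(predict_nb("free cash prize", *model))
print(predict_nb("see you at lunch", *model))
```

Summing log-probabilities word by word is exactly where the independence assumption enters: each word contributes to the score without regard to its neighbours.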
Experiment and Results
This section presents the findings of the proposed framework in this study. The experiments evaluate the performance of different machine learning models, including SVM, Naïve Bayes, Random Forest, Logistic Regression, and XGBoost. The models are analyzed and compared based on accuracy, precision, recall, F1-score, and AUC-ROC.

Experiment 1:
Performance Comparison of Machine Learning Models using GloVe and GloVe + PCA
In this experiment, we compare the performance of different machine learning models, including SVM and XGBoost, using GloVe and GloVe + PCA embeddings for phishing SMS detection. The models are evaluated based on accuracy, precision, recall, F1-score, and AUC-ROC. The results indicate that XGBoost with GloVe + PCA achieves the highest accuracy, demonstrating its effectiveness in feature extraction and classification. The table below presents the detailed comparison of these models.

Model                  Accuracy  Precision  Recall    F1-Score  AUC-ROC
SVM (GloVe)            0.949776  0.861538   0.746667  0.800000  0.966370
SVM (GloVe + PCA)      0.937220  0.803030   0.706667  0.751773  0.965406
XGBoost (GloVe)        0.964126  0.923077   0.800000  0.857143  0.980197
XGBoost (GloVe + PCA)  0.969507  0.946154   0.820000  0.878571  0.981440
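For readers who want to see how the threshold-based metrics in the table relate to one another, the sketch below computes accuracy, precision, recall, and F1-score from confusion-matrix counts. The paper does not report a confusion matrix, so the counts used here are hypothetical; they were chosen only because they reproduce the SVM (GloVe) row to six decimal places. AUC-ROC is not included because it depends on the ranking of scores and cannot be derived from four counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts (an assumption, not reported in the paper):
# 112 spam caught, 18 ham wrongly flagged, 38 spam missed, 947 ham passed.
acc, prec, rec, f1 = classification_metrics(tp=112, fp=18, fn=38, tn=947)
print(f"accuracy={acc:.6f} precision={prec:.6f} recall={rec:.6f} f1={f1:.6f}")
```

The same function also makes it easy to check that each reported F1-score equals the harmonic mean of the precision and recall in its row.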

Experiment 2:
Impact of Feature Extraction on Model Performance
In this experiment, we analyze the impact of different feature extraction techniques on model performance. We compare the results of models using GloVe and GloVe + PCA to assess how dimensionality reduction affects classification. The results demonstrate that GloVe alone provides strong performance. In the Experiment 1 results, adding PCA improves XGBoost on every reported metric, whereas for SVM every reported metric is slightly lower with PCA. The table below summarizes the performance variations. These findings highlight the effectiveness of XGBoost in phishing detection and show that combining GloVe with PCA enhances its performance.

Model          Accuracy  Precision  Recall    F1-Score  AUC-ROC
SVM            0.949776  0.861538   0.746667  0.800000  0.966366
XGBoost        0.964126  0.923077   0.800000  0.857143  0.980197
SVM + XGBoost  0.968610  0.945736   0.813333  0.874552  0.975320
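The dimensionality-reduction step examined in this experiment can be illustrated with a small, dependency-free sketch of PCA. It estimates only the first principal component, using power iteration on the sample covariance matrix, and projects each point onto it. The two-dimensional points are invented for the example, and the paper does not state how many components were kept or which PCA implementation was used, so this is a sketch of the technique rather than the authors' configuration.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the top principal direction with power iteration."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    direction = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * direction[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        direction = [x / norm for x in nxt]
    return means, direction


def project(row, means, direction):
    """Coordinate of one row along the principal direction."""
    return sum((row[j] - means[j]) * direction[j] for j in range(len(row)))


# Toy 2-D points lying close to the line y = 2x (not data from the paper).
points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
means, pc1 = first_principal_component(points)
reduced = [project(p, means, pc1) for p in points]
print(pc1)
print(reduced)
```

Here two correlated coordinates collapse into a single number per point with little loss of information, which is the effect PCA is intended to have on high-dimensional embedding features.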

Conclusion:
In this study, we explored machine learning approaches for phishing SMS detection. Our analysis showed that, among the models considered (SVM, Random Forest, XGBoost, and Naïve Bayes), the highest accuracy was achieved using GloVe-based feature representation, with XGBoost on GloVe + PCA features performing best in the reported experiments. The results indicate that word embeddings combined with dimensionality reduction techniques can improve classification performance. However, the study was limited to English-language SMS and a relatively small dataset. In the future, we aim to extend this research to multilingual datasets, deep learning-based approaches, and real-time phishing detection systems to enhance security against evolving cyber threats.

Acknowledgement:
I would like to express my sincere gratitude to my supervisor, Dr. Tanu Gupta, for their invaluable guidance, constructive feedback, and continuous support throughout this research. I am also grateful to K.R. Mangalam University for providing the necessary resources and academic environment to conduct this study. Additionally, I acknowledge the contributions of researchers in the field whose work has inspired and informed my research.

References:
[1] K. Haynes, H. Shirazi, and I. Ray (2021). Lightweight URL-based phishing detection using natural language processing transformers for mobile devices. Procedia Computer Science, 191, 127–134. https://doi.org/10.1016/j.procs.2021.07.040
[2] A. Abbasi, D. Dobolyi, A. Vance, and F. M. Zahedi (2021). The Phishing Funnel Model: A Design Artifact to Predict User Susceptibility to Phishing Websites. Information Systems Research, 32(2), 410–436. https://doi.org/10.1287/isre.2020.0973
[3] A. Aljofey, Q. Jiang, Q. Qu, M. Huang, and J. P. Niyigena (2020). An Effective Phishing Detection Model Based on Character Level Convolutional Neural Network from URL. Electronics (Basel), 9(9), 1514. https://doi.org/10.3390/electronics9091514
[4] C. Jones (2022, January 18). 50 Phishing Stats You Should Know In 2022. Retrieved January 27, 2022, from https://expertinsights.com/insights/50-phishing-stats-you-should-know/
[5] A. Hannousse and S. Yahiouche (2021). Web page phishing detection. Mendeley Data, V3. doi: 10.17632/c2gw7fy2j4.3
[6] G. K. Soon, C. O. Kim, N. M. Rusli, T. S. Fun, R. Alfred, and T. T. Guan (2020). Comparison of simple feedforward neural network, recurrent neural network, and ensemble neural networks in phishing detection. Journal of Physics: Conference Series, 1502(1), 012033. https://doi.org/10.1088/1742-6596/1502/1/012033
[7] Anti-Phishing Working Group (APWG) (2014). Phishing Activity Trends Report, 3rd Quarter 2021 [Online]. Retrieved Feb 9, 2022, from https://docs.apwg.org/reports/apwg_trends_report_q3_2021.pdf
[8] M. Kearns and A. Roth (2022, March 9). Ethical Algorithm Design Should Guide Technology Regulation. Brookings. Retrieved June 9, 2022, from https://www.brookings.edu/research/ethical-algorithm-design-should-guide-technology-regulation/#footnote-3
