SMS Spam Filtering System Hybrid Approaches

The project focuses on developing hybrid approaches for SMS spam filtering to address challenges such as evolving spam tactics and dataset imbalance. By integrating multiple machine learning techniques, including Positive-Unlabeled learning, reinforcement learning, and Generative Adversarial Networks, the proposed system aims to enhance accuracy, reduce false positives, and adapt to new spam patterns. The system will feature a user-friendly web interface for real-time classification and insights on model performance.

Uploaded by

anirudh.v4444
Department of Computer Science and Engineering

Title of the Project


“Hybrid Approaches for SMS Spam Filtering”
Abstract
With the rise of digital communication, spam messages have become a major concern,
leading to privacy risks, financial fraud, and a poor user experience. Traditional spam
detection methods often struggle to keep up with evolving evasion techniques, suffer from
high false positive rates, and face challenges in handling imbalanced datasets where spam
messages are significantly outnumbered. As spam continues to grow more sophisticated, there
is a critical need for more effective, adaptive, and high-accuracy filtering systems.

To address these challenges, a hybrid approach combining multiple machine learning
techniques can significantly enhance spam detection. Positive-Unlabeled (PU) learning with
RoBERTa embeddings helps mitigate dataset imbalance by leveraging unlabeled data, while
reinforcement learning with DistilRoBERTa improves adaptability to adversarial spam.
BiLSTM models with DistilBERT embeddings strengthen sequential pattern recognition, and
Generative Adversarial Networks (GANs) generate synthetic spam embeddings to augment
training data. By integrating these techniques, spam filtering systems can achieve higher
accuracy, lower false positives, and better resilience against evolving spam tactics, making
them more reliable for real-world applications.
Introduction
• The widespread use of digital communication has resulted in an overwhelming increase in
spam messages, posing serious risks such as fraud, phishing scams, and data breaches.
• Spammers constantly develop new evasion tactics, making it difficult for traditional spam
filters to keep up and effectively detect malicious messages.
• Existing spam detection systems face several challenges, including dataset imbalance,
where spam messages are far fewer than non-spam, leading to biased models.
• Many filters suffer from low detection accuracy and high false positives, misclassifying
legitimate messages as spam.
• Advanced spamming techniques, such as obfuscation, adversarial attacks, and content
manipulation, allow spam messages to bypass conventional detection methods.
• Hybrid machine learning approaches can help overcome these limitations by combining
multiple models to improve accuracy, adapt to evolving spam patterns, and handle
imbalanced datasets more effectively.
• These techniques enable more robust and adaptive filtering systems, ensuring better spam
detection while minimizing false positives.
Literature survey
• Based on the IEEE 2024 paper titled "Investigating Evasive Techniques in SMS Spam
Filtering: A Comparative Analysis of Machine Learning Models."
• The paper reviews extensive research on SMS spam detection and contributes a
dataset of 60,000 SMS messages. The dataset includes spam and legitimate messages,
collected through various online methods and volunteer contributions.
• It evaluates multiple machine learning and deep learning models for SMS spam
detection, focuses on identifying the key features essential for effective spam
detection, and aims to enhance spam detection capabilities and address gaps in
existing technologies.
• Also based on "Enhancing Spam Detection with GANs and BERT Embeddings: A Novel
Approach to Imbalanced Datasets," published in 2024.
• That paper addresses dataset imbalance in existing systems by using GANs to
generate synthetic messages that balance the dataset.
Objectives
• 1. Integrate Hybrid Machine Learning Approaches

Utilize a combination of machine learning techniques to improve the system’s
ability to detect spam, handle diverse message patterns, and strengthen overall
robustness.
• 2. Address Dataset Imbalance Issues

Implement techniques to handle the imbalance between spam and non-spam
messages, ensuring that the model remains unbiased and accurately identifies
spam.
• 3. Improve Accuracy and Reduce False Positives

Enhance the performance of spam classification by increasing detection
accuracy while minimizing false positives, ensuring that legitimate messages
are not incorrectly flagged as spam.
• 4. Develop an Adaptive Spam Detection System

Create a spam filtering model that can effectively detect and classify spam
messages while adapting to evolving evasion techniques used by spammers.
• 5. Develop a Scalable and User-Friendly System

Build a web-based interface that allows users to classify SMS messages in
real time, providing insights on detection accuracy and model performance.
Problem Statement
1. With the increasing reliance on digital communication, spam messages have become a
significant challenge, leading to fraudulent activities, phishing scams, and data breaches.
Traditional spam detection systems struggle to keep up with the rapidly evolving evasion
techniques used by spammers, making them less effective over time. Additionally, dataset
imbalance, where spam messages are significantly fewer than non-spam, leads to biased
models that fail to accurately classify spam. Many existing filters suffer from low detection
accuracy and high false positive rates, misclassifying legitimate messages as spam while
allowing sophisticated spam messages to bypass detection.

2. To address these challenges, there is a need for an advanced and adaptive spam
filtering system that leverages hybrid machine learning techniques to improve detection
accuracy, handle imbalanced datasets, and adapt to new and adversarial spam
patterns. By integrating multiple learning approaches, an effective spam filtering
solution can be developed to enhance robustness, minimize false positives, and ensure
reliable message classification in real-world applications.
Existing systems
1. Two-Class Classifiers:

Support Vector Machines (SVM): Effective for binary classification tasks.
2. One-Class Classifiers:

One-Class SVM: Identifies anomalies; less common in spam detection.
3. Positive and Unlabeled (PU) Learning:

PU Learning: Deals with datasets of positive and unlabeled examples.
4. Deep Learning Models:

Neural Networks, RNNs, CNNs, Transformers: Advanced models for improved
spam detection.
5. Multiple pretrained-embedding and deep learning models are used:

BERT, ELMo, RoBERTa, DistilBERT, LSTM, BiLSTM, CNN, TCN,
Ensemble (CNN + BiGRU).
Drawbacks
• 1. Poor Adaptability to Evolving Spam Techniques

Traditional filters struggle to keep up with constantly changing spam tactics, such as
obfuscation, adversarial text modifications, and disguised phishing attempts.
• 2. High False Positives and False Negatives

Many spam detection systems misclassify legitimate messages as spam (false
positives) or fail to detect actual spam (false negatives), leading to unreliable
filtering.
• 3. Bias Due to Dataset Imbalance

Since spam messages are significantly fewer than non-spam in most datasets, models
often become biased toward classifying messages as non-spam, reducing detection
accuracy.
• 4. Limited Contextual and Semantic Understanding

Basic machine learning and rule-based models struggle to understand the meaning
behind messages, making them ineffective against sophisticated and
context-dependent spam.
• 5. Vulnerability to Manipulated Spam Content

Spammers exploit weaknesses in existing filters by modifying message structures,
using misspellings, special characters, and hidden text to bypass detection.
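Drawbacks 2 and 3 can be illustrated with a minimal sketch. The counts below are made up for illustration (they are not taken from any dataset used in this project): a biased model that labels every message as non-spam still scores high accuracy on an imbalanced test set while catching no spam at all.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 95 legitimate messages (0), 5 spam (1).
y_true = [0] * 95 + [1] * 5
# A biased model that never predicts spam.
always_legitimate = [0] * 100

metrics = classification_metrics(y_true, always_legitimate)
print(metrics)
```

Here accuracy is 0.95 while recall and F1 are both 0, which is why precision, recall, and F1-score are reported alongside accuracy.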
Proposed systems
• Hybrid Approaches:
• Combined Models: Integrated techniques for better performance.
• Improved handling of evasion tactics.
• Frozen (not fine-tuned) embeddings feeding a separate classifier may give
better accuracy when labeled data is limited.
• Two-step PU learning with RoBERTa embeddings and an XGBoost model on top.
• Semi-dynamic reinforcement learning (Actor-Critic with PPO, Proximal Policy
Optimization) combined with fine-tuning the DistilRoBERTa model.
• A GAN generates synthetic spam message embeddings (BERT embeddings) using an
MLP-based generator and discriminator, increasing training data diversity to
improve spam detection robustness.
Architecture:
Architecture of ML Models:
1. DistilBERT embeddings with a BiLSTM layer model.
2. Two-step PU learning with RoBERTa embeddings.
3. Reinforcement learning model (Actor-Critic) with a fine-tuned DistilRoBERTa model and a PPO agent.
4. GAN model to generate synthetic BERT embeddings and to train and test a classifier.
Modules
1. Data Processing Module:
• Load the dataset (labeled & unlabeled SMS data).
• Preprocess text (cleaning, tokenization, lowercasing, removing stop words, etc.).
• Convert text into embeddings (DistilBERT, RoBERTa, etc.).
• Split dataset into training and testing sets.
2. Model Training Module (Four Models):
(a) PU Learning Model
• Use RoBERTa embeddings with a two-step PU learning approach.
• Train an XGBoost classifier on the embeddings.
• Save the trained model for later use.
(b) GAN Model for Data Augmentation
• Train MLP-based GAN to generate synthetic BERT embeddings.
• Use the Discriminator to filter for realistic embeddings.
• Save generated embeddings to improve spam classification.
(c) RL-based Model (Actor-Critic with PPO)
• Fine-tune DistilRoBERTa as the Actor network.
• Use PPO to train the model with reinforcement learning.
• Save the trained actor model with tokenizer.
(d) BiLSTM Model with DistilBERT Embeddings
• Convert SMS texts into DistilBERT embeddings.
• Train a BiLSTM classifier on the embeddings.
• Save the trained BiLSTM model.
3. Model Testing & Evaluation Module
• Test each trained model on the adversarial dataset.
• Measure performance metrics (accuracy, F1-score, precision, recall).
• Store results for comparison.
4. API & Backend (Django) Module
• Set up Django REST API with endpoints.
• Load all trained models in the backend.
• Process incoming SMS text or file uploads.
• Select the appropriate ML model for classification.
• Return results (spam/not spam + metrics).
5. Frontend (React) Module
• Single text input → Display spam/not spam result.
• File upload → Show accuracy, F1-score, etc.
• Model selection → Choose which model to use.
• User-friendly UI for smooth experience.
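The text pre-processing step of the Data Processing Module can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the stop word list and the cleaning rules are assumptions chosen for the example, not the project's actual configuration.

```python
import re

# Tiny illustrative stop word list (an assumption; a real system would
# use a fuller list from an NLP library).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip URLs and punctuation, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # remove symbols
    tokens = text.split()                               # whitespace tokens
    return [tok for tok in tokens if tok not in stop_words]


tokens = preprocess("WIN a FREE prize!!! Claim at http://spam.example now")
print(tokens)  # ['win', 'free', 'prize', 'claim', 'at', 'now']
```

Whether stop word removal helps before computing transformer embeddings is a design choice worth validating, since those models are usually pretrained on unfiltered text.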
BiLSTM with DistilBERT Embeddings Model Modules
1. Data Processing Module
• Load spam and non-spam SMS messages.
• Preprocess text: cleaning, tokenization, stop word removal.
• Convert text into DistilBERT embeddings (frozen, not fine-tuned).
2. Model Training Module
Step 1: Train BiLSTM Network
• Feed precomputed DistilBERT embeddings into a BiLSTM (Bidirectional Long Short-Term Memory) model.
• The BiLSTM captures sequential patterns in the embeddings to classify spam vs. non-spam.
Step 2: Optimize BiLSTM Model
• Train the BiLSTM using cross-entropy loss for classification.
• Optimize BiLSTM weights using the Adam optimizer.
• DistilBERT embeddings remain frozen (not updated during training).
Step 3: Hyperparameter Tuning
• Adjust learning rate, batch size, number of BiLSTM layers, and sequence length to improve performance.
3. Model Testing Module
• Evaluate the BiLSTM model on a test dataset.
• Compare performance using accuracy, precision, recall, and F1-score.
• Store the trained BiLSTM model for deployment.
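The cross-entropy loss used in Step 2 can be worked through by hand. The sketch below uses only the standard library and assumes the classifier outputs two logits per message (index 0 for non-spam, index 1 for spam); the logit values are invented for the example.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


def batch_loss(batch_logits, targets):
    """Mean cross-entropy over a batch of messages."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Hypothetical logits for three messages; labels: 1 = spam, 0 = non-spam.
batch_logits = [[0.2, 2.5], [3.0, -1.0], [0.0, 0.0]]
targets = [1, 0, 1]
print(round(batch_loss(batch_logits, targets), 4))
```

An undecided prediction (equal logits) costs ln 2 ≈ 0.693, and the loss shrinks toward 0 as the model grows confident in the correct class.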
Two-Step PU Learning Model Modules
1. Data Processing Module
• Load labeled spam (P) and unlabeled (U) messages.
• Preprocess text: tokenization, lowercasing, stop word removal, and cleaning.
• Convert text into RoBERTa embeddings for feature representation.
2. Model Training Module
Step 1: Identify Reliable Negatives (RN)
• Train XGBoost on P (spam) vs. U (unlabeled) samples, treating U as negative.
• Assign spam probability scores to U samples; those with the lowest scores are taken as high-confidence non-spam (RN).
Step 2: Train Final Classifier (P vs. RN)
• Train a second XGBoost classifier using P (spam) and RN (non-spam).
3. Model Testing Module
• Test the trained PU learning model on a separate adversarial dataset.
• Evaluate performance: accuracy, precision, recall, F1-score.
• Store the trained PU learning model for deployment.
Outcome: A spam classifier trained using PU learning, effectively distinguishing between spam and non-spam.
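The two-step procedure can be sketched on toy data. To keep the example dependency-free, a simple nearest-centroid scorer is swapped in for XGBoost, and 2-D points stand in for RoBERTa embeddings; the `rn_fraction` parameter and all data values are assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit(positives, negatives):
    """Stand-in 'classifier': one centroid per class (replaces XGBoost)."""
    return centroid(positives), centroid(negatives)


def spam_score(x, model):
    """Score in [0, 1]; higher means closer to the spam centroid."""
    c_pos, c_neg = model
    d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
    return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)


def two_step_pu(P, U, rn_fraction=0.75):
    # Step 1: train on P vs. U (U treated as negative), then keep the
    # lowest-scoring share of U as reliable negatives (RN).
    step1 = fit(P, U)
    ranked = sorted(U, key=lambda x: spam_score(x, step1))
    RN = ranked[:max(1, int(len(U) * rn_fraction))]
    # Step 2: train the final classifier on P vs. RN.
    return fit(P, RN), RN


# Toy data: U hides one spam-like point, [1.0, 0.9].
P = [[1.0, 1.0], [0.9, 1.1]]
U = [[0.0, 0.0], [0.1, -0.1], [1.0, 0.9], [-0.1, 0.1]]
final_model, reliable_negatives = two_step_pu(P, U)
print(reliable_negatives)
```

In this toy run the hidden spam-like point is excluded from RN, so the final classifier is not trained to treat it as non-spam.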
RL-Based Model (Actor-Critic) with PPO Agent Modules
1. Data Processing Module
• Load spam and non-spam messages for reinforcement learning.
• Preprocess text: tokenization, lowercasing, stop word removal, and embedding conversion.
• Convert SMS messages into DistilRoBERTa embeddings for feature extraction.
2. Model Training Module
Step 1: Fine-Tune DistilRoBERTa as the Actor
• The Actor model (DistilRoBERTa) learns to classify spam messages.
Step 2: Train Critic for Reward Optimization
• Rewards are assigned based on classification correctness; a Critic model estimates the expected reward to guide the Actor's updates.
Step 3: Optimize Using PPO (Proximal Policy Optimization)
• PPO updates the Actor model to refine spam classification over time.
3. Model Testing Module
• Test the RL-trained model on an adversarial dataset.
• Evaluate improvements in classification accuracy using reward signals.
• Store the trained RL-based model for future use.
Outcome: A self-improving spam classifier that dynamically optimizes its performance using reinforcement learning.
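The PPO update in Step 3 rests on the clipped surrogate objective. The sketch below computes it for a single classification decision; the +1/-1 reward scheme, the one-step advantage (reward minus the Critic's value estimate), and the clip range of 0.2 are assumptions for illustration rather than the project's settings.

```python
import math


def classification_reward(predicted, actual):
    """Assumed reward scheme: +1 for a correct label, -1 otherwise."""
    return 1.0 if predicted == actual else -1.0


def advantage(reward, value_estimate):
    """One-step advantage: actual reward minus the Critic's estimate."""
    return reward - value_estimate


def ppo_clipped_objective(new_logp, old_logp, adv, clip_eps=0.2):
    """PPO clipped surrogate for one action (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * adv, clipped * adv)


# The Actor labeled a spam message as spam; the old policy gave that
# label probability 0.5, the updated policy gives it 0.8.
reward = classification_reward("spam", "spam")
adv = advantage(reward, value_estimate=0.2)
objective = ppo_clipped_objective(math.log(0.8), math.log(0.5), adv)
print(round(objective, 4))  # 0.96
```

The probability ratio is 1.6, but clipping caps its contribution at 1.2, which limits how far a single update can move the Actor away from the previous policy.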
GAN-Based Model Modules
1. Data Processing Module
• Load spam messages to generate additional synthetic spam data.
• Preprocess text: cleaning, tokenization, and embedding conversion.
• Convert real spam messages into BERT-based embeddings as input.
2. Model Training Module
Step 1: Train Generator (MLP-based GAN)
• A Generator (MLP) learns to create realistic synthetic spam embeddings.
Step 2: Train Discriminator
• A Discriminator (MLP) learns to differentiate between real and synthetic spam embeddings.
Step 3: Filter High-Quality Embeddings
• The Discriminator filters out low-quality synthetic embeddings, keeping only realistic spam representations.
3. Model Testing Module
• Evaluate synthetic embeddings for realism using the Discriminator.
• Train a separate spam classifier on the augmented dataset (real + synthetic spam embeddings).
• Store synthetic spam embeddings and the improved classifier for deployment.
Outcome: A GAN-based data augmentation approach, improving spam detection by generating realistic synthetic spam embeddings.
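The filtering step (Step 3) can be sketched as a threshold on the Discriminator's output. In this illustration a single logistic unit with hand-picked weights stands in for the trained MLP Discriminator, 2-D vectors stand in for BERT embeddings, and the 0.5 threshold is an assumed cut-off.

```python
import math


def discriminator_score(embedding, weights, bias):
    """Hypothetical stand-in for a trained Discriminator: returns the
    estimated probability that an embedding is real."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def filter_synthetic(embeddings, score_fn, threshold=0.5):
    """Keep only synthetic embeddings the Discriminator rates as realistic."""
    return [e for e in embeddings if score_fn(e) >= threshold]


def score(e):
    # Hand-picked weights and bias, for illustration only.
    return discriminator_score(e, weights=[1.0, 1.0], bias=-1.0)


real_spam = [[0.9, 1.1], [1.0, 0.8]]
synthetic = [[1.0, 1.0], [0.0, 0.0], [0.6, 0.6]]

kept = filter_synthetic(synthetic, score)
augmented_spam = real_spam + kept  # real + filtered synthetic embeddings
print(kept)  # [[1.0, 1.0], [0.6, 0.6]]
```

Raising the threshold trades quantity for realism: fewer synthetic embeddings survive, but those that do are harder to tell apart from real spam.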
Backend (Django) Deployment Module
1. Set Up Django REST API
Endpoints:
• "/predict" → Classifies a single SMS.
• "/predict_file" → Classifies multiple SMS messages from an uploaded file.
2. Load Trained ML Models
• Stores four models (PU Learning, GAN-based, RL-based, and BiLSTM).
• Dynamically loads and applies the selected model.
3. Process Incoming Requests
• Cleans, tokenizes, and processes SMS messages.
• Converts text into embeddings (if required by the selected model).
4. Model Classification & Response
• Runs text through the chosen ML model.
• Single SMS → Returns "Spam" or "Not Spam".
• Batch file → Returns accuracy, precision, recall, and F1-score.
Outcome: A Django REST API that enables real-time spam detection for single or bulk SMS classification.
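The request-handling logic behind the two endpoints can be sketched independently of Django. Everything named here is hypothetical: the registry keys, the payload fields `model` and `text`, the CSV columns `text` and `label`, and the keyword rule that stands in for the trained models. Only accuracy is computed, to keep the sketch short.

```python
import csv
import io


def placeholder_model(text):
    """Hypothetical stand-in for a trained classifier (keyword rule only)."""
    return "Spam" if "free" in text.lower() else "Not Spam"


# Assumed registry keys; a real backend would map each to a loaded model.
MODEL_REGISTRY = {
    "pu": placeholder_model,
    "gan": placeholder_model,
    "rl": placeholder_model,
    "bilstm": placeholder_model,
}


def predict(payload, registry=MODEL_REGISTRY):
    """Classify one SMS; returns (status_code, response_body)."""
    model = registry.get(payload.get("model"))
    text = (payload.get("text") or "").strip()
    if model is None:
        return 400, {"error": "unknown model"}
    if not text:
        return 400, {"error": "empty message"}
    return 200, {"label": model(text)}


def predict_file(csv_text, model_name, registry=MODEL_REGISTRY):
    """Classify each row of a CSV with assumed columns `text`, `label`."""
    model = registry.get(model_name)
    if model is None:
        return 400, {"error": "unknown model"}
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 400, {"error": "empty file"}
    predictions = [model(row["text"]) for row in rows]
    correct = sum(1 for row, pred in zip(rows, predictions)
                  if row["label"] == pred)
    return 200, {"predictions": predictions,
                 "accuracy": correct / len(rows)}


print(predict({"model": "pu", "text": "Get a FREE phone"}))
```

Keeping this logic in plain functions lets the API views stay thin: a view would parse the request, call `predict` or `predict_file`, and serialize the returned body.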
Frontend (React) Module
1. Single Text Input
• User enters a message and selects a model.
• Clicks "Classify" → Backend returns spam/not spam.
2. File Upload for Batch Classification
• User uploads a CSV file with multiple SMS messages.
• Backend processes the file and returns evaluation metrics: accuracy, precision, recall, F1-score.
3. Model Selection Dropdown
• Users select from PU Learning, GAN-based, RL-based, or BiLSTM models.
• Selection is sent to the backend to use the corresponding model.
4. User-Friendly UI
• Clear results display for both single and batch classification.
• Error handling for invalid inputs or file issues.
• Mobile-friendly design for ease of access.
Outcome: A React-based interface that allows users to input SMS messages, choose models, and view classification results easily.
Sequence Diagram (work flow of Application)
Output screens (Single input)
Output screens (Bulk input)
Future Enhancements
1. Advanced Model Architectures
• Fine-Tune Transformer Models: Instead of using frozen embeddings, fine-tune DistilBERT, RoBERTa, or DeBERTa for domain-specific spam detection.
• Hybrid Models: Combine BiLSTM with Attention Mechanisms or CNNs for improved sequence learning.
• Self-Supervised Learning: Leverage masked language models (MLMs) to improve spam detection with minimal labeled data.
2. Smarter Data Processing & Augmentation
• Context-Aware Spam Detection: Use named entity recognition (NER) and topic modeling to detect sophisticated spam patterns.
• Diverse Data Augmentation: Implement back-translation, paraphrasing, and synthetic data generation to improve generalization.
• Real-Time Data Collection: Continuously update the dataset with real-world spam trends to keep models up to date.
3. Robust & Adaptive Learning Techniques
• Online Learning & Active Learning: Allow the model to learn from new spam patterns in real time through user feedback.
• Adversarial Training: Expose the model to crafted adversarial spam messages to enhance resilience.
• Multi-Modal Learning: Integrate image and audio analysis for detecting spam in multimedia messages (MMS).
4. Efficient & Scalable Deployment
• Edge AI & Lightweight Models: Deploy a compressed version of the model for real-time mobile or IoT-based spam detection.
• Federated Learning: Train models collaboratively across multiple devices while preserving user privacy.
• Multi-Language Support: Extend spam filtering to different languages and regional dialects for wider applicability.
Conclusion
The increasing sophistication of spam messages necessitates advanced and
adaptive filtering techniques beyond traditional methods. This project
leverages a hybrid machine learning approach to address key challenges such
as dataset imbalance, evolving spam tactics, and high false positive rates. By
integrating multiple models, including PU learning, reinforcement learning,
deep learning, and GAN-based data augmentation, the system enhances spam
detection accuracy and robustness. The implementation of a scalable, real-time
classification system ensures practical usability, making it more effective
in identifying spam while minimizing false positives. This approach not only
improves current spam filtering performance but also lays the foundation for
future advancements in combating increasingly complex spam threats.