Sms Spam Filtering System Hybrid Approaches
2. To address these challenges, there is a need for an advanced and adaptive spam
filtering system that leverages hybrid machine learning techniques to improve detection
accuracy, handle imbalanced datasets, and adapt to new and adversarial spam
patterns. By integrating multiple learning approaches, an effective spam filtering
solution can be developed to enhance robustness, minimize false positives, and ensure
reliable message classification in real-world applications.
Existing systems
1. Two-Class Classifiers:
Support Vector Machines (SVM): Effective for binary classification tasks.
2. One-Class Classifiers:
One-Class SVM: Identifies anomalies; less common in spam detection.
3. Positive and Unlabeled (PU) Learning:
PU Learning: Deals with datasets of positive and unlabeled examples.
4. Deep Learning Models:
Neural Networks, RNNs, CNNs, Transformers: Advanced models for improved
spam detection.
5. Specific Transformer-Based and Other Neural Models in Use:
BERT, ELMo, RoBERTa, DistilBERT, LSTM, BiLSTM, CNN, TCN, and an
ensemble (CNN + BiGRU).
Drawbacks
1. Poor Adaptability to Evolving Spam Techniques
Traditional filters struggle to keep up with constantly changing spam tactics, such as
obfuscation, adversarial text modifications, and disguised phishing attempts.
2. High False Positives and False Negatives
Many spam detection systems misclassify legitimate messages as spam (false
positives) or fail to detect actual spam (false negatives), leading to unreliable
filtering.
3. Bias Due to Dataset Imbalance
Since spam messages are significantly fewer than non-spam in most datasets, models
often become biased toward classifying messages as non-spam, reducing detection
accuracy.
4. Limited Contextual and Semantic Understanding
Basic machine learning and rule-based models struggle to understand the meaning
behind messages, making them ineffective against sophisticated and
context-dependent spam.
5. Vulnerability to Manipulated Spam Content
Spammers exploit weaknesses in existing filters by modifying message structures and
using misspellings, special characters, and hidden text to bypass detection.
Proposed systems
• Hybrid Approaches:
• Combined Models: Integrated techniques for better performance.
• Improved handling of evasion tactics.
• Using frozen embeddings with a downstream model, which might give better
accuracy given the limited data available.
• Using the two-step PU learning method with RoBERTa embeddings
and an XGBoost model on top.
• Using semi-dynamic reinforcement learning (Actor-Critic with
PPO, Proximal Policy Optimization) along with fine-tuning the
DistilRoBERTa model.
• The GAN model generates synthetic spam message embeddings (BERT
Embeddings) using an MLP based generator and discriminator,
enhancing training data diversity to improve spam detection
robustness.
Architecture:
Architecture of ML Models:
[Figure: architecture diagram of the ML models. Recoverable label: 4. GAN model to generate synthetic BERT embeddings and train and test a classifier.]
Modules
1. Data Processing Module
Load the dataset (labeled & unlabeled SMS data).
Preprocess text (cleaning, tokenization, lowercasing, removing stop words, etc.).
Convert text into embeddings (DistilBERT, RoBERTa, etc.).
Split dataset into training and testing sets.
2. Model Training Module (Four Models):
(a) PU Learning Model
Use RoBERTa embeddings with a two-step PU learning approach.
Train an XGBoost classifier on the embeddings.
Save the trained model for later use.
(b) GAN Model for Data Augmentation
Train MLP-based GAN to generate synthetic BERT embeddings.
Evaluate the Discriminator to filter realistic embeddings.
Save generated embeddings to improve spam classification.
(c) RL-based Model (Actor-Critic with PPO)
Fine-tune LightRoBERTa as the Actor network.
Use PPO to train the model with reinforcement learning.
Save the trained actor model with tokenizer.
(d) BiLSTM Model with DistilBERT Embeddings
Convert SMS texts into DistilBERT embeddings.
Train a BiLSTM classifier on the embeddings.
Save the trained BiLSTM model.
3. Model Testing & Evaluation Module
Test each trained model on the adversarial dataset.
Measure performance metrics (accuracy, F1-score, precision, recall).
Store results for comparison.
4. API & Backend (Django) Module
Set up Django REST API with endpoints.
Load all trained models in the backend.
Process incoming SMS text or file uploads.
Select the appropriate ML model for classification.
Return results (spam/not spam + metrics).
5. Frontend (React) Module
Single text input → Display spam/not spam result.
File upload → Show accuracy, F1-score, etc.
Model selection → Choose which model to use.
User-friendly UI for smooth experience.
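The text preprocessing listed in the Data Processing Module (cleaning, lowercasing, tokenization, stop word removal) could look roughly like the sketch below. The function name `preprocess`, the tiny `STOP_WORDS` set, and the exact cleaning rules are assumptions made for illustration; the slides do not specify them.

```python
import re
from typing import List

# Hypothetical, deliberately small stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for"}


def preprocess(text: str) -> List[str]:
    """Lowercase, strip URLs and symbols, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # remove punctuation and symbols
    tokens = text.split()                               # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("WIN a FREE prize!!! Visit http://spam.example.com now"))
# ['win', 'free', 'prize', 'visit', 'now']
```

The cleaned tokens would then be passed to the chosen embedding model (DistilBERT, RoBERTa, etc.) before the train/test split.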
BiLSTM with DistilBERT Embeddings Model Modules
1. Data Processing Module
Load spam and non-spam SMS messages.
Preprocess text: Cleaning, tokenization, stop word removal.
Convert text into DistilBERT embeddings (frozen, not fine-tuned).
2. Model Training Module
Step 1:
Train BiLSTM Network
Feed precomputed DistilBERT embeddings into a BiLSTM (Bidirectional Long Short-Term Memory) model.
BiLSTM captures sequential patterns in the embeddings to classify spam vs. non-spam.
Step 2:
Optimize BiLSTM Model
Train BiLSTM using cross-entropy loss for classification.
Optimize BiLSTM weights using the Adam optimizer.
DistilBERT embeddings remain frozen (not updated during training).
Step 3:
Hyperparameter Tuning
Adjust learning rate, batch size, BiLSTM layers, and sequence length to improve performance.
3. Model Training & Testing Module
Evaluate BiLSTM model on a test dataset.
Compare performance using accuracy, precision, recall, and F1-score.
Store the trained BiLSTM model for deployment.
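The evaluation step above compares models by accuracy, precision, recall, and F1-score. A minimal dependency-free sketch of those four metrics follows, assuming spam is encoded as label 1 and non-spam as 0; the function name `binary_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for a binary spam classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham passed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Example: 2 true positives, 1 false negative, 1 false positive, 2 true negatives.
print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

Because spam is the minority class in most SMS datasets, precision, recall, and F1 on the spam class are usually more informative than accuracy alone.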
Two-Step PU Learning Model Modules
1. Data Processing Module
Load labeled spam (P) and unlabeled (U) messages.
Preprocess text: Tokenization, lowercasing, stop word removal, and cleaning.
Convert text into RoBERTa embeddings for feature representation.
2. Model Training Module
Step 1:
Identify Reliable Negatives (RN)
Train XGBoost on P (Spam) vs. U (Unlabeled) samples.
Assign probability scores to U samples to determine high-confidence non-spam (RN).
Step 2:
Train Final Classifier (P vs. RN)
Train a second XGBoost classifier using P (Spam) and RN (Non-Spam).
3. Model Testing Module
Test the trained PU learning model on a separate adversarial dataset.
Evaluate performance: Accuracy, Precision, Recall, F1-score.
Store trained PU learning model for deployment.
Outcome: A spam classifier trained using PU learning, effectively distinguishing between spam and non-spam.
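The two-step procedure above can be sketched as follows. To keep the example runnable without dependencies, a tiny centroid-based scorer stands in for XGBoost and plain lists of floats stand in for RoBERTa embeddings; `CentroidScorer`, `two_step_pu`, and the `rn_threshold` value of 0.3 are all assumptions made for illustration, not part of the described system.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def _mean(rows: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


class CentroidScorer:
    """Hypothetical stand-in for XGBoost: scores by distance to class centroids."""

    def fit(self, X: List[Vector], y: List[int]) -> "CentroidScorer":
        self.pos = _mean([x for x, label in zip(X, y) if label == 1])
        self.neg = _mean([x for x, label in zip(X, y) if label == 0])
        return self

    def predict_proba(self, X: List[Vector]) -> List[float]:
        """Return one spam score in [0, 1] per vector (1 = close to spam)."""
        scores = []
        for x in X:
            d_pos, d_neg = math.dist(x, self.pos), math.dist(x, self.neg)
            scores.append(d_neg / (d_pos + d_neg) if d_pos + d_neg else 0.5)
        return scores


def two_step_pu(
    P: List[Vector],
    U: List[Vector],
    make_clf: Callable[[], CentroidScorer] = CentroidScorer,
    rn_threshold: float = 0.3,
) -> Tuple[CentroidScorer, List[Vector]]:
    # Step 1: train P vs. U, treating every unlabeled message as a provisional negative.
    step1 = make_clf().fit(list(P) + list(U), [1] * len(P) + [0] * len(U))
    scores = step1.predict_proba(list(U))
    # Unlabeled samples with a low spam score become reliable negatives (RN).
    reliable_neg = [u for u, s in zip(U, scores) if s < rn_threshold]
    if not reliable_neg:
        raise ValueError("no reliable negatives found; try a higher rn_threshold")
    # Step 2: train the final classifier on P vs. RN only.
    final = make_clf().fit(list(P) + reliable_neg, [1] * len(P) + [0] * len(reliable_neg))
    return final, reliable_neg


P = [[5.0, 5.0], [6.0, 5.5], [5.5, 6.0]]               # labeled spam
U = [[0.0, 0.0], [0.5, 0.2], [0.2, 0.6], [5.2, 5.4]]   # unlabeled; last one is hidden spam
clf, rn = two_step_pu(P, U)
print(len(rn), clf.predict_proba([[5.2, 5.4]]))
```

The hidden spam sample in `U` is kept out of RN in step 1, so it does not pollute the negative class used to train the final classifier.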
RL-Based Model (Actor-Critic) with PPO Agent Modules
1. Data Processing Module
Load spam and non-spam messages for reinforcement learning.
Preprocess text: Tokenization, lowercasing, stop word removal, and embedding conversion.
Convert SMS messages into LightRoBERTa embeddings for feature extraction.
2. Model Training Module
Step 1:
Fine-Tune LightRoBERTa as the Actor
The Actor model (LightRoBERTa) learns to classify spam messages.
Step 2:
Train Critic for Reward Optimization
A Critic model assigns rewards based on classification correctness.
Step 3:
Optimize Using PPO (Proximal Policy Optimization)
PPO updates the Actor model to refine spam classification over time.
3. Model Training & Testing Module
Test the RL-trained model on an adversarial dataset.
Evaluate improvements in classification accuracy using reward signals.
Store the trained RL-based model for future use.
Outcome: A self-improving spam classifier that dynamically optimizes its performance using reinforcement
learning.
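The PPO update in Step 3 is built around the clipped surrogate objective. The sketch below shows only that objective on plain Python floats, not a full training loop. The +1/-1 reward for correct/incorrect classification, the advantage computed as reward minus the Critic's value estimate, the function names, and `clip_eps = 0.2` are illustrative assumptions; the slides do not give these details.

```python
import math
from typing import Sequence


def classification_reward(predicted: int, label: int) -> float:
    """Assumed reward scheme: +1 for a correct spam/ham decision, -1 otherwise."""
    return 1.0 if predicted == label else -1.0


def advantage(reward: float, critic_value: float) -> float:
    """One-step advantage: how much better the outcome was than the Critic expected."""
    return reward - critic_value


def ppo_clip_loss(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Negative PPO clipped surrogate objective, averaged over a batch.

    new_logp / old_logp are the log-probabilities the current and previous
    Actor assign to the actions (labels) that were actually taken.
    """
    terms = []
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)                                  # pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))  # keep ratio near 1
        terms.append(min(ratio * adv, clipped * adv))              # pessimistic bound
    return -sum(terms) / len(terms)


# A correct prediction the Critic valued at 0.2 yields a positive advantage.
adv = advantage(classification_reward(predicted=1, label=1), critic_value=0.2)
print(ppo_clip_loss([math.log(0.9)], [math.log(0.6)], [adv]))
```

Clipping limits how far a single update can move the Actor away from the policy that collected the data, which is what keeps fine-tuning of the language model stable.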
GAN-Based Model Modules
1. Data Processing Module
Load spam messages to generate additional synthetic spam data.
Preprocess text: Cleaning, tokenization, and embedding conversion.
Convert real spam messages into BERT-based embeddings as input.
2. Model Training Module
Step 1:
Train Generator (MLP-based GAN)
A Generator (MLP) learns to create realistic synthetic spam embeddings.
Step 2:
Train Discriminator
A Discriminator (MLP) learns to differentiate between real and synthetic spam embeddings.
Step 3:
Filter High-Quality Embeddings
The Discriminator filters low-quality synthetic embeddings, keeping only realistic spam representations.
3. Model Testing Module
Evaluate synthetic embeddings for realism using the Discriminator.
Train a separate spam classifier on the augmented dataset (real + synthetic spam embeddings).
Store synthetic spam embeddings and improved classifier for deployment.
Outcome: A GAN-based data augmentation approach, improving spam detection by generating realistic
synthetic spam embeddings.
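The filtering step (Step 3) can be sketched independently of how the GAN is trained. Below, the Generator and Discriminator are passed in as plain callables; the function name `generate_filtered`, the 0.5 threshold, and the toy stand-in models in the usage lines are assumptions for illustration, since real MLP networks and BERT embeddings would need a deep learning framework.

```python
import random
from typing import Callable, List

Embedding = List[float]


def generate_filtered(
    generator: Callable[[List[float]], Embedding],
    discriminator: Callable[[Embedding], float],
    n_keep: int,
    noise_dim: int,
    threshold: float = 0.5,
    max_tries: int = 10_000,
    seed: int = 0,
) -> List[Embedding]:
    """Sample synthetic embeddings and keep only those the Discriminator rates as realistic."""
    rng = random.Random(seed)
    kept: List[Embedding] = []
    for _ in range(max_tries):
        if len(kept) >= n_keep:
            break
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]  # latent input z
        candidate = generator(noise)
        if discriminator(candidate) >= threshold:                # P(real) high enough
            kept.append(candidate)
    return kept


# Toy stand-ins (hypothetical): a linear "generator" and a rule-based "discriminator".
toy_generator = lambda z: [2.0 * v for v in z]
toy_discriminator = lambda e: 1.0 if sum(e) > 0 else 0.0

real_spam = [[0.4, 1.1, 0.3, 0.9]]
synthetic = generate_filtered(toy_generator, toy_discriminator, n_keep=5, noise_dim=4)
augmented = real_spam + synthetic  # real + synthetic spam embeddings for the classifier
print(len(augmented))
```

The augmented set is what the separate spam classifier would then be trained on.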
Backend (Django) Deployment Module