Bug Classification Accuracy Report Updated
Bug reports contain unstructured text, making it difficult to categorize them directly. To
extract meaningful information, we applied TF-IDF (Term Frequency-Inverse Document
Frequency) to identify the most important words in the dataset.
By computing TF-IDF scores, we ranked the top 2000 words that best represent the bug
reports. These words were stored as "extracted keywords", forming the basis for
categorization.
Once we had a list of important words, the next challenge was to group them into
meaningful bug categories. Instead of manually defining these categories, we leveraged
Google T5 (Flan-T5-Large) to perform Zero-Shot Learning (ZSL).
For each extracted keyword, we prompted Google T5 to classify it into a relevant bug
category; rather than choosing from a fixed list, the model generated the categories
dynamically.
The results were stored in a keyword-to-category mapping, which was later used to
classify bug reports.
This approach allowed us to classify bug reports without explicitly training a model,
making it a fully zero-shot learning pipeline.
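The keyword-to-category step can be sketched as follows. The prompt wording and the stub generator are assumptions for illustration; the commented-out transformers snippet shows how a real Flan-T5 pipeline could be plugged in, but it is not executed here.

```python
def build_prompt(keyword: str) -> str:
    """Prompt asking the model to name a bug category for a keyword (wording is an assumption)."""
    return f"Assign a short bug category for the software keyword: '{keyword}'. Category:"

def map_keywords_to_categories(keywords, generate):
    """Build the keyword-to-category mapping; `generate` is any text-generation
    callable (e.g., a Flan-T5 pipeline) taking a prompt and returning a string."""
    return {kw: generate(build_prompt(kw)).strip() for kw in keywords}

# With Hugging Face transformers one could plug in (not run in this sketch):
#   from transformers import pipeline
#   t5 = pipeline("text2text-generation", model="google/flan-t5-large")
#   generate = lambda p: t5(p, max_new_tokens=8)[0]["generated_text"]

# Stub generator with made-up rules so the sketch runs standalone.
fake_rules = {"crash": "Crash/Stability", "memory": "Memory Management"}
generate = lambda prompt: next(
    (cat for kw, cat in fake_rules.items() if kw in prompt), "Other"
)
mapping = map_keywords_to_categories(["crash", "memory", "timeout"], generate)
print(mapping)
```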
2. Refinements Made
To improve classification accuracy, the following refinements were applied:
- Expanded Keyword Sets – Added multi-word phrases (e.g., 'memory leak' instead of
'memory' alone).
- Weighted Phrase Matching – Increased focus on critical phrases instead of single words.
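The two refinements above can be combined in one scoring function: multi-word phrases receive a higher weight than single words, and a report is assigned the highest-scoring category. The phrases, categories, and weight values below are illustrative assumptions, not the report's actual mapping.

```python
# Illustrative weighted keyword-to-category mapping: multi-word phrases carry
# more weight than single words (all entries and weights are assumptions).
WEIGHTED_KEYWORDS = {
    "memory leak": ("Memory Management", 3.0),
    "null pointer": ("Crash/Stability", 3.0),
    "memory": ("Memory Management", 1.0),
    "crash": ("Crash/Stability", 1.0),
}

def score_categories(report: str) -> dict:
    """Accumulate a weighted score per category for every matched phrase."""
    text = report.lower()
    scores: dict = {}
    for phrase, (category, weight) in WEIGHTED_KEYWORDS.items():
        if phrase in text:
            scores[category] = scores.get(category, 0.0) + weight
    return scores

def classify(report: str, default="Uncategorized") -> str:
    """Pick the highest-scoring category, or a default when nothing matches."""
    scores = score_categories(report)
    return max(scores, key=scores.get) if scores else default

print(classify("App shows a memory leak after a crash"))
```

Here "memory leak" (weight 3.0) plus "memory" (1.0) outscores the single "crash" match, so the multi-word phrase dominates the decision, which is the point of the refinement.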
Category Occurrences:
1. Data Preparation
• Convert text into numerical embeddings using BERT/SBERT to represent each bug
report as a vector.
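The embedding step can be sketched with the encoder kept pluggable: the SBERT usage in the docstring is the assumed real path (via the sentence-transformers package), while a toy character-frequency encoder lets the sketch run offline.

```python
import numpy as np

def embed_reports(reports, encode):
    """Turn bug reports into an (n_reports, dim) float matrix.

    `encode` is any sentence-encoder callable; with SBERT one could use
    (assumed, not run here):
        from sentence_transformers import SentenceTransformer
        encode = SentenceTransformer("all-MiniLM-L6-v2").encode
    """
    return np.asarray(encode(reports), dtype=np.float32)

# Stand-in encoder (character-frequency vectors) so the sketch runs offline.
def toy_encode(texts, dim=64):
    out = np.zeros((len(texts), dim), dtype=np.float32)
    for i, t in enumerate(texts):
        for ch in t.lower():
            out[i, ord(ch) % dim] += 1.0
    return out

vectors = embed_reports(["crash on startup", "memory leak on upload"], toy_encode)
print(vectors.shape)
```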
Contrastive learning requires paired samples to help the model learn relationships between
different bug reports:
Example:
This helps the model learn similarities within a category and differences across categories.
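Pair construction can be sketched directly from the labeled reports: any two reports sharing a category form a positive pair, and any two from different categories form a negative pair. The sample reports and labels below are illustrative.

```python
import itertools
import random

def make_pairs(reports, labels, seed=0):
    """Return (report_a, report_b, is_same_category) training pairs:
    positives share a category, negatives come from different categories."""
    rng = random.Random(seed)
    pairs = [
        (ra, rb, int(la == lb))
        for (ra, la), (rb, lb) in itertools.combinations(zip(reports, labels), 2)
    ]
    rng.shuffle(pairs)
    return pairs

# Illustrative labeled reports (two categories, two reports each).
reports = ["crash at boot", "segfault on exit", "memory leak", "heap growth"]
labels = ["crash", "crash", "memory", "memory"]
pairs = make_pairs(reports, labels)
print(len(pairs))
```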
• Define a Siamese Network that processes pairs of bug reports and generates their
embeddings.
• Train the network with a contrastive objective such as Triplet Loss, which pulls
embeddings from the same category closer together and pushes embeddings from
different categories farther apart.
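The training objective can be illustrated with a NumPy sketch of triplet loss over single embeddings (a real training loop would batch this inside the Siamese Network; the margin value is an assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the anchor toward the same-category (positive)
    embedding and push it from the different-category (negative) one.

    L = max(0, ||a - p||^2 - ||a - n||^2 + margin)
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same category: already close -> zero loss
n = np.array([2.0, 0.0])   # different category: already far
print(triplet_loss(a, p, n))  # d_pos=0.01, d_neg=4.0 -> max(0, 0.01-4.0+1.0) = 0.0
```

Swapping the roles (a far "positive", a near "negative") yields a positive loss, which is the gradient signal that reshapes the embedding space.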
• Once the Siamese Network is trained, use it to generate vector embeddings for all
bug reports.
• These embeddings are used as input features for classification instead of raw text.
• Use the embeddings from the Siamese Network to train a Multi-Layer Perceptron
(MLP) classifier.
o Its output layer predicts the bug category using Cross-Entropy Loss.
• Train the classifier using Adam optimizer and evaluate performance on a test set.
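The classifier step can be sketched with scikit-learn's MLPClassifier, which trains with the Adam solver and minimizes log-loss (cross-entropy) for classification. The synthetic clusters below stand in for the learned Siamese embeddings; the hidden-layer size and split ratio are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-ins for Siamese embeddings: two well-separated clusters,
# one per bug category (real inputs would be the learned embeddings).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 8)), rng.normal(2, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# MLP trained with the Adam optimizer; sklearn minimizes cross-entropy
# (log-loss) as the classification objective.
clf = MLPClassifier(hidden_layer_sizes=(32,), solver="adam", max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(accuracy)
```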
The performance of the classification model is measured using standard classification
metrics.
• Hyperparameter tuning: Optimizes learning rate, batch size, and network depth for
improved performance.