
Identifying Bug Types & Severity in Open-Source Code

Ishir Bhardwaj (2022223)   Manit Kaushik (2022277)   Pranav Gupta (2022364)   Raghav Wadhwa (2022385)

1. Abstract

This report presents the outcomes of the CSE363 Machine Learning group project, guided by Prof. Jainendra Shukla. The project develops a machine learning system to predict bug severity and classify issue types, automating bug management to address challenges in large projects and to enhance workflow efficiency through supervised and unsupervised learning methods. Github repository for this project: Link

2. Introduction

The problem addressed in this project is the manual and time-consuming process of bug severity prediction and issue classification in software development. Using machine learning, the goal is to predict both the severity and the issue type of a software issue from its bug report, in a supervised learning context. Additionally, the project explores unsupervised learning methods to categorize bug domains such as memory, security, and GUI.

Figure 1. Project Flowchart

3. Literature Review

3.1. Not all bugs are the same: Understanding, characterizing, and classifying bug types

• Goal: Develop a taxonomy of bug types and create an automated model to classify bugs based on this taxonomy.

• Dataset: 1280 manually classified bug reports from 119 software projects belonging to ecosystems such as Mozilla, Apache and Eclipse.

• Features: Textual descriptions extracted from bug reports, including details such as error messages, file names, and system components; TF-IDF was then used to identify relevant terms.

• Method: Logistic Regression classifier with TF-IDF features derived from bug report summaries.

• Result: Identified 9 bug types; the model achieved 64% F-Measure and 74% AUC-ROC.

3.2. Machine Learning Approaches for Predicting the Severity Level of Software Bug Reports in Closed Source Projects

• Goal: Build prediction models to determine the severity class (severe or non-severe) of reported bugs.

• Dataset: Bug reports extracted from the JIRA bug tracking system used by the INTIX company, containing bug IDs and descriptions.

• Features: Each bug report is transformed into a Bag-of-Words feature vector using tokenization, stop-word removal, and stemming.

• Method: Naive Bayes, Multinomial Naive Bayes, Support Vector Machine (SVM), Decision Tree (J48), Random Forest, Logistic Model Trees (LMT), Decision Rules (JRip), and KNN.

• Result: The LMT algorithm reported the best performance, with Accuracy = 86.31%, AUC = 0.90, and F-measure = 0.91.
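Both surveyed approaches reduce to the same recipe: vectorize the report text (TF-IDF or Bag-of-Words) and train a standard classifier on the result. The sketch below illustrates that recipe with scikit-learn on toy data; it is not the papers' actual experimental setup, and the sample summaries and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy bug-report summaries with invented severe(1)/non-severe(0) labels.
summaries = [
    "application crashes with segmentation fault on startup",
    "null pointer exception corrupts saved user data",
    "tooltip text slightly misaligned in dark mode",
    "typo in settings menu label",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()             # summaries -> sparse TF-IDF matrix
X = vectorizer.fit_transform(summaries)
clf = LogisticRegression().fit(X, labels)  # linear classifier over TF-IDF terms

new_report = ["browser crashes when opening a large file"]
print(clf.predict(vectorizer.transform(new_report)))  # expected: [1]; 'crashes' signals severity
```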

4. Supervised Learning

This section covers two tasks: classifying bug severity and classifying issue type.

4.1. Dataset & EDA for Classifying Bug Severity

The bug reports from Eclipse & Mozilla include a severity field with five levels: blocker, critical, major, minor & trivial. Each bug report also has an ID and a description of the issue.

Figure 2. EDA Graphs for Bug Severity Dataset

4.2. Dataset & EDA for Classifying Issue Type

The issue reports in the JIRA bug tracking system carry multiple categories for 'Bug Type'. For our project, we grouped these categories into three broader issue types: Defect, Improvement & Task.

Figure 3. EDA Graphs for Issue Type Dataset

4.3. Data Pre-processing for Supervised Learning

The NLP preprocessing steps applied to both datasets included tokenization to split text into individual words or tokens, lowercasing to ensure uniformity, and the removal of stop words like "the" and "is" to reduce noise. Additionally, non-alphabetic characters were removed and lemmatization was performed to reduce words to their base forms, such as converting "running" to "run".

Example: "Stack overflow with namespace aliases" becomes "stack overflow namespace alias".
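A minimal sketch of this preprocessing chain using NLTK (the report does not name the library used, so this is one plausible implementation; it assumes the standard NLTK resources are available):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required resources ("punkt_tab" on newer NLTK).
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    tokens = word_tokenize(text.lower())                 # tokenize + lowercase
    tokens = [t for t in tokens if t.isalpha()]          # drop non-alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # reduce to base forms

print(preprocess("Stack overflow with namespace aliases"))
# -> "stack overflow namespace alias"
```

The final print reproduces the example transformation above: "with" is dropped as a stop word, and "aliases" is lemmatized to "alias".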
4.4. Methodology for Predicting Bug Severity

After preprocessing, features are extracted using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, which measures a word's importance within a document relative to the corpus. We included both unigrams and bigrams in the TF-IDF vectorization to capture single words as well as meaningful word pairs.

After TF-IDF vectorization, Latent Semantic Analysis (LSA) is performed using Truncated Singular Value Decomposition (SVD) to reduce the feature space to 1000 components, minimizing noise while preserving essential information. The refined features are used as input for several machine learning models: a multinomial Logistic Regression with the lbfgs solver and 1000 maximum iterations; a Decision Tree and a Random Forest, both with a maximum depth of 100; and a Multilayer Perceptron with two hidden layers (100 and 50 units), 1000 maximum iterations, tanh activation, and an sgd solver. Additionally, an Ensemble Voting classifier combines the predictions of all models through majority voting.
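A condensed scikit-learn sketch of this pipeline, using the hyperparameters stated above; the toy corpus and labels are placeholders, and the SVD component count is capped only so the snippet runs on small inputs (the project uses the full 1000 components):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

# Placeholder corpus; in the project this is the full set of preprocessed
# bug descriptions and their severity labels.
texts = [
    "stack overflow namespace alias",
    "crash startup null pointer dereference",
    "tooltip misaligned dark mode",
    "build failure missing linker symbol",
]
severities = ["critical", "blocker", "trivial", "major"]

# TF-IDF over unigrams and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)

# LSA via truncated SVD; the report uses 1000 components (capped here only
# so the toy corpus runs).
n_comp = min(1000, X_tfidf.shape[0] - 1, X_tfidf.shape[1] - 1)
X = TruncatedSVD(n_components=n_comp, random_state=42).fit_transform(X_tfidf)

estimators = [
    ("lr", LogisticRegression(solver="lbfgs", max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=100)),
    ("rf", RandomForestClassifier(max_depth=100)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(100, 50), activation="tanh",
                          solver="sgd", max_iter=1000)),
]
# Majority-vote ensemble over all four models.
ensemble = VotingClassifier(estimators=estimators, voting="hard").fit(X, severities)
```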
4.5. Analysis & Conclusion for Predicting Bug Severity

The results indicate that Logistic Regression and Ensemble Voting both achieved the highest accuracy of 0.60. Decision Tree performed the worst with an accuracy of 0.44, while Random Forest and Multilayer Perceptron (MLP) performed similarly with accuracies of 0.57 and 0.59, respectively. In terms of the weighted F1-score, MLP showed strong performance with a score of 0.58, outperforming the other models. The macro F1-scores were generally lower, with the ensemble model achieving a score of 0.44, reflecting the overall balance in model performance.

Class-level prediction for the major and minor classes is relatively strong, while prediction for the blocker class remains below average across all models. The Multilayer Perceptron (MLP) model achieves a notably higher F1-score across all classes than the other models, demonstrating more consistent performance.

Table 1. Performance Metrics for Bug Severity Models

Model                   Macro F1   Weighted F1   Accuracy
Logistic Regression     0.44       0.57          0.60
Decision Tree           0.34       0.44          0.44
Random Forest           0.39       0.53          0.57
Multilayer Perceptron   0.48       0.58          0.59
Ensemble Voting         0.44       0.56          0.60
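The macro and weighted F1-scores reported here are the standard scikit-learn aggregations: macro averages the per-class F1 equally, while weighted averages by class support. A toy illustration with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true vs. predicted severity labels, for illustration only.
y_true = ["major", "minor", "blocker", "major", "minor", "trivial"]
y_pred = ["major", "minor", "major", "major", "minor", "minor"]

print("accuracy:   ", accuracy_score(y_true, y_pred))
print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))     # classes weighted equally
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))  # weighted by class support
```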

Figure 4. Per Class Heatmap for Severity Classification

4.6. Analysis & Conclusion for Classifying Issue Type

The overall performance metrics, including accuracy, macro F1-score, and weighted F1-score, highlight the strong performance of Logistic Regression and MLP, both achieving high macro and weighted F1-scores of 0.60 and 0.71 and accuracies of 0.72 and 0.73, respectively. Majority Voting showed moderate performance with a weighted F1-score of 0.68, while Decision Tree and Random Forest performed lower, particularly the Decision Tree, which struggled with class balance.

Table 2. Performance Metrics for Issue Type Models

Model                   Macro F1   Weighted F1   Accuracy
Logistic Regression     0.60       0.72          0.71
Decision Tree           0.48       0.58          0.58
Random Forest           0.50       0.68          0.53
Multilayer Perceptron   0.60       0.73          0.71
Ensemble Voting         0.56       0.71          0.68

Figure 5. Per Class Heatmap for Issue Type Classification

All models perform decently well for the defect class but below average for the task class. The improvement class lies in the middle, with moderate F1-scores, suggesting partial success in distinguishing these instances but leaving room for better class-specific predictions.

5. Unsupervised Learning

In this section, we try to cluster the bug reports into bug domains using unsupervised learning techniques.

5.1. Dataset & Data Pre-processing

For this task, we used the DeepTriage dataset, which contains bug reports from Google Chromium, Mozilla Core, & Mozilla Firefox, each including the bug ID, title, and description. After removing duplicates and null values, the total number of samples came out to be 116,371.

The descriptions of the bug reports are used as the unlabelled dataset. The pre-processing steps were the same as before, i.e. Tokenization, Lowercasing, Stopword Removal, Non-Alphabetic Character Removal & Lemmatization.
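A sketch of this filtering step with pandas; the file and column names are assumptions for illustration, not the dataset's actual schema:

```python
import pandas as pd

# Load the combined DeepTriage reports; "bug_id", "title", and "description"
# are assumed column names, and the file name is illustrative.
df = pd.read_csv("deeptriage_bug_reports.csv")

df = df.drop_duplicates(subset=["title", "description"])  # remove duplicate reports
df = df.dropna(subset=["title", "description"])           # drop null values

print(len(df))  # 116,371 at this point, per the report
descriptions = df["description"].tolist()  # unlabelled corpus for clustering
```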
5.2. Methodology for Clustering

After pre-processing the bug descriptions, 5000 features per sample are created using TF-IDF Vectorization. The resulting feature space is then fed to the following unsupervised paradigms (a combined code sketch follows the list):

1. K-Means Clustering: The optimal number of clusters, k, was determined using the Elbow and Knee methods. A word cloud was then generated for each cluster to visually represent the most frequent terms within it.

2. Gaussian Mixture Model (GMM): Clusters were created using the GMM approach, with the number of clusters (k) obtained from the previous step. Each data point was assigned to the cluster with the highest posterior probability. The clustering results were visualized in both 2D and 3D using Principal Component Analysis (PCA). For computational efficiency, 25% of the dataset was sampled.
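A compact sketch of both paradigms on a placeholder corpus: the elbow scan over k, K-Means fitting, GMM assignment by highest posterior probability, and PCA projections for the 2D/3D plots. Variable names and the toy descriptions are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

# Placeholder corpus; the project uses ~116k preprocessed bug descriptions
# (and samples 25% of them for the GMM step, for computational efficiency).
descriptions = [
    "stack overflow namespace alias",
    "crash build test failed valgrind exception",
    "linux useragent intel macintosh gecko window",
    "null missing logging run compilation",
]

# 5000 TF-IDF features per sample, as described above.
X = TfidfVectorizer(max_features=5000).fit_transform(descriptions)

# Elbow method: scan k and record the within-cluster sum of squares (inertia).
wcss = []
for k in range(1, len(descriptions) + 1):
    wcss.append(KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_)
# Plotting wcss vs. k and locating the knee gave k = 7 in the report.

best_k = 2  # stand-in for the toy corpus; the report's elbow analysis found k = 7
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)

# GMM over a dense view of the features; each point goes to the cluster
# with the highest posterior probability (equivalent to gmm.predict).
X_dense = X.toarray()
gmm = GaussianMixture(n_components=best_k, random_state=42).fit(X_dense)
gmm_labels = gmm.predict_proba(X_dense).argmax(axis=1)

# PCA projections for the 2D and 3D cluster visualizations.
coords_2d = PCA(n_components=2).fit_transform(X_dense)
coords_3d = PCA(n_components=3).fit_transform(X_dense)
```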

5.3. Analysis & Conclusion for K-Means Clustering

In the Within-Cluster Sum of Squares (WCSS) vs. k graph, the 'elbow' point occurs at k=7, which indicates the optimal number of clusters; this is the point where the rate of decrease in WCSS slows significantly.

Labels for the clusters are manually assigned based on insights derived from their respective word clouds. For instance, one cluster had terms like 'linux', 'useragent', 'intel', 'macintosh', 'mac', 'window', and 'gecko', which point to compatibility or configuration issues across various operating systems and hardware environments; it was labeled OS Related. Another cluster included words such as 'crash', 'build', 'run', 'test', 'logging', 'failed', 'valgrind', 'null', 'missing', and 'exception', reflecting errors encountered during the build and compilation processes; it was labeled Compilation Errors. These clusters are visualized below (a word-cloud sketch follows the figures):

Figure 6. Cluster 1: OS Related

Figure 7. Cluster 2: Compilation Errors
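The per-cluster word clouds behind this labeling can be generated with the wordcloud package; a minimal sketch, reusing the (illustrative) descriptions and kmeans_labels from the clustering sketch above:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Build one word cloud per K-Means cluster from the concatenated
# descriptions assigned to it (descriptions / kmeans_labels as above).
for cluster_id in sorted(set(kmeans_labels)):
    text = " ".join(doc for doc, label in zip(descriptions, kmeans_labels)
                    if label == cluster_id)
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {cluster_id}")
plt.show()
```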
5.4. Analysis & Conclusion for Gaussian Mixture Model

The Gaussian Mixture Model (GMM) clustering reveals an imbalanced distribution across the 7 clusters, with two dominant groups containing the majority of the data points, while the remaining clusters are more sparse. This indicates that a central pattern defines most of the data, while the other clusters represent less frequent, niche occurrences.

Figure 8. 2D & 3D Visualization

This imbalance in cluster distribution is mirrored in the visualizations, where the dominant groups stand out clearly, but overlapping regions hint at the complexities within the smaller clusters. Both the 2D and 3D visualizations demonstrate successful clustering, with data points generally well-separated. However, some overlap between clusters suggests underlying similarities or noise. This overlap may decrease in higher dimensions, where cluster separation could become more pronounced. As the current dimensionality may not fully capture the data's complexity, further analysis in higher dimensions could enhance the clarity of cluster distinctions.

6. Learnings & Contributions

The team learned data preprocessing techniques (e.g., tokenization, TF-IDF), applied supervised algorithms (e.g., Logistic Regression), and evaluated models using metrics such as accuracy and F1-score. Challenges included data imbalance, model generalization, and preprocessing unstructured text data, which required fine-tuning and optimization for bug severity and issue classification. Contributions of each member:

• Ishir Bhardwaj: Unsupervised Learning & project report
• Manit Kaushik: Supervised Learning & project presentation
• Pranav Gupta: Unsupervised Learning & project report
• Raghav Wadhwa: Supervised Learning & project presentation

7. References

[1] Baarah, A., Al-oqaily, A., Salah, Z., Sallam, M., & Al-qaisy, M. (2019). Machine Learning Approaches for Predicting the Severity Level of Software Bug Reports in Closed Source Projects. IJACSA.

[2] Tan, Y., Xu, S., Wang, Z., Zhang, T., Xu, Z., & Luo, X. (2020). Bug Severity Prediction Using Question-and-Answer Pairs from Stack Overflow. Journal of Systems and Software, 110567.

[3] Catolino, G., Palomba, F., Zaidman, A., & Ferrucci, F. (2019). Not All Bugs Are the Same: Understanding, Characterizing, and Classifying Bug Types. Journal of Systems and Software.

[4] Mani, S., Sankaran, A., & Aralikatte, R. (IBM Research, India). DeepTriage: Exploring the Effectiveness of Deep Learning for Bug Triaging.

[5] Lamkanfi, A., Perez, J., & Demeyer, S. (2013). The Eclipse and Mozilla Defect Tracking Dataset: A Genuine Dataset for Mining Bug Information. Proceedings of the 10th Working Conference on Mining Software Repositories (MSR '13).

[6] Ahmed, H. A., Bawany, N. Z., & Shamsi, J. A. (n.d.). CaPBug: A Framework for Automatic Bug Categorization and Prioritization Using NLP and Machine Learning Algorithms.
