Software Defect Detection Using Machine Learning
By
Guide
CERTIFICATE
This is to certify that the project entitled “Software Defect Detection Using Machine
Learning”, which is being submitted to Dr. Babasaheb Ambedkar Technological University,
Lonere, in partial fulfillment of the requirements for the award of B.Tech, is the result
of the work completed by Siddhi Manoj Chaudhari, Om Pramod Sonawane, Rajeshwar Ravindra
Swami, and Neha Vikas Sonawane under my supervision and guidance within the
four walls of the institute during the academic year 2024-25, and the same has not been
submitted elsewhere for the award of any degree.
Date:
Place: Jalgaon
Dr. S. R. Sugandhi
Principal Examiner
Declaration
We hereby declare that the project entitled “Software Defect Detection Using Machine
Learning” is carried out and written by us under the guidance of Prof. A. Y. Suryawanshi,
HOD of Computer Engineering, Khandesh College Education Society’s College of Engi-
neering and Management, Jalgaon. This work has not previously formed the basis for the
award of any degree, diploma, or certificate, nor has it been submitted elsewhere for
the award of any degree, diploma, or certificate.
Acknowledgement
We would like to thank our guide Prof. A. Y. Suryawanshi for his support and subtle
guidance. We also thank the Head of the Computer Engineering Department, Prof.
A. Y. Suryawanshi, for his valuable guidance, and the Principal of K.C.E.S.’s C.o.E.M.,
Jalgaon, for his support. We would like to thank all faculty members of the Computer
Engineering Department and all our friends for their cooperation and support.
We also thank our families for their moral support and encouragement in fulfilling our goals.
Lastly, all thanks belong to the Almighty for his blessings.
Abstract
Traditional software reliability growth models only consider defect discovery data, yet the
practical concern of software engineers is the removal of these defects. Most attempts to
model the relationship between defect discovery and resolution have been restricted to
differential equation-based models associated with these two activities. However, defect
tracking databases offer a practical source of information on the defect lifecycle suitable
for more complete reliability and performance models. Software engineering, as a branch
of computer science, is concerned with building system software that closely matches the
requirements of its users. We have selected seven distinct machine learning algorithms
and test them using datasets acquired from the NASA public PROMISE repositories. The
results of our project enable users of this software to isolate defects by selecting the
most efficient of the given algorithms for their respective tasks, yielding effective results.
Keywords: Software quality metrics, Software defect prediction, Software fault predic-
tion, Machine learning algorithms.
Contents
Certificate i
Declaration ii
Acknowledgement iii
Abstract iv
List of Figures 1
1 INTRODUCTION 2
1.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Proposed System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Previously Existing System . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Report Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 LITERATURE SURVEY 7
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Presently Available System . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 SYSTEM REQUIREMENT 13
3.1 Software Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Non-Functional Requirements . . . . . . . . . . . . . . . . . . . . 13
3.2 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 SYSTEM DESIGN 15
4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Data Flow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.1 DFD Level-0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.2 DFD Level-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.3 DFD Level-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 ER Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 UML Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4.1 Structural Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4.2 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4.3 Object Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4.4 Component Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4.5 Deployment Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Behavioral Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.5.1 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.5.2 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5.3 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.5.4 State Machine Diagram . . . . . . . . . . . . . . . . . . . . . . . 28
4.5.5 Communication Diagram . . . . . . . . . . . . . . . . . . . . . . . 29
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 IMPLEMENTATION 31
5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1.1 Frontend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1.2 System Setup and Implementation Details . . . . . . . . . . . . . 31
5.1.3 Dataset Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1.4 Supabase Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.5 Machine Learning Model . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.6 Authentication Flow . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.7 Data Storage and Access Control . . . . . . . . . . . . . . . . . . 34
5.1.8 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Testing Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.3 Test Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Costing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
References 51
List of Tables
5.1 Software Metrics Description . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Test Cases for Major Functionalities . . . . . . . . . . . . . . . . . . . . 38
5.3 Cost Estimation Using Software Metrics . . . . . . . . . . . . . . . . . . 40
5.4 Component Costing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Figures
4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 DFD Level-0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 DFD Level-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 DFD Level-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.5 ER Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.6 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.7 Object Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.8 Component Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.9 Deployment Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.10 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.11 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.12 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.13 State Machine Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.14 Communication Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chapter 1
INTRODUCTION
Software defect detection using machine learning is a technique where algorithms
are trained to identify bugs or errors in code. Instead of manually checking the code,
the machine learning model analyses patterns from past data, such as code features and
previous defects, to predict where new bugs might occur, helping developers find and fix
issues more efficiently.
Software defects are anomalies or errors in computer software that cause it to behave
unexpectedly or incorrectly. Detecting these defects early in the development process can
significantly reduce costs, improve software quality, and enhance user satisfaction. Tradi-
tional testing methods, such as manual testing and static code analysis, have limitations
in terms of efficiency and coverage. To address these challenges, machine learning has
emerged as a promising approach for software defect detection.
Machine learning algorithms can analyze vast amounts of data, identify patterns,
and make predictions with high accuracy. By leveraging historical data on software
defects, machine learning models can learn to recognize characteristics associated with
defective code, enabling proactive defect detection.
This project aims to explore the application of machine learning techniques for soft-
ware defect detection. By analyzing various factors such as code metrics, commit history,
and test results, we will develop a machine learning model capable of accurately
predicting the likelihood of code defects. This model can be integrated into the software
development process to assist developers in identifying potential issues early on, thereby
improving overall software quality.
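To make this concrete, the sketch below trains a classifier on historical module metrics and asks it about a new module. It is a minimal illustration, assuming scikit-learn and a Decision Tree (one of the algorithms this report uses); the feature choices and metric values are invented for demonstration, not taken from the project's data.

```python
# Minimal sketch: learn from historical code metrics, then predict for a new module.
# Assumes scikit-learn; all numbers below are illustrative, not real project data.
from sklearn.tree import DecisionTreeClassifier

# Each row: [lines_of_code, cyclomatic_complexity, total_operators]; label 1 = defective.
X_train = [[120, 14, 310], [45, 3, 90], [300, 27, 800], [60, 5, 150]]
y_train = [1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Ask whether a new module is likely to be defective.
print(model.predict([[200, 18, 500]]))  # -> [1], i.e. likely defective
```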
This project is a Next.js application that implements a machine learning-based soft-
ware defect detection system. It uses Supabase for authentication and data storage.
1.1 Objective
• Reduce effort and time taken for defect detection.
• Lower costs associated with software testing and bug fixes.
• Improve software quality by early defect prediction.
• Apply machine learning algorithms to both train and test defect prediction
models.
1.5 Scope
This project focuses on improving software defect prediction using machine learning tech-
niques. The project will involve collecting and analyzing historical software defect data,
selecting appropriate machine learning algorithms, and evaluating the model’s perfor-
mance. While the focus is on enhancing defect prediction, the project will not include
real-time monitoring or the development of an extensive user interface, prioritizing model
accuracy and effectiveness instead. The scope includes:
• Data Collection: Gathering historical defect data and relevant software metrics
from multiple projects to train and validate the machine learning model. This data
will be used to identify patterns and features associated with software defects.
• Model Development: Creating and training machine learning models to predict
defects. This involves selecting appropriate algorithms, tuning parameters, and
validating the model’s performance using various evaluation metrics (a brief
tuning sketch follows this list).
• Implementation and Testing: Implementing the developed model in a testing
environment to evaluate its effectiveness in predicting defects. This will include
comparing the model’s predictions with actual defect occurrences to assess accuracy
and reliability.
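As referenced in the Model Development item above, parameter tuning and validation can be organized as a cross-validated grid search. The sketch below assumes scikit-learn; the parameter grid, scoring choice, and toy data are illustrative assumptions rather than the project's actual configuration.

```python
# Sketch: tune SVM parameters with a cross-validated grid search (scikit-learn).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy metric data: [lines_of_code, cyclomatic_complexity]; label 1 = defective.
X = [[120, 14], [45, 3], [300, 27], [60, 5], [150, 9], [80, 6]]
y = [1, 0, 1, 0, 1, 0]

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=3,          # 3-fold cross-validation on this tiny example
    scoring="f1",  # F1 is a common choice since defect data is often imbalanced
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```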
1.6.1 Advantages
• Early Defect Detection: Machine learning models can identify potential defects
early in the development process, reducing the cost of remediation and improving
overall software quality.
• Improved Accuracy: Machine learning algorithms can learn complex patterns in
code that may be difficult for human testers to detect, leading to higher accuracy
in defect prediction.
• Scalability: Machine learning models can handle large codebases and can be easily
scaled to accommodate growing projects.
• Automation: Machine learning can automate the process of defect detection, re-
ducing the workload for developers and testers.
• Continuous Improvement: Machine learning models can learn from new data
over time, improving their accuracy and adapting to changes in the code base.
1.6.2 Disadvantages
• Data Dependency: The performance of machine learning models is highly depen-
dent on the quality and quantity of the training data. Insufficient or biased data
can lead to inaccurate predictions.
• Complexity: Implementing and maintaining machine learning models can be com-
plex, requiring specialized knowledge and skills.
• Interpretability: Machine learning models can be difficult to interpret, making it
challenging to understand why a particular prediction was made. This can hinder
debugging and troubleshooting efforts.
• False Positives and Negatives: Machine learning models may produce false
positives (predicting a defect where none exists) or false negatives (failing to predict
a defect that does exist). This can lead to wasted effort or missed defects.
• Overfitting: Machine learning models can become overfitted to the training data,
leading to poor performance on new, unseen data. This can be mitigated through
techniques like cross-validation and regularization.
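To illustrate the mitigation mentioned in the last point, the sketch below evaluates an SVM under different regularization strengths with k-fold cross-validation; a model that only memorizes the training data will score poorly across folds. Data and parameter values are illustrative assumptions.

```python
# Sketch: k-fold cross-validation across SVM regularization strengths (scikit-learn).
# A smaller C means stronger regularization, which can reduce overfitting.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = [[120, 14], [45, 3], [300, 27], [60, 5], [150, 9], [80, 6], [210, 20], [30, 2]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

for C in (0.1, 1.0, 10.0):
    scores = cross_val_score(SVC(C=C, kernel="rbf"), X, y, cv=4)
    print(f"C={C}: mean accuracy across folds = {scores.mean():.2f}")
```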
1.8 Summary
This chapter introduces the project on “Software Defect Detection Using Machine
Learning,” covering its objective to enhance defect detection accuracy using machine
learning. It outlines the problem of inefficient traditional methods, proposes an auto-
mated ML-based solution, and contrasts it with past approaches. The chapter discusses
the project’s scope, key advantages, and limitations, and concludes with an outline of the
report’s structure for reader guidance.
Chapter 2
LITERATURE SURVEY
profile.
• Paper 4: Software Defect Prediction Using Machine Learning Techniques
(2018)
• Author: C. Lakshmi Prabha (Computer Science and Engineering, Thiagarajar
College of Engineering, Madurai, India; lakshmiprabha@student.tce.edu) and
Dr. N. Shivkumar (Computer Science and Engineering, Thiagarajar College of
Engineering, Madurai)
• Abstract: Software defect prediction provides development groups with observable
outcomes while contributing to industrial results; predicting defective code areas
can help developers identify bugs and organize their test activities. The percentage
of classifications providing the proper prediction is essential for early identification.
Moreover, software defect data sets are at best only partially understood because of
their enormous dimensionality. This problem is handled by a hybridized approach
that combines PCA, random forest, naïve Bayes, and SVM in a software framework;
five datasets (PC3, MW1, KC1, PC4, and CM1) are analyzed using the WEKA
simulation tool. A systematic research analysis is conducted in which parameters
such as the confusion matrix, precision, recall, and recognition accuracy are measured
and compared with the prevailing schemes. The analysis indicates that the proposed
approach provides more useful solutions for defect prediction. The paper also
highlights the use of cloud services, such as Firebase and AWS, for building scalable
mobile applications, focusing on how real-time databases, user authentication, and
cloud storage are vital for managing large volumes of data efficiently. In particular,
Firebase’s real-time database offers seamless synchronization across all devices,
making it an ideal choice for mobile applications that require constant data updates.
• Paper 5: BRACE: Cloud-based Software Reliability Assurance (2017)
• Author: Kazuhira Okumoto, Abhaya Asthana, Rashid Mijumbi (Nokia, Dublin).
• Abstract: The evolution towards virtualized network functions (VNFs) is expected
to enable service agility within the telecommunications industry. To this end, the
software (or VNFs) from which such services are composed must be developed and
delivered over very short time scales. In order to guarantee the required levels of
software quality within such tight schedules, software reliability tools must evolve.
In particular, the tools should provide development teams spread across geography
and time with reliable and actionable insights regarding the development process.
In this paper, they present BRACE, a cloud-based, integrated, one-stop center for
software tools. BRACE is home to tools for software reliability modeling, testing,
and defect analysis, each of which is provided as-a-service to development teams.
The initial implementation of BRACE includes a software reliability growth modeling
(SRGM) tool. The SRGM tool is currently being used to enable real-time prediction
of the total number of defects in software being developed, and to provide the
analytics and metrics that enable managers to make informed decisions regarding
allocation for defect correction so as to meet set deadlines.
• Paper 6: Connecting Software Reliability Growth Models to Software
Defect Tracking (2017)
• Author: Esra Var, Ying Shi
• Abstract: Traditional software reliability growth models only consider defect dis-
covery data, yet the practical concern of software engineers is the removal of these
defects. Most attempts to model the relationship between defect discovery and res-
olution have been restricted to differential equation-based models associated with
these two activities. However, defect tracking databases offer a practical source of
information on the defect lifecycle suitable for more complete reliability and perfor-
mance models. This paper explicitly connects software reliability growth models to
software defect tracking. Data from a NASA project has been employed to develop
differential equation-based models of defect discovery and resolution as well as dis-
tributional and Markovian models of defect resolution. The states of the Markov
model represent thirteen unique stages of the NASA software defect lifecycle. Both
transition probabilities and transition time distributions are computed from the de-
fect database. Illustrations compare the predictive and computational performance
of alternative approaches. The results suggest that the simple distributional ap-
proach achieves the best trade-off between these two performance measures, but that
enhanced data collection practices could improve the utility of the more advanced
approaches and the inferences they enable.
• Paper 7: Study on Software Defect Prediction Model Based on Improved
BP Algorithm (2019)
• Author: Cundong Tang, Li Chen, Zhiping Wang, Yuzhou Sima.
• Abstract: Software defects are an important indicator for evaluating software
product quality. Reducing the defects of software products and improving software
quality is always the goal of software development. This paper combines the
simulated annealing (SA) algorithm and JCUDA technology to improve the BP
algorithm, in order to build an improved software defect prediction model with
higher prediction accuracy. The experimental results show that the software defect
prediction model based on the improved BP algorithm is able to accurately predict
software defects and outperforms the traditional BP algorithm.
2.3 Summary
This chapter discussed the background history and related work on software defect
detection. The next chapter will introduce the system analysis of the project requirements.
Chapter 3
SYSTEM REQUIREMENT
The software quality attributes encompass several key areas. Adaptability ensures
the software is accessible and usable by all users, while availability guarantees it is
freely and easily accessible to everyone. Maintainability is another critical factor: if
any issues arise post-deployment, they can be easily resolved by the developer. The
system’s strong reliability further enhances user trust by maintaining high performance
standards, and user friendliness is ensured through a GUI, which makes interacting
with the software intuitive and accessible.
Integrity and Security are also prioritized, with access control features to prevent
unauthorized data access and multiple authentication phases to secure user data. Finally,
Testability is considered, ensuring that the software undergoes comprehensive testing to
meet quality and performance standards across all functional aspects.
3.3 Summary
In this chapter, the project requirements of the system were discussed: the functional
and non-functional requirements and the hardware requirements. The next chapter will
introduce the detailed design of the system.
Chapter 4
SYSTEM DESIGN
4.1 System Architecture
As shown in Fig. [4.1], this architecture diagram illustrates the process flow of the
“Software Defect Detection Using Machine Learning” system. The user inputs software
attributes into the system, which triggers a series of processing steps within the Software
Defect Identifier System. The process begins with a pre-trained dataset containing text-
based software attributes and historical defect data. First, the input data undergoes a
“Processing” phase, where raw information is refined for further analysis. Following this,
the system performs “Feature Extraction,” identifying key attributes from the input data
that are essential for detecting defects.
Once features are extracted, they are fed into the “SVM Algorithm,” which acts as
the core of the defect detection model. The Support Vector Machine (SVM) algorithm
evaluates the input attributes based on the learned patterns from the training dataset,
determining if the software contains potential defects.
Finally, the system outputs the results, which indicate whether the software is de-
fected or not, and this outcome is displayed to the user. The entire workflow allows for an
automated approach to identifying software defects based on machine learning principles,
specifically using SVM as the classification algorithm.
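The flow described above can be sketched as a small scikit-learn pipeline: a preprocessing stage (feature scaling) feeding an SVM classifier whose output is mapped to a defect verdict. The feature names and training values here are illustrative assumptions, not the system's trained model.

```python
# Sketch of the architecture's flow: preprocessing -> SVM -> defect verdict.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative training data: [lines_of_code, cyclomatic_complexity, total_operators].
X_train = [[120, 14, 310], [45, 3, 90], [300, 27, 800], [60, 5, 150]]
y_train = [1, 0, 1, 0]  # 1 = defective, 0 = non-defective

pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pipeline.fit(X_train, y_train)

attributes = [[200, 18, 500]]  # software attributes supplied by the user
verdict = "Defect" if pipeline.predict(attributes)[0] == 1 else "Non-defect"
print(verdict)
```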
As shown in Fig. [4.2], this Data Flow Diagram (DFD) Level 0 provides a high-
level overview of the “Software Defect Detection Using Machine Learning” system. The
diagram illustrates the three main components of the system: User, Processing, and
Detection.
The process begins with the “User” providing input, typically in the form of software
attributes or data relevant to defect detection. This input is directed to the “Processing”
module, where essential steps such as data cleaning, preprocessing, and feature extraction
occur. This module prepares the data to ensure it is in an optimal format for analysis.
Once the data is processed, it flows to the “Detection” component, where machine
learning algorithms are applied to determine whether any software defects are present.
This component performs the defect analysis and generates results, which can be pre-
sented back to the user. The DFD Level 0 diagram simplifies the overall flow, highlighting
the core functional stages without diving into specific technical details.
1. Preprocessing: In this stage, the raw data is cleaned and standardized to ensure
consistency and to remove any irrelevant or noisy information. This step is essential
to make the data suitable for analysis.
2. Feature Extraction: After preprocessing, significant attributes or features are ex-
tracted from the data. These features represent key characteristics that will help
the machine learning model in identifying patterns associated with software defects.
3. Classification: In this final stage, the processed and extracted features are fed into a
classification model, typically a machine learning algorithm, to determine whether
a software defect is present. This classification is based on the learned patterns
from a trained dataset.
The output from the classification process flows to the “Detection” component, which
holds the final decision on whether a defect is detected in the software. This DFD Level
1 diagram provides a more granular view of the system, emphasizing each processing step
crucial for defect detection.
Next, Feature Extraction takes place, where various features are extracted from the
preprocessed code. These features can include code metrics (e.g., cyclomatic complexity,
Halstead metrics), syntactic features (e.g., control flow, data flow), and semantic features
(e.g., function calls, variable usage).
These extracted features are then fed into a Classification module. This module
employs machine learning algorithms to classify the software code as either defective or
non-defective. Additionally, it may also be able to classify the type of defect, if applicable.
Finally, the Detection module receives the classification results and presents them to
the user. This could involve highlighting potential defects in the code, providing a report
with defect probabilities, or suggesting specific code improvements.
This DFD level 2 diagram outlines a comprehensive approach to software defect
detection using machine learning. By preprocessing the code, extracting relevant features,
and applying classification algorithms, the system aims to identify potential defects early
in the development cycle, leading to improved software quality and reliability.
4.3 ER Diagram
As shown in Figure [4.5], the ER diagram illustrates the data entities and relationships
involved in the software defect detection system.
• Entities:
• User: Represents users of the system, including both developers and administrators.
• Code File: Represents the source code files that are uploaded for analysis.
• Dataset: Represents the dataset generated from the processed code files.
• Result: Represents the results of the defect detection process, including defect
details and classifications.
• Admin: Represents the administrators who manage the system.
• Relationships:
• User - Code File: A one-to-many relationship, where one user can upload multiple
code files.
• Code File - Processing: A one-to-one relationship, where each code file is processed
once.
• Processing - Dataset: A one-to-one relationship, where the processing step generates
one dataset.
• Feature Extraction - Classification: The Feature Extraction class passes the ex-
tracted features to the Classification class for classification.
• Classification - Segmentation: The Classification class may interact with the Seg-
mentation class to further analyze specific code segments.
• Classification - Result: The Classification class generates the final Result, which
includes defect details and classification results.
This class diagram provides a clear overview of the system’s components and their inter-
actions, aiding in the design and development of the software defect detection system.
recall, and F1-score. These metrics help assess the effectiveness of the defect detection
process. Additionally, the system utilizes techniques like SVM, DT, and tokenization to
aid in the analysis and classification of code defects.
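A minimal sketch of computing the evaluation metrics named above from a set of predictions, assuming scikit-learn; the label vectors are illustrative.

```python
# Sketch: precision, recall, and F1 from actual vs. predicted defect labels.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual defect labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```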
data analyzer processes the code files, extracting relevant features and generating labeled
datasets. The data classification component employs machine learning models to classify
the code based on defect likelihood. The machine learning module provides the underlying
algorithms and techniques for classification. The tokenization component breaks down
the code into smaller units for analysis. Finally, the data processing component handles
the preprocessing of data, ensuring it is suitable for analysis. These components work
together to effectively detect and identify potential defects in software code.
The user initiates the process by logging in or registering for an account. Once
authenticated, the user can upload a code file. The system then performs a series of
steps:
• Preprocessing: The code is preprocessed, which involves tasks like tokenization,
normalization, and formatting.
• Feature Extraction: Relevant features are extracted from the preprocessed code,
such as code complexity metrics, syntactic features, and semantic features.
• Segmentation: The code may be segmented into smaller units for more focused
analysis.
• Extraction from Text: Features can also be extracted from textual descriptions
or comments within the code.
• Classification: Machine learning models are applied to classify the code based on
defect likelihood.
• Detection: Potential defects are identified and presented to the user.
The system may involve an admin who oversees the system’s operations. The admin
may have access to additional functionalities, such as managing user accounts,
monitoring system performance, and updating the machine learning models.
This sequence diagram provides a clear visual representation of the interactions be-
tween the user and the system, highlighting the flow of control and data during the defect
detection process.
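The tokenization mentioned in the preprocessing step can be illustrated with Python's standard tokenize module, which breaks source code into the smaller units used for analysis. This is a sketch of the concept, not the system's actual tokenizer.

```python
# Sketch: break source code into tokens (names, operators, numbers) for analysis.
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER):
        print(tokenize.tok_name[tok.type], repr(tok.string))
```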
The extracted features are then used for Classification, where machine learning mod-
els are applied to classify the code based on defect likelihood.
Finally, the Detection phase identifies potential defects in the code and presents the
results to the user.
This activity diagram provides a clear visual representation of the sequence of activ-
ities involved in the defect detection process, helping to understand the overall workflow
and identify potential bottlenecks or areas for improvement.
instances. The predicted defects can be categorized into specific types, allowing for
targeted remediation efforts. This approach helps organizations proactively identify and
address potential defects, improving software quality and reliability.
4.6 Summary
This section of the project report delves into software defect detection using
machine learning. It covers essential aspects such as the system architecture,
emphasizing how the system is used. The data flow diagrams illustrate the information
flow within the system. The structural design includes an ER diagram and class diagram
showcasing
entity relationships. The behavioral design encompasses use case, sequence, activity, and
collaboration diagrams, demonstrating system interactions.
Chapter 5
IMPLEMENTATION
5.1 Implementation
5.1.1 Frontend
The frontend of the application is developed using Next.js and React, offering a scalable
and modular structure. To ensure responsiveness and accessibility across various devices,
we used Tailwind CSS for styling. For dynamic and interactive data visualization,
Recharts is used to graphically present the analysis results of the defect detection.
Additionally, Framer Motion is incorporated to add smooth animations, enhancing the
overall user interface and experience.
mental effort to understand. This metric, derived from Halstead’s software science,
correlates with the potential for defects in complex modules.
5. Documentation Quality: For modules with more than 100 lines of code, if the
ratio of comment lines to code lines is less than 10%, the module is flagged due
to insufficient documentation. Poorly documented code is harder to maintain and
more likely to introduce defects.
6. Control Flow Complexity: If more than 30% of the lines in a module contain
branches and the module exceeds 50 lines, it is considered to have complex control
flow, which is typically harder to test and debug.
If any of these conditions are met, the system flags the module as defective and
records the corresponding reason. If none of the thresholds are crossed, the module is
considered defect-free according to the rule-based logic.
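The rule-based logic can be summarized in code. The sketch below implements only the two rules whose thresholds appear in this section (documentation quality and control flow complexity); the earlier rules are not reproduced here, and the function signature is a hypothetical simplification.

```python
# Sketch of the rule-based flagging logic: a module is flagged with a reason if it
# crosses a threshold, and is considered defect-free under these rules otherwise.

def flag_module(code_lines: int, comment_lines: int, branch_lines: int) -> list[str]:
    reasons = []
    # Documentation quality: >100 code lines with a comment/code ratio under 10%.
    if code_lines > 100 and comment_lines / code_lines < 0.10:
        reasons.append("insufficient documentation (comment/code ratio < 10%)")
    # Control flow complexity: >50 lines with more than 30% branch lines.
    if code_lines > 50 and branch_lines / code_lines > 0.30:
        reasons.append("complex control flow (> 30% branch lines)")
    return reasons  # empty list -> defect-free according to these rules

print(flag_module(code_lines=120, comment_lines=5, branch_lines=40))
```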
Metric Description
loc Total lines of code, including comments and blanks.
vg Cyclomatic complexity – number of independent paths.
ev Essential complexity – measures structuredness.
iv Design complexity – complexity of calling patterns.
n Halstead length – total operators and operands.
v Halstead volume – size of implementation.
l Halstead level – abstraction level of code.
d Halstead difficulty – ease of understanding.
i Halstead intelligence – cognitive effort needed.
e Halstead effort – effort to code or understand.
t Halstead time – time to implement/understand.
lOCode Code lines excluding comments and blanks.
lOComment Number of comment lines.
lOBlank Number of blank lines.
locCodeAndComment Lines with both code and comments.
uniq_Op Unique operators used.
uniq_Opnd Unique operands used.
total_Op Total occurrences of operators.
total_Opnd Total occurrences of operands.
branchCount Total control flow branches (if statements, loops).
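As a worked example of the Halstead-derived entries in the table, the sketch below computes length, volume, difficulty, effort, and time from the four token counts using the standard Halstead formulas; the counts themselves are illustrative.

```python
# Sketch: standard Halstead metrics from unique/total operator and operand counts.
import math

uniq_op, uniq_opnd = 10, 15    # n1, n2: unique operators and operands
total_op, total_opnd = 40, 60  # N1, N2: total occurrences

vocab = uniq_op + uniq_opnd    # Halstead vocabulary
N = total_op + total_opnd      # Halstead length ("n" in the table)
V = N * math.log2(vocab)                       # volume ("v")
D = (uniq_op / 2) * (total_opnd / uniq_opnd)   # difficulty ("d")
E = D * V                                      # effort ("e")
T = E / 18                                     # time in seconds ("t"), Stroud number 18

print(f"length={N}, volume={V:.1f}, difficulty={D:.1f}, effort={E:.0f}, time={T:.0f}s")
```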
Screenshots of the Detection History interface are provided in this chapter to demonstrate
how the system archives predictions and makes them easily accessible. This historical
record not only enhances usability but also adds value for developers who wish to monitor
defect trends or re-evaluate past code modules based on updated models.
The testing process was carried out collaboratively by all four team members, with
responsibilities distributed across different areas of the system. While some members
focused on validating the machine learning model’s performance and tuning hyperpa-
rameters, others concentrated on testing the web interface, database integration, and
detection history functionality. Regular discussions and iterative feedback helped refine
the system and resolve bugs or inconsistencies efficiently. This collaborative effort en-
sured a thorough and well-rounded testing methodology, contributing significantly to the
quality and reliability of the final product.
In conclusion, the comprehensive testing process validated the accuracy and robust-
ness of the Software Defect Detection System. All major functionalities were verified,
and the system proved to be effective and stable across multiple environments and data
inputs. The inclusion of Detection History further enriches the user experience, offering
5.4 Costing
methods, however, suggest moderate complexity due to backend logic and machine learn-
ing integration. This multi-model estimation helps ensure the project’s financial planning
is robust and adaptable to real-world challenges.
The component costing for the Software Defect Detection project ensures that all
necessary resources are accounted for to facilitate smooth development, testing, and de-
ployment. It includes both direct costs, such as cloud services and hardware, and indirect
costs like software licenses and documentation preparation.
Development tools like Next.js, React, and Tailwind CSS are free, but costs related
to cloud hosting on Vercel and user management via Supabase are incurred. Machine
learning model training and data storage, along with testing resources, add to the overall
budget. Additionally, preparation of technical and user documentation is a necessary
expenditure.
Miscellaneous expenses, which may arise during development, are also considered.
Proper estimation of these costs ensures that the project stays within budget and can be
completed successfully. The table below presents the detailed component costs for the
project.
5.5 Summary
The implementation of the Software Defect Detection project involved using Next.js and
React for the frontend, along with Tailwind CSS for styling and Recharts for interactive
visualizations. The backend was powered by Next.js API routes, with Supabase han-
dling user authentication and data storage in a PostgreSQL database. Machine learning
algorithms, specifically Support Vector Machine (SVM) and Decision Tree, were used
to analyze software metrics and detect defects. The system processes user-inputted met-
rics, applies the model to detect potential defects, and displays results, enabling users to
generate detailed reports. This chapter outlines the technical stack, system architecture,
and the seamless integration of machine learning with cloud services for a scalable and
efficient defect detection solution.
Chapter 6
RESULT AND DISCUSSION
and reliability. The pipeline from data ingestion to prediction was found to be efficient
and consistent in delivering outputs, thus reinforcing the system’s end-to-end integrity.
The results from testing on publicly available open-source datasets aligned well with
known defect annotations, further affirming the model’s accuracy. The ability of the
system to produce reliable predictions suggests it can be used effectively in both academic
research and industrial software development. Additionally, the system’s design allows for
retraining the model with new or project-specific data, making it flexible and adaptable
to different programming standards and domains.
As shown in Fig. 6.1, the Register Page is a core component of the Software Defect
Detection System’s user interface, enabling new users to create an account before ac-
cessing the platform’s features. Built using React and styled with Tailwind CSS, the
page includes fields for entering essential user details such as name, email, and password.
Supabase handles user authentication securely in the backend. The page includes form
validation to ensure proper input formats and guides the user with appropriate prompts.
A clean layout and responsive design allow users to register easily on both desktop and
mobile devices. The figure below displays the visual layout of the registration page.
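For illustration, the sign-up call can be sketched with the supabase-py Python client (the application itself performs this from its Next.js frontend); the project URL, key, and credentials below are placeholders.

```python
# Sketch of a Supabase sign-up, using the Python client for illustration only;
# the real application calls Supabase from Next.js. All values are placeholders.
from supabase import create_client

supabase = create_client("https://YOUR-PROJECT.supabase.co", "YOUR-ANON-KEY")

result = supabase.auth.sign_up({
    "email": "user@example.com",
    "password": "a-strong-password",
})
print(result.user)  # populated on success
```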
As shown in Fig. 6.2, the Login Page allows registered users to securely access the
Software Defect Detection System. Designed with React and Tailwind CSS, the interface
provides a minimalistic and user-friendly experience. Users are required to input their
registered email and password, which are authenticated via Supabase in the backend.
The page includes error handling for invalid credentials and feedback messages for failed
login attempts. This secure entry point ensures that only authorized users can interact
with the system and view sensitive prediction data. The figure below shows the layout
and design of the login page.
As shown in Fig. 6.3, the Metrics Input Page is a critical part of the Software Defect
Detection System, where users can input software code metrics or upload metric files
for analysis. This page provides a structured form or upload interface that captures
key features such as lines of code, cyclomatic complexity, and other relevant software
metrics. Once submitted, the input is processed by the trained machine learning model,
which instantly predicts whether the given code is defective or non-defective. The result
is displayed on-screen with a clear label, either “Defect” or “Non-defect”, based on
the prediction. This page allows users to evaluate code quality quickly and supports
As shown in Fig. 6.5, the CSV File Upload Page allows users to upload datasets
containing software code metrics in bulk for batch defect prediction. This feature is
especially useful for analyzing large codebases or multiple modules simultaneously. The
page supports .csv files formatted with relevant metric fields such as LOC, complexity,
and module name. Once a file is uploaded, the system parses the content, processes
the data through the trained machine learning model, and generates predictions for each
entry—labeling them as either Defect or Non-defect. The results are displayed in a
structured tabular format for clarity. This functionality streamlines the defect detection
process, reducing manual input and enabling fast analysis. The image below demonstrates
the CSV upload interface and how results are displayed post-analysis.
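A rough sketch of this batch flow in Python: parse the uploaded CSV with pandas, run a trained model over the rows, and label each entry. The column names and the stand-in model are assumptions about the file format, not the system's actual schema.

```python
# Sketch: batch defect prediction over a CSV of module metrics.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in model trained on illustrative data (as in the earlier sketches).
model = make_pipeline(StandardScaler(), SVC())
model.fit([[120, 14, 40], [45, 3, 5], [300, 27, 90], [60, 5, 8]], [1, 0, 1, 0])

# Assumed CSV layout: columns module, loc, vg, branchCount.
df = pd.read_csv("modules.csv")
preds = model.predict(df[["loc", "vg", "branchCount"]])
df["prediction"] = ["Defect" if p == 1 else "Non-defect" for p in preds]
print(df[["module", "prediction"]].to_string(index=False))
```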
6.2 Discussion
6.2.1 Overview
The Software Defect Detection System successfully integrates machine learning with mod-
ern web technologies to address the critical issue of software reliability. By using code
metrics as input features, the system predicts whether a given piece of code is likely to
contain defects. The implementation involved creating an intuitive and responsive web
application using React, Next.js, Tailwind CSS, and Supabase, providing a seamless ex-
perience for developers. The inclusion of functionalities such as a dashboard, CSV file
upload, metrics input, and result download significantly improved the usability of the
system for real-world applications.
which required asynchronous data handling and efficient state management. Limitations
of the project include its dependency on metric-based datasets, lack of multi-language
code support, and reliance on labeled training data.
machine learning for defect detection provided tangible value by enabling early identi-
fication of potential code issues, thus improving the software quality assurance process.
Another significant achievement was the efficient use of modern web technologies such
as React, Next.js, and Supabase, which ensured a smooth user experience and scalabil-
ity. The ability to upload code metrics in bulk via CSV and view results on a central
dashboard also proved to be an invaluable feature. Overall, the system met its objec-
tives by offering an easy-to-use solution that aligns with the needs of software developers
and quality assurance teams, delivering a practical tool that adds value to the software
development process.