Research On Software Multiple Fault Localization Method Based On Machine Learning
1051/matecconf/201823201060
EITCE 2018
Abstract. Fault localization is one of the most time-consuming and labor-intensive activities in the debugging
process. Consequently, there is a strong demand for techniques that can guide software developers to the
locations of faults in a program with high accuracy and minimal human intervention. Although research on
neural networks and decision trees has made some progress in software multiple fault localization, there is
still a lack of systematic research on the various machine learning algorithms. Therefore, a novel
machine-learning-based multiple fault localization method is proposed in this paper. First, the concepts of
software multiple fault localization are introduced, followed by the status and development trends of the
research. Next, the principles of machine learning classification algorithms are explained. Then, a software
multiple fault localization research framework based on machine learning is proposed. Taking the Mid function
as an example, the performance of 22 machine learning models in software multiple fault localization is
compared and analyzed. Finally, the optimal machine learning method is verified in the multiple fault
localization of the Siemens suite dataset. The experimental results show that machine learning based on the
Random Forest algorithm achieves higher accuracy and markedly better localization efficiency. The method
effectively addresses the problems of large amounts of program spectrum data and multi-coupling fault
localization, which is very helpful for improving the efficiency of software multiple fault debugging.
multiple fault localization techniques, research status and development trend, and machine learning
classification model in Section 2. In Section 3, the specific process of the software multiple fault
localization method based on machine learning is summarized. Experimental results and analysis are
described in Section 4. Finally, conclusions are presented in Section 5.
2 Background
This section describes software multiple fault localization techniques, their research status and
development trend, and the machine learning classification model.
2.1 Software Multiple Fault Localization
Software multiple fault localization aims to find the wrong instruction, procedure, or data definition
implied in the source code of a program. The granularity of multiple fault localization can be program
statements, basic blocks, branches, functions, or classes. Software multiple fault localization is mainly
divided into static analysis methods and dynamic testing methods. Multiple fault localization based on
static analysis mainly uses program dependencies, constraint solving, and theorem proving to analyse
possible error locations in the program. Multiple fault localization based on dynamic testing mainly uses
test cases to collect program execution information and calculate possible error locations in the program.
The process by which defects cause multiple software failures is shown in Figure 1.
Figure 1. The process of software multiple failures caused by defects
First, during the operation of the system, defects may be activated, causing the system to malfunction;
second, the fault propagates as the system runs and continues to be converted into errors that are passed
between subsystems; finally, as the errors continue to spread, they eventually reach the user interface,
where the system's incorrect behavior is perceived by the user, which leads to system failure. It can be
seen that there is a multi-coupling fault action chain between the defect and the failure. Based on this
chain, the backtracking method can be used to find where the defect occurs and thus locate it. From the
above failure mechanism, it can also be seen that software defects do not necessarily lead to software
failure. Software defects are converted into runtime software failures only when specific conditions are
met. Software errors accumulate and propagate, and eventually the software fails.
According to the multiple fault mechanism described in Figure 1, the basic principle of traditional fault
localization is to set a breakpoint, re-run the defective program with the same input, check the
corresponding program state, and reason backwards, repeating this process until the defect is found. The
basic techniques used for traditional fault localization are breakpoints, single-step execution, debug
output, logging, event tracking, dump files, stack traceback, disassembly, observation and modification of
data, and control of debugged processes and threads.
2.2 Research Status and Development Trend of Fault Localization Techniques
In 2016, W. Eric Wong et al. [6] summarized the milestones in the development of software fault
localization technology in the paper "A Survey on Software Fault Localization", drawing on a publication
repository of 331 papers published from 1977 to 2014. In their survey, fault localization techniques were
classified into eight categories: slice-based, program spectrum-based, statistics-based, program
state-based, machine learning-based, data mining-based, model-based, and miscellaneous techniques.
In 2015, Chen Xiang et al. [7] offered a systematic overview of the research achievements of domestic and
foreign researchers in recent years in the paper "Review of Dynamic Fault Localization Approaches Based on
Program Spectrum". The survey proposed a research framework based on program spectrum for dynamic defect
localization and identified important factors that affect the effectiveness of fault localization. These
factors include program spectrum construction, test suite maintenance and composition, number of faults,
test case oracle, user feedback, and fault removal cost.
With the development of electronic technology, software is growing in scale and functionality, its internal
logic is becoming more complicated, the number of defects in a system is increasing, and software fault
localization is becoming more difficult. Traditional spectrum-based and slice-based software fault
localization address the problem of single fault localization. The problems of low accuracy in software
multiple fault localization and of large amounts of program spectrum data remain. Against the background of
artificial intelligence and big data, the development trends in recent domestic and foreign research are as
follows:
2.2.1 Software Multiple Fault Localization Techniques
MATEC Web of Conferences 232, 01060 (2018) https://doi.org/10.1051/matecconf/201823201060
Software fault localization techniques usually assume that the faulty program contains only one defect,
which is not the case. The presence of multiple faults in a program can inhibit the ability of fault
localization techniques to locate the faults. This problem occurs for two reasons: first, when a program
fails, the number of faults is generally unknown; second, certain faults may mask or obfuscate other faults.
In recent years, researchers have studied how to localize faults in programs that contain multiple faults.
In 2007, Jones et al. [8] presented a parallel debugging approach to this problem that leverages the
well-known advantages of parallel workflows to reduce the time-to-release of a program. It consists of a
technique that enables more effective debugging in the presence of multiple faults and a methodology that
enables multiple developers to debug multiple faults simultaneously. Unlike Jones et al., who use only
program feature behavior, Abreu et al. [9] proposed a hybrid framework with logical reasoning in 2010. They
use both feature information from program execution and Bayesian inference to infer multiple instances. A
characteristic of Bayesian inference over defects and their suspiciousness is that it can explain well why
multiple defects occur intermittently and cause program errors.
2.2.2 Machine Learning-Based Techniques
Machine learning is the study of computer algorithms that improve through experience. Machine learning
techniques are adaptive and robust and can produce models based on data. In the context of fault
localization, the problem can be framed as trying to learn or deduce the location of a fault from input data
such as statement coverage and the execution result of each test case.
Briand et al. [10] use the C4.5 decision tree algorithm to construct rules that classify test cases into
various partitions. The statement coverage of both the failed and successful test cases in each partition is
used to rank the statements using a heuristic similar to Tarantula [11]. These individual rankings are then
consolidated to form a final statement ranking, which can be examined to locate the faults. This technique is
more effective at locating bugs than other contemporary state-of-the-art techniques, as only a relatively
smaller amount of code needs to be examined to find bugs. Wong et al. [12] proposed a fault localization
technique based on a back-propagation (BP) neural network. The coverage data of each test case and the
corresponding execution results are collected and used to train a BP neural network so that the network can
learn the relationship between them. Then, the coverage of a suite of virtual test cases, each of which
covers only one statement in the program, is input to the trained BP network. The outputs can be regarded as
the likelihood of each statement containing the bug.
In 2009, Ascari et al. [13] extended the BP-based technique [14] to object-oriented programs. BP neural
networks are known to suffer from issues such as paralysis and local minima. Subsequently, in 2012, Wong et
al. [15] proposed an improved approach based on radial basis function (RBF) networks, which are less
susceptible to these problems and have a faster learning rate [16], [17]. The RBF network is trained using an
approach similar to the BP network. Once the training is completed, the output.
In 2013, He J.L et al. [18] proposed a novel neural-network-based multiple fault localization model, which
support degree of the input for each fault. The model learns the relationship between the faults and the
candidate fault locations using the constructed neural network. By constructing an ideal input for the
trained neural network, the model calculates the suspiciousness of each candidate fault location, sorts the
locations by suspiciousness, and thereby completes the task of multiple fault localization.
The above research is based only on neural networks and decision trees. It has made some progress in software
multiple fault localization. However, there is still a lack of systematic research on the various machine
learning algorithms in software multiple fault localization. For the coupling, correlation and nonlinearity
of software multiple faults, machine learning offers strong generalization ability, adaptability and
robustness. It can learn the inherent laws implicit in the data from a finite sample. In summary, there is an
urgent need to use machine learning to train on large-scale program spectrum data to solve the problem of
software multiple fault localization.
2.3 Machine Learning Classification Model
Machine learning classification is a supervised learning approach in which the computer program learns from
the data input given to it and then uses this learning to classify new observations. The machine learning
classification model attempts to construct a classifier from known observed values and to predict the
category of an object whose category is unknown. Machine learning classification algorithms include Bayesian
methods, neural networks, support vector machines, rules, decision trees, and integrated (ensemble) learning.
2.3.1 Bayesian Network
A Bayesian network [19] is a probabilistic graphical network based on probabilistic reasoning. Bayes' formula
is the basis of this network. It is suitable for expressing and analyzing uncertain and probabilistic events,
and it can reason from incomplete, inaccurate or uncertain knowledge. The main goal of Bayesian inference is
to estimate the value of a hidden node given the value of the observed node. Bayesian-based classification
algorithms include Bayes Net, Naive Bayes, Naive Bayes Multinomial, Naive Bayes Multinomial Text, Naive Bayes
Multinomial Updateable, and Naive Bayes Updateable.
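As a minimal illustration of the Bayesian inference described in Section 2.3.1, the sketch below applies Bayes' formula to a two-node network: a hidden node (whether a statement is faulty) and an observed node (whether a test that covers it failed). All probabilities are hypothetical values chosen for illustration, and the function name `posterior_faulty` is an assumption, not something defined in the paper.

```python
def posterior_faulty(prior_faulty, p_fail_if_faulty, p_fail_if_correct):
    """Return P(faulty | test failed) by Bayes' formula for a two-node network."""
    joint_faulty = p_fail_if_faulty * prior_faulty
    joint_correct = p_fail_if_correct * (1.0 - prior_faulty)
    evidence = joint_faulty + joint_correct  # P(test failed)
    return joint_faulty / evidence

# Hypothetical numbers: 10% prior belief that the statement is faulty,
# a covering test fails 90% of the time if it is faulty and 20% otherwise.
posterior = posterior_faulty(prior_faulty=0.1,
                             p_fail_if_faulty=0.9,
                             p_fail_if_correct=0.2)
print(round(posterior, 3))  # prints 0.333
```

Observing a single failure raises the belief that the statement is faulty from 0.1 to about one third; the classifiers listed above combine many such observations across statements.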
The specific process of the software multiple fault localization method based on machine learning is
summarized as follows. The block diagram is shown in Figure 3.
(1) Input the program spectrum dataset.
The program spectrum data consists of the statement coverage vector of each executed test case and the
execution result of that test case.
(2) Train on the program spectrum dataset to obtain an optimal model.
A machine learning model (Bayesian, neural network, support vector machine, rules, decision tree, integrated
(ensemble) learning) is trained on the program spectrum dataset, and the trained model is optimized by
adjusting the machine learning parameters.
(3) Obtain a statement suspiciousness ranking by testing the optimal model with the virtual unit matrix.
The optimal machine learning model is tested with the virtual unit matrix, and the suspiciousness value of
each statement is obtained from the test results. The values are then sorted from high to low. The
suspiciousness ranking lets the debugger check for the error statement by statement in order of
suspiciousness, thereby improving the fault localization efficiency.
The mid function is a classic example for demonstrating software fault localization. It returns the middle
value of three numbers. The mid program is taken as an example to illustrate software fault localization with
22 machine learning models (RF is Random Forest, BP is Back Propagation Neural Network, LB is Logit Boost,
AB is AdaBoostM1, NB is Naive Bayes, NBU is Naive Bayes Updateable, NBM is Naive Bayes Multinomial,
NBMU is Naive Bayes Multinomial Updateable, DS is Decision Stump, SMO is, HT is Hoeffding Tree, BN is
Bayes Net, RT is, NBMT is Naive Bayes Multinomial Text, DT is Decision Table, and REPT is REP Tree).
3.1 Software Single Fault Localization
For the single fault localization of line 4 of the mid function, 22 machine learning models were trained on
the program spectrum data.
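The three-step process of this section (input the spectrum, train a model, query it with the virtual unit matrix) can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the paper's implementation: the spectrum below is hypothetical (six test cases over five statements, with statement 4 imagined as the faulty one), the classifier is a hand-written Bernoulli naive Bayes standing in for any of the models named in the text, and all function names are illustrative.

```python
import math

def train_bernoulli_nb(coverage, failed, alpha=1.0):
    """Step 2: fit a Bernoulli naive Bayes model on the program spectrum.

    coverage: list of 0/1 statement coverage vectors, one per test case.
    failed:   list of 0/1 execution results (1 = test failed).
    """
    n_stmt = len(coverage[0])
    model = {}
    for label in (0, 1):
        rows = [row for row, r in zip(coverage, failed) if r == label]
        # Laplace-smoothed class prior and per-statement coverage probability.
        prior = (len(rows) + alpha) / (len(coverage) + 2 * alpha)
        p_cov = [(sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_stmt)]
        model[label] = (prior, p_cov)
    return model

def failure_probability(model, row):
    """Posterior probability that a test with coverage vector `row` fails."""
    log_post = {}
    for label, (prior, p_cov) in model.items():
        lp = math.log(prior)
        for covered, p in zip(row, p_cov):
            lp += math.log(p if covered else 1.0 - p)
        log_post[label] = lp
    top = max(log_post.values())  # subtract the max for numerical stability
    norm = sum(math.exp(v - top) for v in log_post.values())
    return math.exp(log_post[1] - top) / norm

def rank_statements(coverage, failed):
    """Step 3: score each statement with the virtual unit matrix and sort."""
    model = train_bernoulli_nb(coverage, failed)
    n_stmt = len(coverage[0])
    # Virtual unit matrix: row i is a virtual test covering only statement i.
    unit = [[1 if i == j else 0 for j in range(n_stmt)] for i in range(n_stmt)]
    susp = [failure_probability(model, row) for row in unit]
    order = sorted(range(n_stmt), key=lambda j: susp[j], reverse=True)
    return [(j + 1, susp[j]) for j in order]  # 1-based statement numbers

# Step 1: hypothetical program spectrum (illustrative, not data from the paper).
coverage = [
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
]
failed = [0, 0, 0, 1, 1, 0]

ranking = rank_statements(coverage, failed)
for stmt, score in ranking:
    print(f"statement {stmt}: suspiciousness {score:.3f}")
```

On this spectrum statement 4 comes first in the ranking, because it is covered by both failing tests and by no passing test. Swapping in a different classifier only changes the training and scoring functions; the virtual unit matrix query stays the same.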
Figure 5. Comparison of single-fault suspiciousness values computed by the machine learning algorithms