An Automatic Software Test-Generation Method To Di
https://doi.org/10.1007/s11227-025-07219-5
Abstract
One of the most time-consuming and expensive phases in software development is software testing, which is used to improve the quality of software systems. Software test automation is therefore a helpful technique that can reduce testing time. Several techniques based on evolutionary and heuristic algorithms have been put forth to produce test sets with maximum coverage. The primary shortcomings of earlier methods are inconsistent outcomes, insufficient branch coverage, and low fault-detection rates. Increasing the branch coverage rate, defect-detection rate, success rate, and stability are the primary goals of this research. A time- and cost-effective method
has been suggested in this research to produce test data automatically by utilizing
machine learning and horse herd optimization algorithms. In the first stage of the
proposed method, the suggested machine learning classification model identifies
the non-error-propagating instructions of the input program using machine learning
algorithms. In the second stage, a test generator is suggested to cover only the program's fault-propagating instructions. The main characteristics of the produced test data are avoiding the coverage of non-error-propagating instructions, maximizing the coverage of error-propagating instructions, maximizing the success rate, and improving fault-discovery capability. Several experiments have been performed using nine standard
benchmark programs. In the first stage, the suggested instruction classifier provides
90% accuracy and 82% precision. In the second stage, according to the results, the
produced test data by the suggested method cover 99.93% of the error-prone instruc-
tions. The average success percentage with this method was 98.93%. The suggested
method identifies roughly 89.40% of the injected faults by mutation testing tools.
1 Introduction
The software testing procedure is crucial to guarantee the product’s quality. Soft-
ware testing can be carried out automatically or manually. Automated testing tech-
niques save money and time compared to manual testing, which is labor-intensive.
Because of its significance, researchers view automated software testing as a signifi-
cant challenge and area of concern. Using traditional manual testing techniques on
large-scale real software systems can be expensive and time-consuming. Both the
cost and the testing time can be significantly decreased by using automated software
testing. The problem of automatically generating optimal unit tests is the main topic
of this work. One of the primary optimization challenges is to automatically gener-
ate test data with maximal branch coverage in the least amount of time. Choosing a
small subset from all possible input combinations with maximum program branch
coverage is an NP (Nondeterministic Polynomial Time) complete problem [1, 2].
Many heuristic methods have been proposed to create test sets with maximal
coverage [3–5]. Low success rates, inadequate coverage, high computational cost, inadequate stability, and a limited ability to cover hard-to-cover instructions are the main drawbacks of previous test-generation methods. The main objectives of the present study are to increase the branch coverage rate, fault-detection rate, success rate, and stability of automatic test generation.
This study suggests an efficient approach to generate test data automatically. A machine learning (ML)-based method was proposed to classify and identify the program's fault-masking (non-propagating) and fault-propagating instructions. Non-propagating (fault-masking) instructions are non-sensitive instructions with a negligible propagation likelihood. In this study, instructions with a 0%–9% fault-propagation rate are categorized as non-sensitive. In the second stage, the horse herd optimization algorithm (HHOA) was utilized to generate effective test data. The suggested test generator covers only the fault-propagating instructions. The main objective of the proposed method is to create test data that provides the highest branch coverage of the fault-propagating (sensitive) instructions within a given time limit. The primary contributions of this study are the ML-based instruction classifier and the HHOA-based test generator.
The following sections of this paper include Sect. 2, which describes the basic
theoretical ideas and gives an overview of the existing test-generation methods.
Section 3 presents and explains in detail the suggested test-generation method. Section 4 discusses the experiments, benchmark programs, and results in terms of coverage, fault-detection rate, and stability. Finally, Sect. 5 concludes the paper and outlines future research.
2 Related works
to produce test data. Similarly, the combination of PSO and regression analysis has
been suggested in [9] to provide high-coverage test data. Furthermore, because the
PSO approach is simple to apply and generates test data fast, it was utilized in [10]
to give test data with a distinct objective function. This research suggests branch dis-
tance functions to cover fault-prone or crucial test paths.
In [11], the shuffled Frog Leaping Algorithm (SFLA) is introduced to produce
test data. The SFLA method is simple to use and has a quick convergence rate. The
proposed SFLA uses branch coverage as the fitness function (goal) to provide mean-
ingful test data. The performance of the suggested SFLA was compared with seven
existing test generators using benchmark programs. This method was compared with
the artificial bee colonies (ABC), genetic algorithms (GA), ant colony optimization
(ACO), and particle swarm optimization (PSO) in terms of coverage and success
rate. The findings show that in terms of average success rate, branch coverage, and
average number of iterations to cover all branches, the SFLA-based approach per-
forms better than PSO, GA, and ACO. Several techniques, including the artificial
bee colony (ABC), simulated annealing, particle swarm, genetic algorithm, and oth-
ers, were compared and evaluated in [12]. A distance function based on the branch
coverage serves as a fitness function. Regarding the results, ABC outperforms SA,
PSO, GA, and ACO in terms of coverage and success rate. The ABC algorithm per-
formed better than others in producing the best test data.
A strategy that uses the Ant Colony Optimization (ACO) algorithm was pre-
sented in [13] to produce the high-coverage test data. A unique fitness function was
developed using the coverage criterion. The experimental findings demonstrate that
this method yields more stable outcomes with increased coverage and convergence.
An improved fitness function is utilized in a test-generation method [14]; in this
method, the Imperialist Competitive Algorithm (ICA) was developed as a test-gen-
eration technique. This method tries to cover the program’s fault-prone test paths.
Every approach has pros and cons, succinctly outlined in Table 1. An automated test-generation method based on machine learning and the horse herd optimization algorithm is employed in this study. Unlike the previous methods, the suggested method finds the fault-prone parts of the program before test generation and covers the identified fault-prone instructions instead of all instructions. A detailed discussion of the suggested approach is provided in Sect. 3.
In [15], a test-generation method was suggested to automatically generate a small set of test scenarios for feature-based context-oriented programs; minimizing the number of context activations between tests is the main objective of this
method. In this study, two strategies were proposed to reduce the number of sup-
plementary test scenarios to regenerate the complete pairwise coverage. By using
this method instead of generating a completely new test suite, the cost of creating a
test suite upon evolution of the system was reduced by 75%. In [16], a hyper-heuristic-based test-generation method has been proposed. This method is implemented using three search operators: a Genetic Algorithm with low-level heuristics, Particle Swarm Optimization with low-level heuristics, and Strength Pareto Evolutionary optimization with low-level heuristics. The method was evaluated with a test model; the results show that the proposed method outperforms other existing methods in terms of test-suite size, execution time, and coverage criteria.
3 The method
In the first stage, the program’s ineffective (insensitive) instructions are elimi-
nated, reducing the program’s size. The program instructions are then categorized
according to the rate of error propagation (sensitivity). The created Machine Learn-
ing (ML)-based classifier initially categorizes the program’s instructions during pre-
processing. In the second stage, a heuristic method was recommended to generate
program test data to cover the branches of the sliced program. The source code of the sliced program is statically analyzed to extract the required structural information. The source code is parsed, and the data types and number of
input data, branches, and execution paths are determined. The coverage-based test
data is generated in the second stage using the Horse Herd Optimization Algorithm
(HHOA). The utilized fitness function was defined using branch coverage. The test
suite produced by HHOA is the output. Lastly, mutation testing assesses the likeli-
hood of fault detection in the generated test data.
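The two-stage flow described above can be sketched as follows. This is a hypothetical outline, not the paper's implementation; the classifier, parser, and HHOA search are passed in as placeholders.

```python
# Hypothetical sketch of the two-stage pipeline described above.
# Stage 1: an ML classifier slices away non-error-propagating instructions;
# Stage 2: HHOA searches for test data covering the remaining branches.

def slice_program(instructions, classifier):
    """Keep only instructions predicted to propagate errors (label 1)."""
    return [ins for ins in instructions if classifier(ins) == 1]

def generate_tests(branches, hhoa_search, time_limit):
    """Run the HHOA-based generator until all branches are covered or the
    time budget is exhausted (both callables are placeholders)."""
    return hhoa_search(branches, time_limit)

def ml_hhoa_pipeline(instructions, classifier, parse_branches, hhoa_search,
                     time_limit=60):
    sliced = slice_program(instructions, classifier)          # stage 1
    branches = parse_branches(sliced)                         # static analysis
    return generate_tests(branches, hhoa_search, time_limit)  # stage 2
```

Because the search in stage 2 only sees the branches of the sliced program, any effort spent by the optimizer is directed at the sensitive instructions.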
3.1.1 Training dataset
In real-world programs, specific code segments are written to satisfy quality and
non-functional requirements without affecting the final product of the program.
Studies, such as by Arasteh et al. in [25], indicate that a sizable percentage of computer instructions do not affect the program outputs. Mutations (injected bugs) in these insignificant (non-sensitive) code segments therefore do not propagate to the program outputs.
Fig. 2 The source code of the Statistics benchmark program that includes the quality-related code as
non-sensitive code
ML + HHOA tries to cover the sliced program instead of all instructions. Eliminating the ineffective (non-error-propagating) instructions and covering the remaining effective instructions is one of the main merits of the proposed method. As shown in Fig. 2, ML + HHOA provides higher coverage than the pure HHOA and previous methods.
An instruction classifier uses supervised machine learning methods to categorize
instructions according to their sensitivity. This classifier is intended to examine the
source code of a program and classify its instructions according to how they affect
the program’s output. The dataset developed in [26] (one of the effective datasets in
program instruction sensitivity) was used as a training and test dataset. In this data-
set, the information on benchmark programs with different complexity and structure
has been analyzed and collected. The characteristics of the utilized benchmark programs to create the dataset are detailed in Table 2. Iterative and recursive structures are included in the chosen benchmark programs. Many data types and operators (arithmetic and logical) are employed in the benchmark programs. The chosen programs are made up of different programming structures (such as if, for, and while) that are frequently seen in practical applications. Furthermore, the benchmark programs have higher cyclomatic complexity than conventional application programs. The Triangle program has a cyclomatic complexity of around 34, whereas in traditional real and practical applications the complexity is usually less than 10. Additionally, every program uses a variety of data types and arithmetic and logical operators. The chosen benchmark programs are typically more complicated than conventional functions seen in real and practical applications.
The training dataset’s records are displayed in Table 3. Ten features are included
in each record. The properties of a particular instruction in a benchmark program
are indicated as a dataset’s row. Indeed, each record (row) in the created dataset is
related to the specification of an instruction in a program. An instruction’s acces-
sibility is indicated by its nesting level; if an instruction has a nesting level of 1, it
is not included in an if instruction. The second feature shows the number of data
definitions (assignments) in the instruction. The number of logical and arithmetic
operators is indicated by features 3 and 4, respectively. Features 5 and 6 record the
number of variable data and intermediate values in the instruction. The data and
control dependencies among instructions were calculated using the generated con-
trol flow graph (CFG). Regarding data and code dependency, features 7 and 8 show
how many more instructions in the code depend on the given instruction. The length
of the instruction is shown by feature 9 in the dataset, which counts the number of
operands and operators in the instruction. Feature 10 depicts the overall quantity of
data used in the instruction. The error-propagating rate of an instruction within a
program is represented by the final column (the eleventh column) in Table 3.
A mutation testing (fault injection) method was applied to each instruction in
the benchmark programs to assess this feature. At this point, the original program
was mutated using MuJava as a mutator tool. Using MuJava’s tool, bugs (program-
ming faults) were introduced into the program’s instructions. To guarantee thorough
instruction coverage, the mutated (buggy) instructions were run through the created
test set, which consists of 100 test data points. This procedure was repeated for each instruction in the benchmark programs. Each instruction's sensitivity (error-propagation rate) was calculated by Eq. (1); to this end, each mutated instruction was executed by its covering test data. The sensitivity of an instruction is the number of times it causes the program to fail divided by the total number of executions. An instruction with a 0%–9% propagation rate is categorized as low error-propagating (non-sensitive), while instructions with 10%–100% sensitivity are classified as high error-propagating (sensitive). For high error-propagating instructions, the value of the eleventh feature is set to 1 in the dataset. The sensitivity threshold separating the instruction classes can be adapted based on the application.
Table 3 Sample records of the training dataset. Columns (left to right): nesting level, number of assignments, number of logical operators, number of arithmetic operators, number of variables, number of intermediate values, number of data-dependent instructions, number of control-dependent instructions, instruction length, total data items, sensitivity class (1 = high error-propagating):
2 1 1 0 0 0 0 0 4 1 0
1 1 0 0 0 0 0 0 7 1 0
2 1 1 0 0 1 0 0 6 2 0
3 0 0 1 2 0 2 0 11 1 1
3 1 0 0 0 1 0 1 7 1 1
1 0 0 1 2 1 2 1 10 3 1
2 1 1 0 1 2 1 2 6 2 1
3 0 0 0 1 0 0 2 3 2 1
2 1 1 0 1 1 1 1 6 1 1
1 0 0 0 0 0 0 1 1 2 0
1 0 0 0 1 1 1 0 10 1 0
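To make the classification task concrete, the sample records above can be fed to a minimal nearest-neighbor classifier. This is an illustrative stdlib-only sketch, not the classifier the paper actually trained (which reached 90% accuracy and 82% precision); the feature ordering follows the column legend of Table 3.

```python
import math

# Sample records from Table 3: ten features per instruction; the last value
# is the sensitivity class (1 = high error-propagating, 0 = non-sensitive).
DATASET = [
    (2, 1, 1, 0, 0, 0, 0, 0, 4, 1, 0),
    (1, 1, 0, 0, 0, 0, 0, 0, 7, 1, 0),
    (2, 1, 1, 0, 0, 1, 0, 0, 6, 2, 0),
    (3, 0, 0, 1, 2, 0, 2, 0, 11, 1, 1),
    (3, 1, 0, 0, 0, 1, 0, 1, 7, 1, 1),
    (1, 0, 0, 1, 2, 1, 2, 1, 10, 3, 1),
    (2, 1, 1, 0, 1, 2, 1, 2, 6, 2, 1),
    (3, 0, 0, 0, 1, 0, 0, 2, 3, 2, 1),
    (2, 1, 1, 0, 1, 1, 1, 1, 6, 1, 1),
    (1, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0),
    (1, 0, 0, 0, 1, 1, 1, 0, 10, 1, 0),
]

def classify(features, k=3):
    """k-NN majority vote over Euclidean distance in the 10-feature space."""
    dists = sorted(
        (math.dist(features, row[:10]), row[10]) for row in DATASET
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)
```

A production classifier would of course be trained on the full dataset of [26] rather than these eleven sample rows.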
B. Arasteh et al.
(Figure: accuracy, precision, and recall of the suggested instruction classifier)

Propagation Rate = (#Failures / #Executions) × 100    (1)
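Equation (1) and the 10% sensitivity threshold described above can be sketched directly; the function names are illustrative, not the paper's code.

```python
def propagation_rate(num_failures, num_executions):
    """Eq. (1): percentage of covering executions in which the mutated
    instruction caused the program to fail."""
    if num_executions == 0:
        raise ValueError("instruction was never executed by the test set")
    return num_failures / num_executions * 100

def sensitivity_label(num_failures, num_executions, threshold=10):
    """Label 1 (sensitive, high error-propagating) when the propagation
    rate reaches the threshold, else 0; the 10% threshold is adaptable
    per application, as the text notes."""
    return 1 if propagation_rate(num_failures, num_executions) >= threshold else 0
```

This label is exactly the eleventh column of the training dataset in Table 3.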
3.1.2 Instruction classification
The second step involves automatically analyzing (parsing) the sliced source code, which contains only the instructions that frequently propagate errors. The number of inputs, the data types, their domains, and the number of conditional (branch) instructions are all extracted from the program source code. The implemented code carries out this second step of the suggested method automatically.
The input and output of the parser (second step) of the suggested method are shown
in Fig. 4. The structure of a horse (an agent) in the developed HHOA is depicted in
Fig. 5. The complexity of the second step of the suggested method is O(n).
This study suggested an automatic technique for producing optimal test data through
HHOA. In this section, the suggested test data generator using HHOA was explained.
HHOA starts with a randomly generated population, each member of which is referred to as a horse [18]. The proposed HHOA includes specific imitation behaviors as its local and global search techniques. Grazing, hierarchy, sociability, imitation, defense mechanisms, and roaming are the six main categories of horse activities, which HHOA classifies based on similarities in behavior among the age groups of horses. Figure 6 depicts the steps of the suggested procedure.
Fig. 4 The recommended code-parser's input and output for the printcalender benchmark program
Fig. 5 The implemented structure for an agent (horse) in the suggested HHOA for the printcalender benchmark program
Equation (2) indicates the horse movement in HHOA in each iteration:

X_m^(Iter,AGE) = V_m^(Iter,AGE) + X_m^(Iter−1,AGE)    (2)

In Eq. (2), X_m^(Iter,AGE) indicates the location (position) of the m-th horse, AGE and V_m^(Iter,AGE) show the age and velocity of the horse, and Iter denotes the iteration. The life span of a horse is 25–30 years. Horses' ages are divided into four categories (α, β, γ, and δ).
Since HHOA ranks the horses according to their best responses, the first 10% of horses from the top of the sorted matrix are selected as α horses. The remaining horses are divided into β, γ, and δ groups of twenty, thirty, and forty percent, respectively. Equation (3) characterizes the velocity vector associated with the horses in each cycle of the procedure. Six behaviors of the herd's horses are used to build the velocity vector.
V_m^(Iter,α) = G_m^(Iter,α) + D_m^(Iter,α)
V_m^(Iter,β) = G_m^(Iter,β) + H_m^(Iter,β) + S_m^(Iter,β) + D_m^(Iter,β)
V_m^(Iter,γ) = G_m^(Iter,γ) + H_m^(Iter,γ) + S_m^(Iter,γ) + I_m^(Iter,γ) + D_m^(Iter,γ) + R_m^(Iter,γ)
V_m^(Iter,δ) = G_m^(Iter,δ) + I_m^(Iter,δ) + R_m^(Iter,δ)    (3)
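The age-dependent composition of Eq. (3) can be sketched as a lookup from age group to the subset of behavior vectors that are summed. The behavior values below are placeholder vectors, not the paper's implementation of grazing, hierarchy, sociability, imitation, defense, and roaming.

```python
# Eq. (3): each age group composes its velocity from a different subset of
# the six behavior vectors (grazing G, hierarchy H, sociability S,
# imitation I, defense D, roaming R).

AGE_TERMS = {
    "alpha": ("G", "D"),
    "beta":  ("G", "H", "S", "D"),
    "gamma": ("G", "H", "S", "I", "D", "R"),
    "delta": ("G", "I", "R"),
}

def velocity(age, behaviors, dim):
    """Sum the behavior vectors selected for this age group (Eq. 3).
    behaviors maps each letter to a length-`dim` vector."""
    v = [0.0] * dim
    for name in AGE_TERMS[age]:
        v = [vi + ti for vi, ti in zip(v, behaviors[name])]
    return v
```

Middle-aged (γ) horses use all six behaviors, while the oldest (α) horses only graze and defend, which matches the behavioral description in the text.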
Horses graze all their lives, regardless of age. Every horse grazes in designated areas; the algorithm models the grazing zone surrounding each animal with a coefficient g. Equations (4) and (5) express this behavior mathematically.

G_m^(Iter,AGE) = g_m^(Iter) (ŭ + ρ ľ) [X_m^(Iter−1)],  AGE = α, β, γ, δ    (4)

g_m^(Iter,AGE) = g_m^(Iter−1,AGE) × ω_g    (5)
In Eqs. (4) and (5), G_m^(Iter,AGE) indicates the grazing motion of the m-th horse; in each iteration it reduces linearly with g. The values u and l denote the upper and lower limits of the grazing space, which are adjusted to 1.05 and 0.95, respectively. The value of g is suggested to be 1.5 for all four age groups. Horses follow a leader (an experienced and powerful horse) of the herd; horses have been observed to comply with this hierarchy in middle age (5–15 years). Equations (6) and (7) describe this behavior.
H_m^(Iter,AGE) = h_m^(Iter,AGE) [X*^(Iter−1) − X_m^(Iter−1)],  AGE = α, β, γ    (6)

h_m^(Iter,AGE) = h_m^(Iter−1,AGE) × ω_h    (7)

In Eqs. (6) and (7), H_m^(Iter,AGE) and X*^(Iter−1) indicate the effect of the best horse's location on the velocity. Horses need to interact with other herd members. A herd protects its members from predators, improving their chances of survival and lessening their vulnerability. Horses generally do better in groups, according to observations. This social behavior is characterized by a tendency to move toward the average position of the other horses. HHOA models this tendency with the variable S, as indicated in Eqs. (8) and (9).
S_m^(Iter,AGE) = s_m^(Iter,AGE) [(1/N) Σ_{j=1..N} X_j^(Iter−1) − X_m^(Iter−1)],  AGE = β, γ    (8)

s_m^(Iter,AGE) = s_m^(Iter−1,AGE) × ω_s    (9)

In Eqs. (8) and (9), S_m^(Iter,AGE) and s_m^(Iter,AGE) indicate the social motion vector and orientation of the m-th horse in iteration Iter. With the ω_s parameter, the orientation toward the herd reduces in each iteration. N is the total number of horses in the herd.
Horses mimic one another and take up good and harmful habits and behaviors, such as scouting out the best spot to graze. This imitation behavior, denoted by I in HHOA, may be modeled using Eqs. (10) and (11).

I_m^(Iter,AGE) = i_m^(Iter,AGE) [(1/pN) Σ_{j=1..pN} X̂_j^(Iter−1) − X_m^(Iter−1)],  AGE = γ    (10)

i_m^(Iter,AGE) = i_m^(Iter−1,AGE) × ω_i    (11)

In Eqs. (10) and (11), the movement vector of the m-th horse toward the best horses is shown by I_m^(Iter,AGE). The value pN is the number of best horses (horses in the best locations); p indicates ten percent of the total number of horses, and ω_i is the decreasing factor per iteration. Horses also show a defense mechanism; their main instinct is running and bucking. Equations (12) and (13) express the defense mechanism of individuals in HHOA. The coefficient d is defined to keep horses away from incorrect positions.

D_m^(Iter,AGE) = −d_m^(Iter,AGE) [(1/qN) Σ_{j=1..qN} X̌_j^(Iter−1) − X_m^(Iter−1)],  AGE = α, β, γ    (12)

In Eqs. (14) and (15), the random (roaming) velocity vector of the m-th horse is indicated by R_m^(Iter,AGE). The value of ω_r is the decrement factor of r_m^(Iter,AGE).
3.2.3 Fitness function
Here δ is a constant with a value of 0.01, and f defines the branch distance function. Table 5 is used to compute the branch distance of a branch instruction; the conditional statement of the branch determines the value of its distance function. Each test suite's (TS) fitness is quantified using Eq. (17).

Fitness(TS) = 1 / (δ + [ Σ_{i=1..s} w_i · min_{k=1..m} f(bch_i, X_k) ]²)    (17)
To assess the test set's coverage, a distance function was employed. Table 5 provides the distance of a branch. As Table 5 illustrates, the value of the distance is zero if the expression used in the branch instruction evaluates to true on the test data; otherwise, the variable's value is added to the distance. Equation (17) evaluates to 1/δ when the test set (TS) covers every branch, achieving maximum coverage. The branch weight w_i in Eq. (17) is the distance function's coefficient. A branch's weight combines the predicate weight and its nesting level and is calculated using Eq. (18). The balance factor is represented by λ, and its value was set to 0.5 in the experiments. The following factors determine the branch's weight.
The nesting level (branch level) is the first component of the branch weight, according to Eq. (19). A branch with more nesting levels is harder to cover. A branch (1 ≤ i ≤ s) is denoted by the index i, and the nesting level of branch i by nl_i. The lowest nesting level among the program's branches is represented by nl_min, which is 1 in general, and nl_max is the highest branch level. Equation (20) normalizes the branch-level value.
wn(bch_i) = (nl_i − nl_min + 1) / (nl_max − nl_min + 1)    (19)

wn′(bch_i) = wn(bch_i) / Σ_{i=1..s} wn(bch_i)    (20)
The weight of an expression, the second component of the branch weight, indicates its complexity. Table 6 and Eq. (21) are used to calculate the expression weight of a branch. If the branch statement is made up of h predicates joined by the "and" operator, the weight of the statement is the square root of the sum of the squared predicate weights. If the h predicates are joined by the "or" operator, the weight of the expression is the minimum predicate weight. Equation (22) normalizes the expression weight.
Table 6 Operator weights in predicates:
"=="                      0.9
"!="                      0.2
Boolean                   0.5
"<", ">", "<=", ">="      0.6
wp(bch_i) = sqrt( Σ_{j=1..h} w_r²(c_j) ),  if the conjunction is "and"
wp(bch_i) = min_j w_r(c_j),                if the conjunction is "or"    (21)

wp′(bch_i) = wp(bch_i) / Σ_{i=1..s} wp(bch_i)    (22)
In Eq. (21), the number of accessible predicates in the ith branch (bchi) is given
by the variable h. In this case, cj denotes the jth conditional predicate of the corre-
sponding branch (1 ≤ j ≤ h). In Table 6, the value of an operator’s weight is indicated
by the wr in a predicate. Table 6 lists the operators that can be utilized in different
predicates in real-world programs.
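Putting Eqs. (17)–(22) together, the fitness of a test suite can be sketched as follows. This is a minimal illustration under stated assumptions: the branch-distance function `f` and the normalized weights are supplied by the caller, and the names are illustrative rather than the paper's implementation.

```python
# Eq. (17): for each branch, take the minimum branch distance over all test
# cases in the suite, weight it by the branch weight (nesting level and
# predicate complexity, Eqs. 18-22), square the weighted sum, and invert.

DELTA = 0.01  # the constant delta from Eq. (17)

def fitness(test_suite, branches, f, weights):
    """test_suite: iterable of test data X_k; branches: branch ids;
    f(branch, x): distance of test data x from covering the branch
    (0 means covered); weights[branch]: normalized branch weight."""
    total = sum(
        weights[b] * min(f(b, x) for x in test_suite) for b in branches
    )
    return 1.0 / (DELTA + total ** 2)
```

When every branch is covered by at least one test case, each minimum distance is zero, so the fitness reaches its maximum of 1/δ = 100, exactly as the text states.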
4.1 Experiment platform
The suggested method was implemented in MATLAB together with previous test
generators based on SA [6], GA [6], PSO [8], ABC [12], and ACO [13] algorithms.
All implemented test generators were executed on the same hardware and software platform to produce trustworthy findings. Table 7 lists the configuration parameters of the examined test-generation algorithms. A computer system with an Intel Core i7 processor and 8 GB of RAM was employed to evaluate the suggested method and the alternative approaches. In this study, the following evaluation criteria were used:
1. Coverage (AC) is a metric that quantifies how well the generated test data covers
various software branches.
2. The success rate (SR) shows how likely the generated test data is to cover all of a program's branches.
3. The Average Generation (AG) indicates the average number of iterations an algorithm requires to cover all program branches.
4. Average Time (AT) displays the average time needed to attain its maximum cover-
age.
5. The Mutation Score (MS) indicates the probability that test data will discover a
fault. Mutation testing tools (like Mujava) have been used to inject faults into the
program under test.
Numerous experiments have been carried out to address the research questions:
1. RQ1: Can the suggested test generator (ML + HHOA) achieve higher branch
coverage?
2. RQ2: Can the suggested test generator enhance the success rate?
3. RQ3: Can the suggested test generator minimize the AG and AT?
4. RQ4: Is the suggested method capable of generating high coverage and fault-
detection data?
Each method was run 50 times for each benchmark program to find the average
value of these criteria.
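Averaging the five criteria over the 50 runs could be done as below. The run-record fields are assumptions for illustration, not the paper's logging format.

```python
# Hedged illustration: aggregate AC, SR, AG, AT, and MS over repeated runs.

def summarize(runs):
    """runs: list of dicts with keys 'coverage' (%), 'full_coverage' (bool),
    'generations' (iterations), 'time' (seconds), 'mutation_score' (%)."""
    n = len(runs)
    return {
        "AC": sum(r["coverage"] for r in runs) / n,         # avg coverage
        "SR": 100 * sum(r["full_coverage"] for r in runs) / n,  # % full-cov runs
        "AG": sum(r["generations"] for r in runs) / n,      # avg generations
        "AT": sum(r["time"] for r in runs) / n,             # avg time
        "MS": sum(r["mutation_score"] for r in runs) / n,   # avg mutation score
    }
```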
4.2 Benchmarks
Ten standard programs with varying degrees of complexity were utilized to assess
the effectiveness of the suggested test generator and other applied techniques.
Table 8 lists the attributes of all these benchmark programs, which were also
employed in earlier research. The suggested test generator and the earlier techniques were used to generate test data for these benchmark programs, and all were run on the same hardware and software platform in order to get accurate results. Real programs have their source code organized into classes, functions, and components; by programming standards, a function includes between 20 and 100 lines of code. These studies employed benchmark programs that are widely used and well known in software-testing investigations. Furthermore, these benchmark programs cover every programming construct that can be used to implement real and practical applications, using all operators and statements relating to conditions, loops, logic, arithmetic, and jumps. The control flow graphs of the benchmarks show that their cyclomatic complexity is higher than that of typical real applications.
4.3.1 Average coverage
A range of experiments was carried out to examine the suggested method using the evaluation criteria outlined above. The first metric is the average coverage (AC) of the generated test data. Each approach was run fifty times to find the average branch coverage. Figure 7 shows the average branch coverage for the various benchmark programs based on test data collected over 50 runs of the test-generation methods. The outcomes demonstrate that the produced test data has larger coverage in most benchmark programs; test data with higher branch coverage improves fault identification, so the test data created by the method is more productive and efficient with respect to the AC metric. The average branch coverage obtained by executing the various methods is displayed in Fig. 8. The suggested approach yields
Fig. 7 Average branch coverage of test data created by different test generators in 50 runs
99.95%, 99.98%, and 99.84% rates in applications with higher cyclomatic complexity, like Triangle, CalDay, and ComplexMethod. The suggested approach attained an average coverage of 99.99% in the comparative application, compared to 99.90%, 99.94%, and 99.99% for the PSO, ACO, and ABC algorithms. In most benchmark programs, the suggested algorithm outperformed the alternative approaches. One of its primary benefits is eliminating insensitive instructions; insensitive instructions have a lower error-propagation rate and are identified by the first stage of the suggested method, so the test generator concentrates on the programs' sensitive instructions. The fault-detection likelihood of test data is a function of its coverage: the possibility of fault detection increases with coverage.
(Figure 9: average success rate of each test generator per benchmark program; Figure 10: overall success rate of SA, GA, PSO, ACO, ABC, HHOA, and ML + HHOA)
Regarding the results shown in Fig. 7, the average branch coverage of the suggested method, when used with the ML-based slicer, is better than that of the pure HHOA in most cases. Indeed, eliminating the non-error-propagating instructions improves the coverage of the test generators. Figure 8 displays the AC of the different methods for each benchmark program. The suggested method produced test data with an average coverage of 99.93%, while the AC of the test data generated by ACO, ABC, and PSO is roughly 99.77%, 99.83%, and 99.57%, respectively.
A second set of experiments was carried out to address the second research question (RQ2). This investigation compared the suggested method's success rate with that of the related approaches. The success rate (SR) indicates how often a generator produces test data with 100% coverage. Each test data generator was run 50 times for every benchmark application; the fraction of runs in which a method achieved maximum coverage is the SR of the technique. Figure 9 displays the SR results. As per the outcomes, the HHOA, ABC, and ML + HHOA methods had greater SRs than the other approaches, and the ML + HHOA approach outperforms the other techniques in terms of both AC and SR.
The results (shown in Fig. 10) demonstrate that the suggested method provides a 98.93% SR in the experiments. After ML + HHOA, the HHOA and ABC approaches outperform the others. In most benchmark programs, ML + HHOA outperformed the other algorithms, particularly in larger and more complicated programs; the outcomes also demonstrated that PSO and ACO had higher success rates than SA and GA. In light of the SR criterion, it can be concluded that the suggested approach produces test data more efficiently than competing approaches.
Convergence speed is the third criterion, emphasizing the time and effort required to generate test data. A valuable measure of algorithm convergence is the number of iterations needed to produce high-coverage test data: the fewer iterations required to obtain adequate test data, the more successful and efficient the process. The average convergence speed of the techniques throughout 50 implementations is displayed in Fig. 11. The results demonstrate that the suggested test generator has a lower average convergence count in all benchmark programs; consequently, the proposed method requires fewer rounds to generate test data with optimal coverage. Figure 11 displays the number of iterations needed to obtain maximum coverage over 50 executions, and Fig. 12 displays the average convergence speed of the various algorithms across all benchmark programs. The acquired results indicate that the optimal test
Fig. 11 The average iterations needed to generate maximum-coverage test data by generators in 50 runs
(Figure 12: average number of generations of SA, GA, PSO, ACO, ABC, HHOA, and ML + HHOA; Figure 13: average execution time in seconds)
data was achieved with an average of 5.74 iterations using the proposed approach.
On the other hand, HHOA without the ML-based slicer produces such test data after roughly 6.41 iterations. The term "optimal test data" in this study refers to the data with the highest branch coverage and fault-discovery rate. Therefore, the suggested ML + HHOA is quite efficient in terms of convergence speed.
The average execution time of each method was considered as an additional cri-
terion. Figure 13 displays the average execution times over 50 distinct runs across
the various benchmarks, and the suggested strategy was compared with the other
approaches on this basis. The proposed method performed faster for most programs.
Overall, it can be stated that the suggested ML + HHOA performs better in large and
complex programs.
Mutation testing was used to assess whether the test data created by the suggested
approach functions correctly. Using the MuJava tool, the program's source code was
altered to contain many faults (mutants) [19, 20], and the injected faults were then
discovered using the supplied test data. MuJava assesses the mutation score of the
test data. Table 9 lists the mutation operators that were used to inject faults into the
software. Following fault injection, the faulty (buggy) program is tested using the
created test data, and the fault-detection rate of the test data (also known as the
mutation score) is ascertained.
The mutation score (MS) represents the capacity of each test dataset to identify the
injected faults. Every mutation operator was applied to every instruction in the
program's source code, so the likelihood of fault injection is the same for every
instruction. The mutation scores, i.e., the fault-identification capacity of each test
dataset, are displayed in Table 10. The outcomes validate the suggested method's
value in producing fault-revealing data. The proposed HHOA covered the
hard-to-cover codes that are not covered by SA, PSO, GA, and ACO.
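The mutation score defined above (killed mutants divided by total mutants) can be sketched in a few lines. The function and mutants below are hypothetical toy examples, not the benchmark programs or MuJava's actual operators; the third mutant illustrates an equivalent mutant that no test input can kill.

```python
def original(a, b):
    return a + b

# Hypothetical mutants of `original`, in the spirit of arithmetic
# operator replacement at the method level:
mutants = [
    lambda a, b: a - b,   # + replaced by -
    lambda a, b: a * b,   # + replaced by *
    lambda a, b: a + b,   # equivalent mutant (never killed)
]

def mutation_score(test_inputs):
    """MS = killed mutants / total mutants. A mutant is killed when
    some test input makes its output differ from the original's."""
    killed = sum(
        1 for m in mutants
        if any(m(*t) != original(*t) for t in test_inputs)
    )
    return killed / len(mutants)

print(mutation_score([(2, 2)]))  # 1/3: 2*2 == 2+2, so only a-b is killed
print(mutation_score([(2, 3)]))  # 2/3: only the equivalent mutant survives
```

The example also shows why fault-revealing test data matters: the weaker input (2, 2) kills one mutant, while (2, 3) kills both killable mutants.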
Table 11 Mutation test tools' specification

Tool name    Mutation operator     Mutation level
MuJava       Class and Function    Byte code
MuClipse     Class and Function    Byte code
PITest       Function              Byte code
Jester       Function              Java code
Jumble       Function              Byte code
JavaLancer   Function              Byte code
Table 13 The number of injected faults (mutants) generated by each tool per benchmark program (rightmost column: total)

PITest       13   10   16   44   23    9    9    124
MuClipse     81   47  108  310  106   40   67    759
MuJava      128   77  155  445  167   66   98   1136
Jester       10    9   13   47   24   11   11    125
Jumble       16   11   20   47   31   14   11    150
JavaLancer   19   12   26   67   59   20   11    214
For further investigation of the proposed strategy's efficiency, HHOA was utilized
to create test data. Several mutation testing tools (Jumble, MuJava, PITest,
MuClipse, JavaLancer, Jester, and Judy) were used to assess the test data created by
HHOA. The tools utilized for mutation analysis are listed in Table 11. A new set of
experiments was employed to determine the fault-discovery rate of the testing
methods using the various mutation tools; the parameters of the second set of
models are displayed in Table 12.
Table 13 displays the quantity of injected faults (mutants) produced by the various
tools for the various programs. The tools introduce faults that differ in number,
type, and location; MuClipse and MuJava created the majority of the injected
faults. The outcomes of the ACO-based mutation testing on the second set of
benchmarks are displayed in Fig. 14. The various mutation tools were used
to calculate the mutation scores. The kind and quantity of mutations (injected
Fig. 14 The average mutation score of the produced test data by the ACO
Fig. 15 The average mutation score of the produced test data by the PSO
faults) produced by the various tools account for the variation in the mutation
scores among them.
Figures 15 and 16 show the mutation scores of the test data generated by PSO
and ML + HHOA. The results obtained with the different tools (MuClipse, PITest,
Jester, MuJava, JavaLancer, and Jumble) confirm the efficiency of the suggested
technique. The higher mutation score of the test data generated by ML + HHOA
indicates its higher bug-detection capability: the test data produced by the
suggested approach can identify the various faults injected by the different tools.

Fig. 16 The average mutation score of the produced test data by the suggested ML + HHOA

A comparative evaluation with the various mutation testing tools demonstrates
the recommended method's superiority over the other test-generation techniques.
The average mutation score of the test data created by the different techniques
across all mutation testing tools is displayed in Fig. 18; the average scores
confirm the superiority of the suggested approach over the other methods.

Fig. 18 The average fault discovery rate of the test generators in different mutation testing tools
Slicing the input program before test generation is the main advantage of
ML + HHOA over the pure HHOA-based method. In the first stage of ML + HHOA,
the developed ML-based classifier categorizes the program's instructions into two
classes; the instructions classified as non-sensitive are eliminated before the
test-generation stage. The program sliced by the developed ML-based slicer is
Fig. 19 The number of generated mutants for different programs to evaluate the test data generated by
the non-slice-based method (HHOA) and the slice-based method (ML + HHOA)
Fig. 20 The mutant reduction percentage provided by the slice-based method (ML + HHOA)
covered by the test data generated by the HHOA. In the third stage, the sliced pro-
gram, instead of the original program, is targeted for the mutation test. Figure 19
indicates the number of mutants generated by MuJava for the original program and
for the program sliced by the suggested ML-based classifier. In the suggested
ML + HHOA, only the sensitive instructions are considered for bug injection;
hence, the number of injected bugs is reduced. Based on the results shown in
Fig. 20, the proposed ML-based slicer significantly reduces the number of produced mutants.
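The slicing step and the resulting mutant reduction can be sketched as follows. This is an illustrative sketch only: `predict_sensitive` is a hypothetical rule-based stand-in for the paper's trained ML classifier, the five-line "program" is invented, and the 128/102 mutant-count pair is one original-versus-sliced example of the kind shown in Fig. 19.

```python
def predict_sensitive(instruction):
    """Stand-in for the trained ML classifier: a hypothetical rule
    that flags instructions whose result can propagate to the output.
    The paper trains a real model on instruction features instead."""
    return "return" in instruction or "=" in instruction

program = [
    "int s = 0;",
    'log.debug("enter");',   # non-sensitive: no effect on the output
    "s = s + x;",
    'log.debug("exit");',    # non-sensitive
    "return s;",
]

# Stage 1: keep only the instructions classified as sensitive.
sliced = [ins for ins in program if predict_sensitive(ins)]

def mutant_reduction(original_count, sliced_count):
    """Reduction in injected mutants when only the sliced (sensitive)
    instructions are mutated, as reported in Fig. 20."""
    return 1 - sliced_count / original_count

# With an original/sliced mutant-count pair of the kind in Fig. 19:
print(f"{mutant_reduction(128, 102):.0%}")  # 20%
```

Because mutants are injected only into the sliced instructions, the mutation-testing cost shrinks in proportion to the eliminated non-sensitive code.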
Figure 21 shows the source code of the TriangleType benchmark program and its
hard-to-cover codes, which were covered by the proposed test generator. The sug-
gested method can cover hard-to-cover codes in different programs that the other
test generators can hardly reach. The number on the left side of each instruction
indicates its number of executions in the experiments. The unexecuted (highlighted)
instructions are complicated, and the other methods hardly cover them; meanwhile,
the suggested ML + HHOA can cover these hard-to-cover instructions. Table 14
presents the ANOVA (analysis of variance) statistical test used to compare the
success rates of the different test-generation techniques. The p-value is 0.000006,
which is lower than 0.05; hence, the null hypothesis is rejected and there is a
statistically significant difference between the performance of the suggested method
and that of the previous methods.
Fig. 21 The hard-to-cover codes of the TriangleType program, which were covered by the suggested test
generator
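The one-way ANOVA used for Table 14 can be reproduced in outline. The sketch below computes the F statistic from scratch (between-group variance over within-group variance); the per-run success-rate samples are hypothetical placeholders, not the study's measurements, and a real analysis would also derive the p-value from the F distribution (e.g., via `scipy.stats.f_oneway`).

```python
def one_way_anova_F(groups):
    """F statistic of a one-way ANOVA: mean square between groups
    divided by mean square within groups."""
    k = len(groups)                      # number of methods compared
    n = sum(len(g) for g in groups)      # total number of observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups
    )
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical per-run success rates for three generators; the study
# compares more methods over 50 runs each.
sa  = [0.78, 0.80, 0.79, 0.81]
pso = [0.85, 0.86, 0.84, 0.87]
ml  = [0.98, 0.99, 0.99, 1.00]
F = one_way_anova_F([sa, pso, ml])
print(F)  # a large F implies rejecting the null hypothesis
```

A large F with a p-value below 0.05, as in Table 14, indicates that the mean success rates of the methods differ beyond what chance would explain.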
The pure HHOA without an ML-based slicer also generates near-optimal test data;
however, owing to the slicing of the ineffective instructions, the proposed test
generator performed faster for the majority of the benchmark programs. The results
confirm the usefulness of the proposed approach in generating fault-revealing data.
The hard-to-cover codes that are not covered by SA, PSO, GA, and ACO were
addressed by the proposed HHOA. The proposed method, even with the various
mutation tools, discovers a higher percentage of faults than the other methods.
The average mutation score of the test data produced by ML + HHOA for all
benchmarks is approximately 89.38%.
Disadvantages: the suggested instruction classifier produces false alarms; some
non-sensitive instructions were classified as sensitive, and some sensitive
instructions were classified as non-sensitive. Since the suggested test generator
relies on randomness and approximation, it does not guarantee 100% coverage. If a
function contains deeply nested conditions, the algorithm might struggle to generate
test inputs that reach all execution paths. The suggested HHOA-based test generator
works best with numerical or structured input spaces but struggles with GUI testing,
event-driven systems, and complex-typed inputs. Another disadvantage of the sug-
gested two-stage test generator is its computational cost: the ML-based instruction
classification combined with the HHOA test generator imposes a time overhead.
5 Conclusion
Authors contribution B. Arasteh and K. Arasteh performed method suggestions, algorithm design, cod-
ing, and implementation. B. Arasteh, A. Ghaffari performed data gathering and data analysis. Experi-
ments, results analysis and manuscript writing were performed by B. Arasteh, K. Arasteh and A. Ghaffari.
Funding Open access funding provided by the Scientific and Technological Research Council of Türkiye
(TÜBİTAK).
Data availability statement The data and subject programs created throughout the investigation are
publicly available on Google Drive. https://drive.google.com/drive/folders/1n8oXY2k1UPwipU5Z5Pm
qUAmbk8ku9XCa?usp=drive_link.
Declarations
Conflict of interests The author declares that throughout this research, they did not receive any funds or
grants. There are no financial or non-financial conflicts of interest that the author needs to report.
Ethical approval The data and information utilized in this study were created by the researchers them-
selves. They do not belong to any other individual or organization. Other researchers will have access to
the research’s data. The authors from Iran are not employed by the Iranian government and don’t have any
governmental job or duty. They are preparing articles in their personal capacity and interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permis-
sion directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/
licenses/by/4.0/.
References
1. Nouwou Mindom PS, Nikanjam A, Khomh F (2023) A comparison of reinforcement learn-
ing frameworks for software testing tasks. Empir Software Eng 28:111. https://doi.org/10.1007/
s10664-023-10363-2
2. Lin JC, Yeh PL (2001) Automatic test data generation for path testing using GAs. J Inform Sci
131(1):47–64
3. Khatun S, Rabbi KF, Yaakub CY, Klaib MFJ (2011) A random search based effective algorithm for
pairwise test data generation. In: International Conference on Electrical, Control and Computer
Engineering (InECCE 2011), pp 293–297. https://doi.org/10.1109/INECCE.2011.5953894
4. Eler MM, Endo AT, Durelli VHS (2016) An empirical study to quantify the characteristics of
Java programs that may influence symbolic execution from a unit testing perspective. J Syst Softw
121:281–297
5. Cadar C, Sen K (2013) Symbolic execution for software testing: three decades later. Com-
mun ACM 56(2):82–90
6. Cohen MB, Colbourn CJ, Ling ACH (2003) Augmenting simulated annealing to build interac-
tion test suites. In: Proceedings of the Fourteenth International Symposium on Software Reliability
Engineering (ISSRE'03), pp 394–405
7. Sharma C, Sabharwal S, Sibal R (2014) A survey on software testing techniques using genetic algo-
rithm. Int J Comput Sci 10(1):381–393
8. Esnaashari M, Damia AH (2021) Automation of software test data generation using genetic algo-
rithm and reinforcement learning. Expert Syst Appl 183:115446
9. Mao C (2014) Generating Test Data for Software Structural Testing Based on Particle Swarm Opti-
mization. Arab J Sci Eng 39(6):4593–4607
10. Kaur A, Bhatt D (2011) Hybrid particle swarm optimization for regression testing. Int J Comput Sci
Eng 3(5):1815–1824
11. Ahmed BS, Zamli KZ (2011) A variable strength interaction test suites generation strategy using
particle swarm optimization. J Syst Softw 84:2171–2185
12. Ghaemi A, Arasteh B (2020) SFLA-based heuristic method to generate software structural test
data. J Softw Evol Process 32(1)
13. Aghdam ZK, Arasteh B (2017) An efficient method to generate test data for software structural test-
ing using artificial bee colony optimization algorithm. Int J Softw Eng Knowl Eng 27(6):951–966
14. Mao C, Xiao L, Yu X, Chen J (2015) Adapting ant colony optimization to generate test data for soft-
ware structural testing. J Swarm Evolut Comput 20:23–36
15. Arasteh B, Hosseini SMJ (2022) Traxtor: An automatic software test suit generation method
inspired by imperialist competitive optimization algorithms. J Electron Test 38:205–215. https://doi.
org/10.1007/s10836-022-05999-9
16. Martou P, Mens K, Duhoux B, Legay A (2023) Test scenario generation for feature-based context-
oriented software systems. J Syst Softw 197:111570. https://doi.org/10.1016/j.jss.2022.111570
17. Sulaiman RA, Jawawi DN, Halim SA (2023) Cost-effective test case generation with the hyper-
heuristic for software product line testing. Adv Eng Softw 175:103335. https://doi.org/10.1016/j.
advengsoft.2022.103335
18. Arasteh B, Arasteh K, Kiani F, Sefati SS, Fratu O, Halunga S, Tirkolaee EB (2024) A bioinspired
test generation method using discretized and modified bat optimization algorithm. Mathematics
12:186. https://doi.org/10.3390/math12020186
19. MiarNaeimi F, Azizyan G, Rashki M (2021) Horse herd optimisation algorithm: A nature-inspired
algorithm for high-dimensional optimisation problems. Knowledge-Based Syst 213:106711. https://
doi.org/10.1016/j.knosys.2020.106711
20. Hosseini MJ, Arasteh B, Isazadeh A, Mohsenzadeh M, Mirzarezaee M (2020) An error-propagation
aware method to reduce the software mutation cost using genetic algorithm. Data Technol Appl
55(1):118–148. https://doi.org/10.1108/DTA-03-2020-0073
21. Shomali N, Arasteh B (2020) Mutation reduction in software mutation testing using firefly optimi-
zation algorithm. Data Technol Appl 54(4):461–480. https://doi.org/10.1108/DTA-08-2019-0140
22. Gharehchopogh FS, Abdollahzadeh B, Arasteh B (2023) An improved farmland fertility algorithm
with hyper-heuristic approach for solving travelling salesman problem. CMES-Comput Model Eng
Sci 135(3):1981–2006. https://doi.org/10.32604/cmes.2023.024172
23. Arasteh B, Sadegi R, Arasteh K (2021) Bölen: software module clustering method using the com-
bination of shuffled frog leaping and genetic algorithm. Data Technol Appl 55(2):251–279. https://
doi.org/10.1108/DTA-08-2019-0138
24. Arasteh B (2022) Clustered design-model generation from a program source code using chaos-
based metaheuristic algorithms. Neural Comput Appl. https://doi.org/10.1007/s00521-022-07781-6
25. Arasteh B, Abdi M, Bouyer A (2022) Program source code comprehension by module clustering
using combination of discretized gray wolf and genetic algorithms. Adv Eng Softw 173:103252.
https://doi.org/10.1016/j.advengsoft.2022.103252
26. Arasteh B, Bouyer A, Pirahesh S (2015) An efficient vulnerability-driven method for hardening a
program against soft-error using genetic algorithm. Comput Elect Eng 48:25–43. https://doi.org/10.
1016/j.compeleceng.2015.09.020
27. Arasteh B, Arasteh K, Ghaffari A et al (2024) A new binary chaos-based metaheuristic algorithm
for software defect prediction. Cluster Comput. https://doi.org/10.1007/s10586-024-04486-4
28. Khleel NAA, Nehéz K (2023) Software defect prediction using a bidirectional LSTM network com-
bined with oversampling techniques. Cluster Comput. https://doi.org/10.1007/s10586-023-04170-z
29. Malhotra R, Bansal A, Kessentini M (2024) Deployment and performance monitoring of docker
based federated learning framework for software defect prediction. Cluster Comput. https://doi.org/
10.1007/s10586-024-04266-0
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
* Bahman Arasteh
[email protected]
1 Department of Software Engineering, Faculty of Engineering and Natural Science, Istinye University, Istanbul, Türkiye
2 Department of Computer Science, Khazar University, Baku, Azerbaijan
3 Applied Science Research Center, Applied Science Private University, Amman, Jordan