
The Journal of Supercomputing (2025) 81:741
https://doi.org/10.1007/s11227-025-07219-5

An automatic software test-generation method to discover the faults using fusion of machine learning and horse herd algorithm

Bahman Arasteh · Keyvan Arasteh · Ali Ghaffari

Accepted: 19 March 2025
© The Author(s) 2025

Abstract
Software testing, which is used to improve the quality of software systems, is one of the most time-consuming and expensive phases of software development. Test automation is therefore a helpful technique for reducing testing time. Several techniques based on evolutionary and heuristic algorithms have been put forth to produce maximum-coverage test sets. The primary shortcomings of earlier methods are inconsistent outcomes, insufficient branch coverage, and low fault-detection rates. Increasing the branch-coverage rate, defect-detection rate, success rate, and stability are the primary goals of this research. A time- and cost-effective method is suggested that produces test data automatically by combining machine learning with the horse herd optimization algorithm. In the first stage of the proposed method, a machine learning classification model identifies the non-error-propagating instructions of the input program. In the second stage, a test generator is suggested that covers only the program's fault-propagating instructions. The main characteristics of the produced test data are avoiding coverage of non-error-propagating instructions, maximizing coverage of error-propagating instructions, maximizing the success rate, and maximizing fault-discovery capability. Several experiments were performed using nine standard benchmark programs. In the first stage, the suggested instruction classifier achieves 90% accuracy and 82% precision. In the second stage, the test data produced by the suggested method cover 99.93% of the error-prone instructions, the average success rate is 98.93%, and the method identifies roughly 89.40% of the faults injected by mutation-testing tools.

Keywords Software testing · Sensitive-instruction classification · Branch coverage · Machine learning · Horse herd optimization algorithm · Fault-detection score

Extended author information available on the last page of the article


1 Introduction

The software testing procedure is crucial to guarantee the product’s quality. Soft-
ware testing can be carried out automatically or manually. Automated testing tech-
niques save money and time compared to manual testing, which is labor-intensive.
Because of its significance, researchers view automated software testing as a signifi-
cant challenge and area of concern. Using traditional manual testing techniques on
large-scale real software systems can be expensive and time-consuming. Both the
cost and the testing time can be significantly decreased by using automated software
testing. The problem of automatically generating optimal unit tests is the main topic
of this work. One of the primary optimization challenges is to automatically gener-
ate test data with maximal branch coverage in the least amount of time. Choosing a
small subset from all possible input combinations with maximum program branch
coverage is an NP (Nondeterministic Polynomial Time) complete problem [1, 2].
Many heuristic methods have been proposed to create test sets with maximal coverage [3–5]. Low success rates, inadequate coverage, higher computational cost, inadequate stability, and an inadequate ability to cover hard-to-cover instructions are the main drawbacks of previous test-generation methods. The main objectives of the present study are:

• Generating test data with the maximum fault-detection rate
• Reducing the number of test data required to cover the error-propagating instructions
• Reducing the time required for effective test generation
• Enhancing the success rate of the test generators
• Maximizing the stability of the test generators

This study suggests an efficient approach to generate test data automatically. A machine learning (ML)-based method is proposed to classify the program's instructions into fault-masking (non-propagating) and fault-propagating instructions. Non-propagating (fault-masking) instructions are non-sensitive instructions with a negligible propagation likelihood; in this study, instructions with a 0–9% fault-propagation rate are categorized as non-sensitive. In the second stage, the horse herd optimization algorithm (HHOA) is utilized to generate effective test data. The suggested test generator covers only the fault-propagating instructions. The main objective of the proposed method is to create test data that provides the highest branch coverage of the fault-propagating (sensitive) instructions within a given time limit. The primary contributions of this study are:

• Automatic classification of the program instructions by ML algorithms to determine the error-propagating (sensitive) instructions.
• Discretizing and enhancing the horse herd optimization algorithm (HHOA) to produce optimized test data.
• Automatic generation of a minimum number of high-coverage test data with the highest fault-discovery rate in a minimum time.


• Covering only the program's error-propagating code to discover propagating faults by the suggested HHOA.
• Developing an automatic test tool with a high success rate.

The rest of this paper is organized as follows. Section 2 describes the basic theoretical ideas and gives an overview of the existing test-generation methods. Section 3 presents and explains the suggested test-generation method in detail. Section 4 discusses the experiments, benchmark programs, and results in terms of coverage, fault-detection rate, and stability. Finally, Sect. 5 concludes the results and outlines future research.

2 Related works

A random-based test-generation method was suggested in [3]. The random method requires more time to attain the necessary coverage because it produces redundant and repetitive data; furthermore, creating a small number of effective test data with random methods is practically impossible. Researchers have suggested symbolic execution to enhance the performance of existing test-generation methods [4, 5]. Symbolic testing is an effective and fast way to automatically create test data; these methods try to create high-coverage test data within a limited time. In [6], a simulated annealing (SA) algorithm created high-coverage test data. The suggested SA generated effective test data for the input programs; by avoiding local optima, SA increases the likelihood of finding the global optimum (high-coverage test data). The primary drawbacks of this approach are its limited efficiency and insufficient coverage. SA is sensitive to the values of its calibration parameters, and tuning the parameters to generate maximum-coverage test data is challenging.
In [2], a technique to create test data through a genetic algorithm (GA) was suggested. In this technique, GA was used to select the best execution path; the study adapted GA to generate test data based on code and path coverage. The results indicated an improvement in the generation of test data, and the time required to determine the best paths was reduced by using GA. In [6], another GA-based approach for generating test data was presented; this study used GA to achieve the best possible test results and to increase efficiency and effectiveness. Six standard programs were used to evaluate the coverage obtained using the suggested method, and the results indicated that the generated test data had improved. One of the primary issues with GA is that chromosomes cannot evolve independently; they can only be improved by the mutation operators, and a subset of chromosomes cannot be assessed individually. The longer time required to generate high-coverage test data is another drawback of this method.
A method using GA and reinforcement learning was suggested for automatically creating test data, to fix the low performance of the previous methods [7]. By employing reinforcement learning, this method narrows GA's search space and avoids the generation of redundant and repetitive test data. Experiments show that this hybrid technique outperforms GA in terms of coverage and success rate. In [8], scientists employed the particle swarm optimization (PSO) technique to produce test data. Similarly, the combination of PSO and regression analysis was suggested in [9] to provide high-coverage test data. Furthermore, because the PSO approach is simple to apply and generates test data quickly, it was utilized in [10] to generate test data with a distinct objective function; that research suggests branch-distance functions to cover fault-prone or crucial test paths.
In [11], the shuffled frog leaping algorithm (SFLA) was introduced to produce test data. The SFLA method is simple to use and has a quick convergence rate. The proposed SFLA uses branch coverage as the fitness function (goal) to provide meaningful test data. The performance of the suggested SFLA was compared with seven existing test generators using benchmark programs, including the artificial bee colony (ABC), genetic algorithm (GA), ant colony optimization (ACO), and particle swarm optimization (PSO) methods, in terms of coverage and success rate. The findings show that in terms of average success rate, branch coverage, and average number of iterations to cover all branches, the SFLA-based approach performs better than PSO, GA, and ACO. Several techniques, including the artificial bee colony (ABC), simulated annealing, particle swarm, genetic algorithm, and others, were compared and evaluated in [12], using a distance function based on branch coverage as the fitness function. According to the results, ABC outperforms SA, PSO, GA, and ACO in terms of coverage and success rate; the ABC algorithm performed best in producing optimal test data.
A strategy that uses the ant colony optimization (ACO) algorithm was presented in [13] to produce high-coverage test data; a unique fitness function was developed using the coverage criterion. The experimental findings demonstrate that this method yields more stable outcomes with increased coverage and convergence. An improved fitness function is utilized in the test-generation method of [14], in which the imperialist competitive algorithm (ICA) was developed as a test-generation technique that tries to cover the program's fault-prone test paths. Every approach has pros and cons, succinctly outlined in Table 1. The present study employs an automated test-generation method based on machine learning and the horse herd optimization algorithm. Unlike the previous methods, the suggested method identifies the fault-prone parts of the program before test generation and covers the identified fault-prone instructions instead of all instructions. A detailed discussion of the suggested approach is provided in Sect. 3.
In [15], a test-generation method was suggested to automatically generate a small set of test scenarios for feature-based context-oriented programs; minimizing the number of context activations between tests is the main objective of this method. In that study, two strategies were proposed to reduce the number of supplementary test scenarios needed to regenerate complete pairwise coverage. By using this method instead of generating a completely new test suite, the cost of creating a test suite upon evolution of the system was reduced by 75%. In [16], a hyper-heuristic-based test-generation method was proposed. This method is implemented using three search operators: a genetic algorithm with low-level heuristics, particle swarm optimization with low-level heuristics, and the strength Pareto evolutionary algorithm with low-level heuristics. The method was evaluated with a test model; the results show that the proposed method outperforms other existing methods in terms of test-suite size, execution time, and coverage criteria.

Table 1  The existing software test-generation techniques' specifications

Method | Merits | Demerits
Modified GA [2] | Higher coverage and performance | Low success rate
Random search [3] | Simplicity of implementation | Lower coverage and lack of an objective function
SA-based method [6] | Faster than a random search | Low success rate and coverage
GA-based method [6, 7] | Higher coverage and performance | Inadequate coverage
PSO-based method [8–10] | High implementation speed and simplicity | Low success rate and insufficient coverage
SFLA-based method [11] | Higher stability, coverage, and success rate | Computational cost (due to sorting and shuffling)
ABC-based method [12] | Higher coverage and speed | Inadequate stability
ACO-based method [13] | Higher coverage and performance | Inadequate performance and stability
ICA-based method [14] | Higher coverage and convergence | Inadequate stability
Model-based method [15] | Scenario-based coverage | Model-dependent
Model-based method [16] | Higher state coverage and performance | Needs to balance coverage and cost
BAT-based method [17] | Higher branch coverage | Inadequate ability to cover hard-to-cover instructions


However, in this method there is still inconsistency in the performance overhead required to trade off effectiveness against cost, owing to the complexity and delay of the method's ranking process.
In [17], a bioinspired approach using a discretized bat optimization algorithm (BOA) was suggested for automated test-data generation. The method analyzes program code to identify branches and iteratively generates effective test data based on branch coverage. Experimental results show that it achieves 99.95% branch coverage, operates 16 times faster than similar methods, and has a 99.85% success rate. However, this method takes all test paths into consideration, although some test paths in real programs have higher error propagation. Therefore, the lack of priority given to the error-propagating paths is the main drawback of this method; the objective function should prioritize the paths with a higher error-propagation rate.

3 The method

As shown in Fig. 1, the suggested method includes two main stages:

• First stage: program preprocessing
• Second stage: test generation

In the first stage, the program's ineffective (insensitive) instructions are eliminated, reducing the program's size. The program instructions are first categorized according to their rate of error propagation (sensitivity) by the created machine learning (ML)-based classifier during preprocessing. In the second stage, a heuristic method is used to generate test data that covers the branches of the sliced program. The source code of the sliced program is statically analyzed to extract the required structural information: the source code is parsed, and the data types and number of input data, branches, and execution paths are determined. The coverage-based test data is generated in this second stage using the horse herd optimization algorithm (HHOA), with a fitness function defined over branch coverage. The test suite produced by HHOA is the output. Lastly, mutation testing assesses the fault-detection likelihood of the generated test data.

3.1 Stage 1: Program preprocessing

3.1.1 Training dataset

In real-world programs, specific code segments are written to satisfy quality and non-functional requirements without affecting the final product of the program. Studies, such as that by Arasteh et al. [25], indicate that a sizable percentage of program instructions do not affect the program outputs. Mutations (injected bugs) in these insignificant (non-sensitive) code segments are considered equivalent and ineffective.

Fig. 1  The process of the suggested test generator

Consequently, removing these sections improves the effectiveness of testing techniques. In addition, aside from insignificant portions, in interactive and real-world systems a significant amount of code is devoted to quality features such as usability and user-friendliness. Therefore, covering and mutating these parts (non-sensitive instructions) can be avoided during test-data production. The "sliced" program, devoid of useless instructions, is the foundation of the first stage of the suggested approach, which uses machine learning (ML)-based classifiers. Figure 2 shows the source code of the Statistics benchmark program, including exception-handling code (quality-related code). This part of the code is intended to handle execution-time exceptions; the program can be executed without it. The suggested ML-based slicer can eliminate this part of the code because it does not affect the program output.

Fig. 2  The source code of the Statistics benchmark program, which includes quality-related code as non-sensitive code

Hence, the proposed ML + HHOA tries to cover the sliced program instead of all instructions. Eliminating the ineffective (non-error-propagating) instructions and covering the remaining effective instructions is one of the main merits of the proposed method; as a result, ML + HHOA provides higher coverage than the pure HHOA and previous methods.
An instruction classifier uses supervised machine learning methods to categorize instructions according to their sensitivity. This classifier is intended to examine the source code of a program and classify its instructions according to how they affect the program's output. The dataset developed in [26] (an effective dataset on program-instruction sensitivity) was used as the training and test dataset. In this dataset, information on benchmark programs with different complexities and structures has been analyzed and collected. The characteristics of the benchmark programs used to create the dataset are detailed in Table 2. The chosen benchmark programs include iterative and recursive structures, employ many data types and operators (arithmetic and logical), and are made up of the programming structures (such as if, for, and while) frequently seen in practical applications. Furthermore, the benchmark programs have higher cyclomatic complexity than conventional application programs: the triangle program has a cyclomatic complexity of around 34, whereas in traditional real and practical applications the complexity is usually less than 10. Additionally, every program uses a variety of data types and arithmetic and logical operators. The chosen benchmark programs are typically more complicated than conventional functions seen in real and practical applications.


Table 2  Specification of the benchmark programs used to generate the instruction-sensitivity dataset

Program | # LOC | # Branches (if and loop) | Cyclomatic complexity
Max | 25 | 3 | 4
Complex_Num | 60 | 9 | 20
Binary_Search | 30 | 4 | 6
Prim_Num | 30 | 4 | 6
Triangle_Finding | 37 | 10 | 34
Reverce_Num | 17 | 1 | 7
Leap_Year | 25 | 5 | 12
Statistics_Cal | 32 | 5 | 4
Bubble_Sort | 35 | 5 | 7

The training dataset's records are displayed in Table 3. Ten features are included in each record; each record (row) of the dataset describes the properties of a particular instruction in a benchmark program. An instruction's accessibility is indicated by its nesting level (feature 1); if an instruction has a nesting level of 1, it is not nested inside an if instruction. The second feature shows the number of data definitions (assignments) in the instruction. The numbers of arithmetic and logical operators are indicated by features 3 and 4, respectively. Features 5 and 6 record the number of variables and literal (intermediate) values in the instruction. The data and control dependencies among instructions were calculated using the generated control-flow graph (CFG); features 7 and 8 show how many other instructions in the code depend on the given instruction with respect to data and control dependency, respectively. Feature 9 gives the length of the instruction, counted as the number of operands and operators in it. Feature 10 depicts the overall quantity of data used in the instruction. The error-propagation rate of the instruction within the program is represented by the final (eleventh) column in Table 3.
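As a concrete illustration, each dataset record can be represented as a plain feature vector. The following Python sketch is a hypothetical rendering of one record from Table 3; the field names paraphrase the ten features and the label and are not taken from the paper:

    from dataclasses import dataclass

    # A minimal sketch of one dataset record (hypothetical field names,
    # mirroring the ten features plus the sensitivity label of Table 3).
    @dataclass
    class InstructionRecord:
        nesting_level: int        # feature 1: nesting (accessibility) level
        num_definitions: int      # feature 2: data definitions (assignments)
        num_arith_ops: int        # feature 3: arithmetic operators
        num_cond_ops: int         # feature 4: conditional/logical operators
        num_variables: int        # feature 5: variable data
        num_literals: int         # feature 6: literal (intermediate) values
        data_dependents: int      # feature 7: instructions data-dependent on this one
        control_dependents: int   # feature 8: instructions control-dependent on this one
        length: int               # feature 9: number of operands + operators
        num_uses: int             # feature 10: data used in the instruction
        sensitive: int            # label: 1 = error-propagating, 0 = non-sensitive

    # Example: the fourth row of Table 3.
    record = InstructionRecord(3, 0, 0, 1, 2, 0, 2, 0, 11, 1, 1)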
A mutation-testing (fault-injection) procedure was applied to each instruction in the benchmark programs to assess this feature. The original program was mutated using MuJava as the mutator tool: bugs (programming faults) were introduced into the program's instructions. To guarantee thorough instruction coverage, the mutated (buggy) instructions were exercised by the created test set, which consists of 100 test data points, and this procedure was repeated for each instruction in the benchmark program. Each instruction's sensitivity (error-propagation rate) was calculated by Eq. (1); to this end, each mutated instruction was executed with its covering test data. The number of times an instruction makes the program fail, divided by the total number of executions, determines the sensitivity of that instruction. An instruction with a 0–9% propagation rate is categorized as low-error-propagating (non-sensitive), while instructions with 10–100% sensitivity are classified as high-error-propagating (sensitive). For high-error-propagating instructions, the value of the eleventh feature is set to 1 in the dataset. The sensitivity thresholds of the instruction classes can be adapted to the application.


Table 3  Features of the produced dataset (the first ten features are independent; the final column is the dependent feature)

Nest. level | # Def | # Arith. op | # Cond. op | # Variable | # Literal | # Data dep. | # Control dep. | Length | # Use | Error propagation (sensitivity)
2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 1 | 0
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 0
2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 6 | 2 | 0
3 | 0 | 0 | 1 | 2 | 0 | 2 | 0 | 11 | 1 | 1
3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 7 | 1 | 1
1 | 0 | 0 | 1 | 2 | 1 | 2 | 1 | 10 | 3 | 1
2 | 1 | 1 | 0 | 1 | 2 | 1 | 2 | 6 | 2 | 1
3 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 3 | 2 | 1
2 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 6 | 1 | 1
1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0
1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 10 | 1 | 0



Table 4  Instruction classes based on the sensitivity rate

Category | Sensitivity rate
Sensitive (propagating) | 10%–100%
Non-sensitive (non-propagating) | 0%–9%

Fig. 3  The performance of the instruction classifiers (program slicers) in terms of accuracy, precision, and recall

$\text{Propagation Rate} = \frac{\#\,\text{Failures}}{\#\,\text{Executions}} \times 100 \quad (1)$
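A minimal sketch of this labeling step follows. It assumes the per-execution outcomes of a mutated instruction are already available as booleans; the helper names are illustrative, while the formula and the 10% threshold mirror Eq. (1) and Table 4:

    # A minimal sketch of Eq. (1) and the Table 4 thresholds.
    # `outcomes` is assumed precomputed: one boolean per execution of a
    # mutated instruction (True = the mutant made the program fail).

    def propagation_rate(outcomes: list) -> float:
        # Eq. (1): (# failures / # executions) * 100
        return 100.0 * sum(outcomes) / len(outcomes)

    def is_sensitive(outcomes: list) -> bool:
        # Table 4: 0-9% -> non-sensitive, 10-100% -> sensitive
        return propagation_rate(outcomes) >= 10.0

    # Example: a mutated instruction executed by 100 covering test inputs,
    # failing 12 times -> 12% propagation rate -> sensitive.
    outcomes = [True] * 12 + [False] * 88
    print(propagation_rate(outcomes), is_sensitive(outcomes))  # 12.0 True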

3.1.2 Instruction classification

A multi-layer perceptron (MLP) was used together with a selection of other machine learning (ML) methods to create the instruction classifier. The ML algorithms used the dataset of [26] for training and testing. Table 3 displays a subset of this dataset, where each instruction in a benchmark program is associated with a record (row); the output (error-propagation rate), indicated in the final column, is provided only for the learning data. Using k-fold cross-validation, the performance of the developed ML-based classifiers was assessed: the dataset was divided into k (= 10) roughly equal-sized subsets, and the model was trained and tested for k iterations, each iteration using k − 1 folds for training and the remaining fold for classifier testing. The instruction classes are displayed in Table 4; the created machine learning classifier divides the program instructions into sensitive and non-sensitive classes. Figure 3 displays the testing-stage results, which show that the MLP classifier performed better than the others regarding recall, accuracy, and precision. After the required classifier was created using ML methods, it was utilized to locate and eliminate non-sensitive instructions from the source code, functioning as an ML-based slicer. According to the results, the suggested source-code slicer (ML-based slicer) can yield a 30% reduction in program size.

Program slicing, carried out as a preprocessing step in this study, helped the next step of the suggested method (test creation), leading to shorter execution times by reducing the size of the input program.
When dealing with imbalanced datasets, the accuracy metric can be misleading, because a model may predict only the majority class and still achieve high accuracy. Hence, as shown in Fig. 3, precision and recall were calculated along with accuracy to evaluate the performance of the suggested instruction classifier.
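The classifier training described above can be sketched with scikit-learn as follows. This is a minimal illustration under the stated setup (ten features, binary sensitivity label, 10-fold cross-validation, accuracy/precision/recall reporting), not the authors' exact configuration; the MLP hyperparameters and the synthetic stand-in data are assumptions:

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_validate

    # Synthetic stand-in for the instruction-sensitivity dataset of [26]:
    # ten integer features per instruction (cf. Table 3) and a binary label.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 12, size=(500, 10))
    y = (X[:, 6] + X[:, 7] > 2).astype(int)  # toy rule standing in for real labels

    # MLP classifier evaluated with 10-fold cross-validation, reporting the
    # same metrics as Fig. 3 (accuracy, precision, recall).
    mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
    scores = cross_validate(mlp, X, y, cv=10,
                            scoring=("accuracy", "precision", "recall"))
    for metric in ("test_accuracy", "test_precision", "test_recall"):
        print(metric, round(scores[metric].mean(), 3))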

3.2 Stage 2: Test generation

3.2.1 Parsing the source code before test generation

The second stage begins by automatically analyzing (parsing) the sliced source code, which contains only the instructions that frequently propagate errors. The program source code discloses the number of inputs, their data types and domains, and the number of conditional (branch) instructions. The implemented code carries out this step of the suggested procedure automatically. The input and output of the parser (the first step of the second stage) are shown in Fig. 4, and the structure of a horse (an agent) in the developed HHOA is depicted in Fig. 5. The complexity of this parsing step is O(n).
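The parsing step can be approximated with a very small static scan. The sketch below is a simplified illustration (a regex-based count over C#-like source, as in Fig. 4) rather than the paper's actual parser; the chosen branch-keyword set is an assumption, and a real implementation would use a proper AST:

    import re

    # A simplified sketch of the structural parser: count inputs and branch
    # instructions in a C#-like method (cf. Fig. 4). Illustrative only.
    def parse_structure(source: str) -> dict:
        # Parameters of the method signature, e.g. "(int month, int year)".
        params = re.search(r"\(([^)]*)\)", source).group(1)
        inputs = [p.strip() for p in params.split(",") if p.strip()]
        # Branch instructions: if/for/while/case keywords (assumed keyword set).
        branches = re.findall(r"\b(if|for|while|case)\b", source)
        return {"inputs": inputs, "num_branches": len(branches),
                "cyclomatic_complexity": len(branches) + 1}

    src = "public void PrintCalendar(int month, int year) { for (;;) { if (x) {} } }"
    print(parse_structure(src))
    # {'inputs': ['int month', 'int year'], 'num_branches': 2,
    #  'cyclomatic_complexity': 3}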

3.2.2 Generating the test data

This study suggests an automatic technique for producing optimal test data through HHOA; this section explains the suggested HHOA-based test-data generator. HHOA starts with randomly generated populations, each member of which is referred to as a horse [18]. The proposed HHOA includes specific imitation behavior as its local and global search technique. HHOA classifies horse activities into six main divisions, based on behavioral similarities among horse age groups: grazing, hierarchy, sociability, imitation, defense mechanisms, and roaming.

public void PrintCalendar(int month, int year)
{
    DateTime firstDayOfMonth = new DateTime(year, month, 1);
    DateTime lastDayOfMonth = firstDayOfMonth.AddMonths(1).AddDays(-1);
    int daysInMonth = lastDayOfMonth.Day;
    int newline = 0;
    for (int i = 1; i <= daysInMonth; i++)
    {
        DateTime currentDate = new DateTime(year, month, i);
        if (currentDate.DayOfWeek == DayOfWeek.Saturday)
            newline++;
    }
}

Structural information of the input program:
    Input 1: integer month
    Input 2: integer year
    Num. of branches: 2
    Cyclomatic complexity: 4
    Branch instructions: Str

Fig. 4  The recommended code-parser's input and output for the printCalender benchmark program

Fig. 5  The implemented structure of an agent (horse) in the suggested HHOA for the printCalender benchmark program

Figure 6 depicts the steps of the suggested procedure.
Equation (2) describes the horse movement in HHOA at each iteration. In Eq. (2), $X_m^{Iter,AGE}$ indicates the location (position) of the mth horse, $AGE$ and $\vec{V}_m^{Iter,AGE}$ denote the age and velocity of that horse, and $Iter$ is the iteration index. The life span of a horse is 25–30 years, and horses' ages are divided into four categories, as follows:

• δ: 0 to 5 years old
• γ: 5 to 10 years old
• β: 10 to 15 years old
• α: 15 years and older
$X_m^{Iter,AGE} = \vec{V}_m^{Iter,AGE} + X_m^{(Iter-1),AGE}, \quad AGE \in \{\alpha, \beta, \gamma, \delta\} \quad (2)$

Since HHOA ranks the horses according to their best responses, the first 10% of horses from the top of the sorted matrix are selected as α horses; the remaining horses are divided into β, γ, and δ groups of twenty, thirty, and forty percent, respectively. Equation (3) characterizes the velocity vector of the horses in each cycle of the procedure; the velocity vector is composed of the six behaviors of the herd's horses.

$\vec{V}_m^{Iter,\alpha} = \vec{G}_m^{Iter,\alpha} + \vec{D}_m^{Iter,\alpha}$
$\vec{V}_m^{Iter,\beta} = \vec{G}_m^{Iter,\beta} + \vec{H}_m^{Iter,\beta} + \vec{S}_m^{Iter,\beta} + \vec{D}_m^{Iter,\beta}$
$\vec{V}_m^{Iter,\gamma} = \vec{G}_m^{Iter,\gamma} + \vec{H}_m^{Iter,\gamma} + \vec{S}_m^{Iter,\gamma} + \vec{I}_m^{Iter,\gamma} + \vec{D}_m^{Iter,\gamma} + \vec{R}_m^{Iter,\gamma}$
$\vec{V}_m^{Iter,\delta} = \vec{G}_m^{Iter,\delta} + \vec{I}_m^{Iter,\delta} + \vec{R}_m^{Iter,\delta} \quad (3)$


Fig. 6  Workflow of the discretized HHOA for test generation [18]

Horses graze throughout their lives, regardless of age. The algorithm models the grazing zone surrounding each animal with a coefficient g, so that every horse grazes in a designated area. Equations (4) and (5) express this behavior mathematically:

$\vec{G}_m^{Iter,AGE} = g_m^{Iter} \left( \check{u} + \rho \check{l} \right) \left[ X_m^{(Iter-1)} \right], \quad AGE = \alpha, \beta, \gamma, \delta \quad (4)$

$g_m^{Iter,AGE} = g_m^{(Iter-1),AGE} \times \omega_g \quad (5)$
In Eqs. (4) and (5), $\vec{G}_m^{Iter,AGE}$ indicates the grazing-motion parameter of the mth horse; in each iteration, it decreases linearly with g. The values u and l denote the upper and lower bounds of the grazing space, set to 1.05 and 0.95, respectively; the value of g is suggested to be 1.5 for all four age groups. Horses follow a leader (an experienced and powerful horse) of the herd; horses have been observed to comply with this hierarchy during middle age (5–15 years). Equations (6) and (7) describe this behavior:


$\vec{H}_m^{Iter,AGE} = h_m^{Iter,AGE} \left[ X_*^{(Iter-1)} - X_m^{(Iter-1)} \right], \quad AGE = \alpha, \beta, \gamma \quad (6)$

$h_m^{Iter,AGE} = h_m^{(Iter-1),AGE} \times \omega_h \quad (7)$
In Eqs. (6) and (7), $\vec{H}_m^{Iter,AGE}$ and $X_*^{(Iter-1)}$ indicate the effect of the best horse's location on the velocity. Horses need to interact with other herd members: a herd protects its members from predators, improving their chances of survival and lessening their vulnerability, and horses generally do better in groups, according to observations. This behavior is characterized by a tendency to move toward the average position of the other horses. HHOA expresses this tendency through the variable S, as indicated in Eqs. (8) and (9):
$\vec{S}_m^{Iter,AGE} = s_m^{Iter,AGE} \left[ \left( \frac{1}{N} \sum_{j=1}^{N} X_j^{(Iter-1)} \right) - X_m^{(Iter-1)} \right], \quad AGE = \beta, \gamma \quad (8)$

$s_m^{Iter,AGE} = s_m^{(Iter-1),AGE} \times \omega_s \quad (9)$
In Eqs. (8) and (9), $\vec{S}_m^{Iter,AGE}$ and $s_m^{Iter,AGE}$ indicate the social-motion vector and orientation of the mth horse in iteration Iter. Through the parameter $\omega_s$, the orientation toward the herd decreases in each iteration; N is the total number of horses in the herd. Horses mimic one another, picking up both good and harmful habits and behaviors, such as scouting out the best grazing spot. This behavior, likewise modeled by HHOA and denoted by I, can be expressed by Eqs. (10) and (11):
$\vec{I}_m^{Iter,AGE} = i_m^{Iter,AGE} \left[ \left( \frac{1}{pN} \sum_{j=1}^{pN} \hat{X}_j^{(Iter-1)} \right) - X_m^{(Iter-1)} \right], \quad AGE = \gamma \quad (10)$

$i_m^{Iter,AGE} = i_m^{(Iter-1),AGE} \times \omega_i \quad (11)$

In Eqs. (10) and (11), $\vec{I}_m^{Iter,AGE}$ is the movement vector of the mth horse in the direction of the best horses. The value pN is the number of best horses (horses in the best locations), where p is ten percent of the total number of horses, and $\omega_i$ is the per-iteration decrease factor. Horses' main defensive instincts are running and bucking. Equations (12) and (13) express the defense mechanism of individuals in HHOA; the coefficient d is defined to keep horses away from incorrect positions:
$\vec{D}_m^{Iter,AGE} = -d_m^{Iter,AGE} \left[ \left( \frac{1}{qN} \sum_{j=1}^{qN} \check{X}_j^{(Iter-1)} \right) - X_m^{(Iter-1)} \right], \quad AGE = \alpha, \beta, \gamma \quad (12)$

$d_m^{Iter,AGE} = d_m^{(Iter-1),AGE} \times \omega_d \quad (13)$


In Eqs. (12) and (13), $\vec{D}_m^{Iter,AGE}$ is the escape vector of the mth horse away from the worst positions, which are indicated by $\check{X}$. The value qN is the number of worst horses, where q is set to twenty percent of all horses, and $\omega_d$ is the per-iteration decrement factor. Finally, HHOA simulates roaming as a random movement, expressed by the factor r in Eqs. (14) and (15):
$\vec{R}_m^{Iter,AGE} = r_m^{Iter,AGE} \, p \, X^{(Iter-1)}, \quad AGE = \gamma, \delta \quad (14)$

$r_m^{Iter,AGE} = r_m^{(Iter-1),AGE} \times \omega_r \quad (15)$

In Eqs. (14) and (15), $\vec{R}_m^{Iter,AGE}$ is the random velocity vector of the mth horse, and $\omega_r$ is the decrement factor of $r_m^{Iter,AGE}$.
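To make the update rules concrete, the following is a minimal NumPy sketch of one HHOA update for a single γ-aged horse, discretized at the end by rounding and clipping the position onto an integer input domain (since test inputs must be concrete values). It is an illustrative reading of Eqs. (2)–(15) under simplifying assumptions (fixed coefficients, one age class, a placeholder fitness vector), not the authors' implementation:

    import numpy as np

    rng = np.random.default_rng(1)

    # Population of horses: each row is a candidate test-input vector (2 inputs).
    N, DIM = 40, 2
    X = rng.uniform(0, 100, size=(N, DIM))
    g, h, s, i_c, d, r = 1.5, 0.9, 0.2, 0.1, 0.2, 0.05   # assumed behavior coefficients
    p, q = 0.1, 0.2                                       # best / worst fractions

    def gamma_velocity(m, X, fitness):
        order = np.argsort(fitness)[::-1]                 # descending fitness
        best = X[order[: int(p * N)]]                     # pN best horses
        worst = X[order[-int(q * N):]]                    # qN worst horses
        x_star = X[order[0]]                              # herd leader
        G = g * rng.uniform(0.95, 1.05, DIM) * X[m]       # grazing, Eqs. (4)-(5) form
        H = h * (x_star - X[m])                           # hierarchy, Eq. (6)
        S = s * (X.mean(axis=0) - X[m])                   # sociability, Eq. (8)
        I = i_c * (best.mean(axis=0) - X[m])              # imitation, Eq. (10)
        D = -d * (worst.mean(axis=0) - X[m])              # defense, Eq. (12)
        R = r * rng.standard_normal(DIM) * X[m]           # roaming, Eq. (14)
        return G + H + S + I + D + R                      # Eq. (3), gamma row

    fitness = rng.random(N)                               # placeholder fitness values
    V = gamma_velocity(0, X, fitness)
    X[0] = np.clip(np.round(X[0] + V), 0, 100)            # Eq. (2) + discretization

In a full run the coefficients would decay by their ω factors each iteration and the fitness would come from the branch-coverage function of Sect. 3.2.3.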

3.2.3 Fitness function

An objective (fitness) function was constructed to assess the candidate solutions and select the best one; selecting a suitable fitness function is a crucial step in solving optimization problems. Suppose there are s branches in the input program, indexed by the variable $bch_i$. The test set, abbreviated TS, is the collection of generated test data in this study, and $X_k \in TS$ ($1 \le k \le m$) denotes one test datum when there are m of them. The fitness of a single test datum is determined using Eq. (16):

$Fitness(X_k) = \frac{1}{\left[ \delta + \sum_{i=1}^{s} w_i \, f(bch_i, X_k) \right]^2} \quad (16)$

Here δ is a constant with value 0.01, and f is the branch-distance function. Table 5 is used to compute the branch distance of a branch instruction; the conditional expression of the branch statement determines the value of its distance function. Each test suite's (TS) fitness is quantified using Eq. (17).

Table 5  Calculating the distance value of a branch instruction [11]

No | bch_i (predicate) | Branch distance f(bch_i)
1 | Boolean | If the predicate is true, f(bch_i) = 0; else f(bch_i) = δ
2 | p = q | If abs(p − q) = 0, f(bch_i) = 0; else f(bch_i) = abs(p − q) + δ
3 | p ≠ q | If abs(p − q) ≠ 0, f(bch_i) = 0; else f(bch_i) = δ
4 | p < q | If p − q < 0, f(bch_i) = 0; else f(bch_i) = abs(p − q) + δ
5 | p ≤ q | If p − q ≤ 0, f(bch_i) = 0; else f(bch_i) = abs(p − q) + δ
6 | p > q | If q − p < 0, f(bch_i) = 0; else f(bch_i) = abs(q − p) + δ
7 | p ≥ q | If q − p ≤ 0, f(bch_i) = 0; else f(bch_i) = abs(q − p) + δ
8 | p and q | f(bch_i) = f(p) + f(q)
9 | p or q | f(bch_i) = min(f(p), f(q))

$Fitness(TS) = \frac{1}{\left[ \delta + \sum_{i=1}^{s} w_i \min_{k=1}^{m} \{ f(bch_i, X_k) \} \right]^2} \quad (17)$

To assess the test set's coverage, the distance function of Table 5 is employed for each of the s branches. As Table 5 illustrates, the distance is zero if the predicate used in the branch instruction evaluates to true on the test data; otherwise, δ (= 0.01) plus the corresponding variable distance is added to the distance value. Equation (17) attains its maximum value when the test set (TS) covers every branch, achieving maximum coverage. The branch weight $w_i$ in Eq. (17) is the coefficient of the distance function. A branch instruction's weight is a weighted combination of its nesting-level weight and its predicate weight, calculated using Eq. (18); the balance factor λ was set to 0.5 in the experiments. The following factors determine a branch's weight:

• Nesting level (branch level)
• Weight of the expression

The nesting level (branch level) is the first component of the branch weight, according to Eq. (19); a branch with more nesting levels is harder to cover. The variable i ($1 \le i \le s$) indexes a branch, and $nl_i$ is the nesting level of branch i. The lowest nesting level of the program's branches is represented by $nl_{min}$, which is 1 in general, and $nl_{max}$ is the highest branch level. Equation (20) is used to normalize the branch-level weight.

$w_i = \lambda \, w_n'(bch_i) + (1 - \lambda) \, w_p'(bch_i) \quad (18)$

$w_n(bch_i) = \frac{nl_i - nl_{min} + 1}{nl_{max} - nl_{min} + 1} \quad (19)$

$w_n'(bch_i) = \frac{w_n(bch_i)}{\sum_{i=1}^{s} w_n(bch_i)} \quad (20)$

The weight of an expression, the second component of the branch weight, indicates its complexity. Table 6 and Eq. (21) are used to calculate the expression weight of a branch. If the branch statement is made up of h predicates joined by the "and" operator, then the weight of the statement is the square root of the sum of the squared predicate weights; when the h predicates of a branch statement are joined by the "or" operator, the weight of the expression is the minimum of the predicate weights. Equation (22) is utilized to normalize the expression weight.


Table 6  The weights of operators in a predicate [11]

Operator | Weight
"==" | 0.9
"!=" | 0.2
Boolean | 0.5
"<", ">", "<=", ">=" | 0.6

$w_p(bch_i) = \begin{cases} \sqrt{\sum_{j=1}^{h} w_r^2(c_j)}, & \text{if the conjunction is "and"} \\ \min_j \{ w_r(c_j) \}, & \text{if the conjunction is "or"} \end{cases} \quad (21)$

$w_p'(bch_i) = \frac{w_p(bch_i)}{\sum_{i=1}^{s} w_p(bch_i)} \quad (22)$

In Eq. (21), h is the number of predicates in the ith branch ($bch_i$), and $c_j$ denotes the jth conditional predicate of the corresponding branch ($1 \le j \le h$). In Table 6, $w_r$ gives the weight of an operator in a predicate; the table lists the operators that can appear in the predicates of real-world programs.
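A compact sketch of the branch-distance and fitness computations (Table 5 and Eqs. (16)–(17)) follows. It assumes uniform branch weights ($w_i = 1$) and numeric predicates represented as (p, op, q) triples; both simplifications depart from the paper's weighted formulation:

    DELTA = 0.01  # the constant delta of Eqs. (16)-(17)

    def branch_distance(p, op, q):
        # Table 5 distances for numeric predicates (simplified sketch).
        if op == "==":
            return 0.0 if p == q else abs(p - q) + DELTA
        if op == "!=":
            return 0.0 if p != q else DELTA
        if op == "<":
            return 0.0 if p < q else abs(p - q) + DELTA
        if op == "<=":
            return 0.0 if p <= q else abs(p - q) + DELTA
        if op == ">":
            return 0.0 if p > q else abs(q - p) + DELTA
        if op == ">=":
            return 0.0 if p >= q else abs(q - p) + DELTA
        raise ValueError(op)

    def fitness_of_suite(branch_predicates, test_suite):
        # Eq. (17) with w_i = 1: for each branch take the minimum distance over
        # all test data, sum over branches, add delta, invert the square.
        total = sum(min(branch_distance(*pred(x)) for x in test_suite)
                    for pred in branch_predicates)
        return 1.0 / (DELTA + total) ** 2

    # Example: two branches over one integer input x: (x == 10) and (x > 3).
    preds = [lambda x: (x, "==", 10), lambda x: (x, ">", 3)]
    print(fitness_of_suite(preds, [10, 4]))  # all branches covered -> maximum fitness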

4 Experiments and results

4.1 Experiment platform

The suggested method was implemented in MATLAB, together with previous test generators based on the SA [6], GA [6], PSO [8], ABC [12], and ACO [13] algorithms. All implemented test generators were executed on the same hardware and software platform to produce trustworthy findings; Table 7 lists the configuration parameters of the examined test-generation algorithms. A computer system with an Intel Core i7 processor and 8 GB of RAM was employed to evaluate the suggested method and the alternative approaches. In this study, the following evaluation criteria were used:

1. Coverage (AC): a metric that quantifies how well the generated test data covers the software's branches.
2. Success rate (SR): shows how likely the generated test data is to cover all of the program's branches.


Table 7  The values of the test-generator parameters

GA parameters: Pc = 0.8; Pm = 0.1; # iterations = 100
ACO parameters: τ (pheromone) = 1; Q = 1; α (pheromone weight) = 1; ρ (evaporation rate) = 0.1; # iterations = 100
PSO parameters: inertia weight = 0.8; C1 = 2.0; C2 = 2.0; # particles = 25; # iterations = 100
ABC parameters: Lbound = −3; Ubound = 100; # onlooker bees = 50% of population; # iterations = 100
HHOA parameters: # horses = 40; # iterations = 100; hγ = 0.50; hβ = 0.90; sγ = 0.10; sβ = 0.20; iγ, rδ = 0.10; rγ = 0.05; dβ = 0.20

3. Average generation (AG): the average number of iterations an algorithm requires to cover all program branches.
4. Average time (AT): the average time needed to attain maximum coverage.
5. Mutation score (MS): the probability that the test data will discover a fault; mutation-testing tools (such as MuJava) were used to inject faults into the program under test.

Numerous experiments were carried out to address the following research questions:

1. RQ1: Can the suggested test generator (ML + HHOA) achieve higher branch coverage?
2. RQ2: Can the suggested test generator enhance the success rate?
3. RQ3: Can the suggested test generator minimize the AG and AT?
4. RQ4: Is the suggested method capable of generating high-coverage and fault-detecting test data?

Each method was run 50 times for each benchmark program to find the average value of these criteria.
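Under this protocol (50 runs per method per benchmark), the evaluation metrics can be computed as in the following sketch; the run-record field names are hypothetical stand-ins for the quantities the paper measures:

    # A sketch of the evaluation metrics over 50 runs of one generator on one
    # benchmark. Each run record is assumed to hold: achieved branch coverage
    # (percent), iterations until best coverage, and wall-clock seconds.
    runs = [
        {"coverage": 100.0, "iterations": 5, "seconds": 3.2},
        {"coverage": 99.0,  "iterations": 9, "seconds": 4.1},
        # ... 48 more runs in a real experiment
    ]

    AC = sum(r["coverage"] for r in runs) / len(runs)                   # avg coverage
    SR = 100.0 * sum(r["coverage"] == 100.0 for r in runs) / len(runs)  # success rate
    AG = sum(r["iterations"] for r in runs) / len(runs)                 # avg generation
    AT = sum(r["seconds"] for r in runs) / len(runs)                    # avg time
    print(AC, SR, AG, AT)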

4.2 Benchmarks

Nine standard programs with varying degrees of complexity were utilized to assess the effectiveness of the suggested test generator and the other applied techniques; Table 8 lists the attributes of these benchmark programs, which were also employed in earlier research. The suggested test generator and the earlier techniques were used to generate test data for these benchmark programs, all run on the same hardware and software platform to obtain accurate results. Real programs have their source code organized into classes, functions, and components; by common programming standards, a function includes between 20 and 100 lines of code. The employed benchmark programs are widely used and well known in software-testing investigations. Furthermore, these benchmark programs cover every programming construct used to implement real and practical applications: they use all operators and statements relating to conditions, loops, logic, arithmetic, and jumps. The control-flow graphs created for the benchmarks show that their cyclomatic complexity is higher than that of typical real applications.

4.3 Evaluation of the results

4.3.1 Average coverage

A range of experiments was carried out to examine the suggested method against the criteria outlined in Sect. 4.1. The first metric is the average coverage (AC) of the generated test data. Each approach was run fifty times to find the average branch coverage. Figure 7 shows the average branch coverage on the various benchmark programs, based on test data collected over 50 runs of the test-generation methods. The outcomes demonstrate that the produced test data has larger coverage on most benchmark programs; test data with greater branch coverage improves fault identification. Compared on the AC metric, the test data created by the suggested method are more productive and efficient. The suggested approach, as well as the alternatives, was applied fifty times to the benchmark programs, and the average branch coverage obtained by executing the various methods is displayed in Fig. 8.

Table 8  Specification of the benchmark programs in the test-generation experiments

Program | # Arguments | Arg. type | Cyclomatic complexity | LOC | Specification
"TriangleType" | 3 | Int | 34 | 35 | Triangle classification
"CalDay" | 3 | Int | 22 | 72 | Day-of-week determination for an input date
"IsValidDate" | 3 | Int | 12 | 41 | Validating a date
"Cal" | 6 | Int | 20 | 40 | Determining the days between two dates
"reminder" | 2 | Int | 6 | 25 | Division of two numbers
"Bessj" | 2 | Real | 6 | 49 | Calculating the Bessel function
"printCalender" | 2 | Int | 8 | 60 | Printing the calendar
"ComplexMethod" | 3 | Bool, Int, String | 20 | 56 | Complex calculations on different data types
"Statistics" | 1 | Int array | 4 | 31 | Standard-deviation calculation


Fig. 7  Average branch coverage of test data created by different test generators (SA, GA, PSO, ACO, ABC, HHOA, ML + HHOA) over 50 runs

Fig. 8  Overall branch coverage of the different test-generation methods (SA 96.18%, GA 97.32%, PSO 99.57%, ACO 99.77%, ABC 99.84%, HHOA 99.93%, ML + HHOA 99.93%)

The suggested approach yields coverage rates of 99.95%, 99.98%, and 99.84% in the applications with higher cyclomatic complexity, such as Triangle, CalDay, and ComplexMethod. In the comparative application, the suggested approach attained an average coverage of 99.99%, compared to 99.90%, 99.94%, and 99.99% for the PSO, ACO, and ABC algorithms. In most benchmark programs, the suggested algorithm outperformed the alternative approaches. One of the suggested approach's primary benefits is eliminating insensitive instructions: insensitive instructions have a lower error-propagation rate and are identified by the first stage of the suggested method, so the suggested method tries to cover only the programs' sensitive instructions. The fault-detection likelihood of test data is a function of its coverage; the possibility of fault detection increases with coverage.


Fig. 9  The average success rate of different test generators (SA, GA, PSO, ACO, ABC, HHOA, ML + HHOA) over 50 runs

Fig. 10  Overall success rate of the different methods (SA 76.26%, GA 79.11%, PSO 96.43%, ACO 97.35%, ABC 97.80%, HHOA 98.52%, ML + HHOA 98.93%)

Regarding the results shown in Fig. 7, the average branch coverage of the method, when used with the ML-based slicer, is better than that of the pure HHOA in most cases; indeed, eliminating the non-error-propagating instructions improves the coverage of the test generators. Figure 8 displays the overall AC of the different methods. Both with and without the ML-based slicer, the suggested method produced test data with an average coverage of 99.93%, higher than the earlier methods; the AC of the test data generated by ACO, ABC, and PSO is roughly 99.77%, 99.83%, and 99.57%, respectively.

4.3.2 Success rate of the suggested method

A second set of experiments was carried out to address the second research question (SR). This investigation compared the suggested method's success rate with those of the related approaches. The success rate (SR) is the indicator of producing test data with 100% coverage: each test-data generator was run 50 times for every benchmark program, and the percentage of runs in which a method achieved maximum coverage is the SR of that method. Figure 9 displays the SR results. As per the outcomes, the ML + HHOA, HHOA, and ABC methods had greater SRs than the other approaches; the ML + HHOA approach outperforms the other techniques in terms of both AC and SR.
The results (shown in Fig. 10) demonstrate that the suggested method provides a 98.93% SR in the experiments; after ML + HHOA, the HHOA and ABC approaches outperform the others. In most benchmark programs, ML + HHOA outperformed the other algorithms, particularly in larger and more complicated programs, and the outcomes showed that PSO and ACO had higher success rates than SA and GA. In light of the SR criterion, it can be concluded that the suggested approach produces test data more efficiently than competing approaches.

4.3.3 Convergence criterion of the suggested method

Convergence speed is the third criterion, reflecting the time and effort required to generate test data. A valuable measure of algorithm convergence is the number of iterations needed to produce high-coverage test data: the fewer iterations required to obtain adequate test data, the more successful and efficient the process. The average convergence speed of the various techniques throughout 50 implementations is displayed in Fig. 11; the results demonstrate that the suggested test generator has a lower average convergence count in all benchmark programs. Consequently, the proposed method requires fewer rounds to generate test data with optimal coverage. Figure 11 displays the number of iterations needed to obtain maximum coverage over 50 executions, and Fig. 12 displays the average convergence speed of the various algorithms across all benchmark programs.

Fig. 11  The average number of iterations needed to generate maximum-coverage test data by the generators over 50 runs


Fig. 12  Overall convergence speed of the different methods (SA 30.83, GA 26.57, PSO 13.96, ACO 12.35, ABC 6.73, HHOA 6.41, ML + HHOA 5.75 iterations on average)

Fig. 13  The average execution time (seconds) of the different generators over 50 runs

The acquired results indicate that the optimal test data was achieved with an average of 5.74 iterations using the proposed approach, whereas HHOA without the ML-based slicer produces such test data after roughly 6.41 iterations. The term "optimal test data" in this study refers to the data with the highest branch coverage and fault-discovery rate. Therefore, the suggested ML + HHOA is quite efficient in terms of convergence speed.
The average execution time of each method was considered as an additional criterion. Figure 13 displays the average execution times over 50 distinct runs across the various benchmarks; the suggested strategy was compared with the other approaches using this metric and performed faster on many programs. Overall, it can be stated that the suggested ML + HHOA performs better in large and complex programs.


Table 9  Mutation operators used to create and inject faults into a program's source code

Operator | Specification
"AOR" | Replacement of arithmetic operators
"AOI" | Insertion of an arithmetic operator
"AOD" | Deletion of an arithmetic operator
"ROR" | Replacement of relational operators
"COR" | Replacement of conditional operators
"COI" | Insertion of a conditional operator
"COD" | Deletion of a conditional operator
"LOR" | Replacement of logical operators
"LOI" | Insertion of a logical operator
"LOD" | Deletion of a logical operator
"ASR" | Replacement of assignment operators
"VDL" | Variable deletion operator
"CDL" | Constant deletion operator
"ODL" | Operator deletion operator

4.3.4 Fault discovery rate of the suggested method

Mutation testing assessed whether the test data created by the suggested approach detects faults effectively. Using the MuJava tool, the program's source code was altered to contain many flaws (mutants) [19, 20], and the introduced faults were then sought using the supplied test data; MuJava assesses the test data's mutation score. Table 9 lists the mutation operators that were used to create faults in the software. Following fault injection, the faulty (buggy) program is tested using the created test data, and the test data's fault-detection rate (also known as the mutation score) is ascertained.
The mutation score (MS) represents each test set's capacity to identify the injected faults. Every mutation operator was applied to every instruction in the program's source code, so the likelihood of a fault injection is the same for every instruction. The mutation scores, i.e., each test set's capacity to identify faults, are displayed in Table 10. The outcomes validate the suggested method's value in producing fault-detecting test data: the proposed HHOA covered the hard-to-cover code that is not covered by SA, PSO, GA, and ACO.
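As a reference for how the scores in Table 10 are computed, the standard mutation-score formula is MS = (killed mutants / injected mutants) × 100. The sketch below illustrates it; the per-mutant kill vector is assumed to be precomputed by running the test suite against each mutant:

    # A minimal sketch of mutation-score computation. `mutant_killed` holds
    # one boolean per injected mutant, True when at least one test datum
    # produces an output differing from the original program's output.
    def mutation_score(mutant_killed: list) -> float:
        return 100.0 * sum(mutant_killed) / len(mutant_killed)

    # Example: 445 mutants injected into Triangle (cf. Table 13, MuJava row),
    # of which 414 are detected -> MS = 93% (cf. Table 10, ML + HHOA column).
    killed = [True] * 414 + [False] * 31
    print(round(mutation_score(killed)))  # 93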

Table 10  The mutation scores of different test generators

Generator | "triangleType" | "calDay" | "isValidDate" | "Cal" | "Reminder" | "printCalendar" | AVG
PSO | 68% | 43% | 82% | 85% | 80% | 88% | 74%
ACO | 67% | 57% | 86% | 90% | 83% | 86% | 78%
ABC | 80% | 48% | 88% | 92% | 81% | 88% | 80%
ML + HHOA | 93% | 59% | 90% | 92% | 86% | 92% | 85%


Table 11  Mutation-testing tools' specifications

Tool name | Mutation operators | Mutation level
MuJava | Class and function | Byte code
Muclipse | Class and function | Byte code
PITest | Function | Byte code
Jester | Function | Java code
Jumble | Function | Byte code
JavaLancer | Function | Byte code

Table 12  The second benchmark set

Program | Lines of code | Description
Bubble Sort | 21 | Sorting algorithm
Factorial | 18 | Factorial of a number
Sub2 | 26 | Square-root calculation
Fibonacci | 10 | Fibonacci number calculation
Perfect number | 21 | Perfect-number detection

Table 13  The number of faults injected into the programs by the mutation tools

Tool | "Bubble Sort" | "Factorial" | "Binary Search" | "Triangle" | "Sub2" | "Fibonacci" | "Perfect Number" | Sum
PITest | 13 | 10 | 16 | 44 | 23 | 9 | 9 | 124
Muclipse | 81 | 47 | 108 | 310 | 106 | 40 | 67 | 759
MuJava | 128 | 77 | 155 | 445 | 167 | 66 | 98 | 1136
Jester | 10 | 9 | 13 | 47 | 24 | 11 | 11 | 125
Jumble | 16 | 11 | 20 | 47 | 31 | 14 | 11 | 150
JavaLancer | 19 | 12 | 26 | 67 | 59 | 20 | 11 | 214

HHOA alone was also used to create test data for a further investigation of the proposed strategy's efficiency. Several mutation-testing tools (Jumble, MuJava, PITest, Muclipse, JavaLancer, Jester, and Judy) were used to assess the test data created by HHOA; the tools utilized for mutation analysis are listed in Table 11. A new set of experiments was employed to determine the fault-discovery rate of the testing methods using the various mutation tools. The programs of the second benchmark set are listed in Table 12.
Table 13 displays the number of inserted faults (mutants) produced by the various tools in the various programs; the tools introduce faults differing in number, type, and location, with Muclipse and MuJava creating the majority of the inserted faults. The outcomes of the ACO-based mutation testing on the second set of benchmarks are displayed in Fig. 14.
Fig. 14  The average mutation score of the test data produced by ACO, per benchmark program and mutation tool (PITest, Muclipse, MuJava, Jester, Jumble, JavaLancer)

[Bar chart: mutation score (MS) per benchmark program (BubbleSort, Matrix, Factorial, Sub2, Fibonacci, Perfect Number, AVG), with one series per mutation tool (PITest, Muclipse, MuJava, Jester, Jumble, JavaLancer)]

Fig. 15  The average mutation score of the produced test data by the PSO

Figures 15 and 16 show the mutation scores of the test data generated by PSO and by ML + HHOA. The results obtained with the different tools (Muclipse, PITest, Jester, MuJava, JavaLancer, and Jumble) confirm the efficiency of the suggested technique. The higher mutation score of the test data generated by ML + HHOA indicates the higher bug-discovery capability of the suggested ML + HHOA.


[Bar chart: mutation score (MS) per benchmark program (BubbleSort, Matrix, Factorial, Sub2, Fibonacci, Perfect Number, AVG), with one series per mutation tool (PITest, Muclipse, MuJava, Jester, Jumble, JavaLancer)]

Fig. 16  The average mutation score of the produced test data by the suggested ML + HHOA

[Bar chart: average mutation score per benchmark program (BubbleSort, Matrix, Factorial, Sub2, Fibonacci, Perfect Number, AVG) for PSO, ACO, and ML + HHOA]

Fig. 17  The average fault discovery rate of different test generators

Furthermore, the suggested method shows stable performance that does not depend on a specific tool.
The average mutation scores for the various test programs, as estimated by analyzing the test data from the several test generators, are displayed in Fig. 17. The suggested method finds a considerably larger percentage of faults than the other methods, even when different mutation tools are used. The average mutation score of the test data generated by ML + HHOA over all benchmarks is about 89.38%; this figure for the PSO- and ACO-based test generators is 67.72% and 73.14%, respectively. These results indicate that the suggested test generator is superior to the others. Different mutation testing tools employ different mutation operators, and the test data produced by the suggested approach can identify the various faults introduced by these different tools.


[Bar chart: average mutation score per mutation testing tool (PITest, Muclipse, MuJava, Jester, Jumble, JavaLancer, AVG) for PSO, ACO, and ML + HHOA]

Fig. 18  The average fault discovery rate of the test generators in different mutation testing tools

The superiority of the recommended method over other test-generation techniques is demonstrated by a comparative evaluation with various mutation testing tools. The average mutation score of the test data created by the different techniques across all mutation testing tools is displayed in Fig. 18; the average scores confirm the superiority of the suggested approach over the other methods.
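For clarity, the aggregation behind Figs. 17 and 18 can be sketched in Python as follows; the nested score values are illustrative placeholders, not the measured data.

# Per-program averages (the Fig. 17 view) and per-tool averages (the
# Fig. 18 view) of mutation scores; all numbers below are placeholders.
scores = {  # scores[tool][program] = mutation score of one generator's test data
    "MuJava": {"BubbleSort": 0.93, "Fibonacci": 0.90},
    "Jester": {"BubbleSort": 0.88, "Fibonacci": 0.85},
}

def per_program_avg(scores):
    # Average each program's score over all mutation tools (Fig. 17 view)
    programs = {p for tool in scores.values() for p in tool}
    return {p: sum(t[p] for t in scores.values()) / len(scores) for p in programs}

def per_tool_avg(scores):
    # Average each tool's score over all programs (Fig. 18 view)
    return {name: sum(t.values()) / len(t) for name, t in scores.items()}

print(per_program_avg(scores))  # e.g. {'BubbleSort': 0.905, 'Fibonacci': 0.875}
print(per_tool_avg(scores))     # e.g. {'MuJava': 0.915, 'Jester': 0.865}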

4.3.5 Comparing HHOA and ML + HHOA

Slicing the input program before test generation is the main advantage of ML + HHOA over the pure HHOA-based method. In the first stage of ML + HHOA, the developed ML-based classifier categorizes the program instructions into two classes, and the instructions classified as non-sensitive are eliminated before the test-generation stage. The sliced program produced by the developed ML-based slicer is then covered by the test data generated by the HHOA.

[Bar chart: number of mutants generated per program (BubbleSort, Factorial, Fibonacci, TriangleType, PerfectNumber, reminder) under HHOA and ML + HHOA; e.g., 445 versus 390 mutants for TriangleType and 128 versus 102 for BubbleSort]

Fig. 19  The number of generated mutants for different programs to evaluate the test data generated by the non-slice-based method (HHOA) and the slice-based method (ML + HHOA)


[Bar chart: mutant reduction percentage per program (BubbleSort, Factorial, Fibonacci, TriangleType, PerfectNumber, reminder), ranging from about 10% to 20%]

Fig. 20  The mutant reduction percentage provided by the slice-based method (ML + HHOA)

In the third stage, the sliced program, rather than the original program, is targeted for the mutation test. Figure 19 indicates the number of mutants generated by MuJava for the original program and for the program sliced by the suggested ML-based classifier. In the suggested ML + HHOA, only the sensitive instructions are considered for bug injection; hence, the number of injected bugs is reduced. Based on the results shown in Fig. 20, the proposed ML-based slicer significantly reduces the number of produced mutants.
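The slice-then-mutate flow can be sketched as follows; the RandomForest choice, the feature vectors, and all names are assumptions made for illustration, not the classifier reported in this work.

# Minimal sketch of the two-stage pipeline, assuming per-instruction
# feature vectors and binary sensitivity labels are available.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_slicer(features, labels):
    # Train a classifier that flags error-propagating (sensitive) instructions
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(features, labels)
    return clf

def slice_program(clf, instructions, features):
    # Keep only the instructions predicted to propagate errors; mutants are
    # later generated only for these, which is why ML + HHOA yields fewer
    # mutants than plain HHOA (Figs. 19 and 20)
    keep = clf.predict(features)
    return [ins for ins, k in zip(instructions, keep) if k == 1]

rng = np.random.default_rng(0)
X = rng.random((50, 4))            # toy per-instruction feature vectors
y = (X[:, 0] > 0.5).astype(int)    # toy sensitivity labels
clf = train_slicer(X, y)
instructions = [f"ins_{i}" for i in range(50)]
sliced = slice_program(clf, instructions, X)
print(f"{len(sliced)}/{len(instructions)} instructions retained for mutation")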
Figure 21 shows the source code of the TriangleType benchmark program and its hard-to-cover code, which was covered by the proposed test generator. The suggested method can cover hard-to-cover code in different programs that the other test generators can hardly reach. The number on the left side of each instruction indicates its number of executions in the experiments. The unexecuted (highlighted) instructions are complicated, and the other methods hardly cover them; meanwhile, the suggested ML + HHOA can cover them. Table 14 reports an ANOVA (analysis of variance), used as a statistical test to compare the success rates of the different test-generation techniques. The p-value is 0.000006 and is lower than 0.05; hence, the null hypothesis is rejected, and there is a statistically significant difference between the performance of the suggested method and that of the previous methods.
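The F statistic and p-value in Table 14 can be reproduced from the reported sums of squares; since the raw per-run success rates are not published, the sketch below only verifies the table’s aggregates.

# Recompute the one-way ANOVA quantities of Table 14 from its sums of squares
from scipy.stats import f

ss_between, df_between = 2329.012, 3
ss_within,  df_within  = 1798.8861, 32

ms_between = ss_between / df_between       # ~776.337
ms_within  = ss_within / df_within         # ~56.215
F = ms_between / ms_within                 # ~13.8101

p_value = f.sf(F, df_between, df_within)   # upper tail of the F distribution
print(f"F = {F:.4f}, p = {p_value:.6f}")   # p ~ 0.000006 < 0.05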

4.3.6 Merits and challenges of the method

Advantages: Eliminating insensitive instructions is one of the main advantages of the proposed method; the first step of the proposed method found that insensitive instructions have a lower error-propagation rate. The coverage of the generated test data determines its fault-detection likelihood. The findings show that the proposed approach yields a 98.93% success rate, which is better than the other algorithms on the majority of the benchmark programs, especially on larger and more complex projects; the results also showed that PSO and ACO had greater success rates than SA and GA.


Fig. 21  The hard-to-cover codes of the TriangleType program, which were covered by the suggested test
generator

Table 14  Statistical test (ANOVA) used to compare the SR of different test generators

Source              SS         df  MS
Between-treatments  2329.012   3   776.337   F = 13.8101
Within-treatments   1798.8861  32  56.215    P = 0.000006
Total               4127.8984  35  117.94

According to the obtained results, the suggested method required an average of 5.74 iterations to reach the best test data, whereas HHOA

without an ML-based slicer needed about 6.41 iterations to generate its best test data. Owing to the slicing away of the ineffective instructions, the proposed test generator performed faster on the majority of the benchmark programs. The results confirm the usefulness of the proposed approach in generating data for fault identification. The hard-to-cover code that is not covered by SA, PSO, GA, and ACO was addressed by the proposed HHOA. The proposed method, even when evaluated with various mutation tools, discovers a higher percentage of faults than the other methods; the average mutation score of the test data produced by ML + HHOA over all benchmarks is approximately 89.38%.
Disadvantages: The suggested instruction classifier produces false alarms; indeed, some non-sensitive instructions were classified as sensitive, and some sensitive instructions were classified as non-sensitive. Since the suggested test generator relies on randomness and approximation, it does not guarantee 100% coverage. If a function contains deeply nested conditions, the algorithm might struggle to generate test inputs that reach all execution paths. The suggested HHOA-based test generator works best with numerical or structured input spaces but struggles with GUI testing, event-driven systems, and complex input types. The other disadvantage of the suggested two-stage test generator is its computational cost: the ML-based instruction classification together with the HHOA test generator imposes a time overhead.
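The classifier’s false alarms can be quantified with a standard confusion matrix, as in the sketch below; the label vectors are toy data, not the measured classification results of this study.

# Illustrative false-alarm and miss rates for the instruction classifier
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # 1 = truly sensitive instruction
y_pred = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]  # classifier output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_alarm = fp / (fp + tn)  # non-sensitive kept for testing (wastes effort)
miss_rate   = fn / (fn + tp)  # sensitive sliced away (may hide real faults)
print(f"false-alarm rate = {false_alarm:.2f}, miss rate = {miss_rate:.2f}")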

5 Conclusion

Automatically generating software test data with optimal branch coverage and fault-detection rate is an NP-complete challenge, and every heuristic test-generation strategy on this subject has benefits and drawbacks. This work creates software test data automatically by combining HHOA and ML. The ML model was used to identify the non-error-propagating instructions of the input program. The main merits of the suggested method are avoiding the coverage of non-error-propagating instructions, maximizing the coverage of error-propagating instructions, and maximizing the fault-discovery capability and success rate. The method was evaluated using several standard software programs and fault-injection tools, including PITest, Jester, MuJava, Muclipse, and Jumble. The results indicate that, in terms of average coverage, success rate, average execution time, and fault-detection capability, the suggested method performs better than the existing algorithms: ML + HHOA detects around 89.40% of the faults injected into the benchmark software by the various tools, and the suggested approach attains 99.93% coverage and a 98.93% success rate.
The suggested method’s starting population is created randomly; chaos theory [22] can be applied to enhance the performance of the proposed HHOA in the test-generation problem. Considering the difficult-to-detect faults in the fitness function is advised as another future study. One possible future direction is the investigation of alternative hybrid evolutionary algorithms, suggested by references [20, 21, 23], in the test-generation problem. The extension of the suggested test generator to cover only the faulty modules identified by fault-prediction methods [26–28] is another future study.


Author contributions B. Arasteh and K. Arasteh performed method suggestion, algorithm design, coding, and implementation. B. Arasteh and A. Ghaffari performed data gathering and data analysis. Experiments, results analysis, and manuscript writing were performed by B. Arasteh, K. Arasteh, and A. Ghaffari.

Funding Open access funding provided by the Scientific and Technological Research Council of Türkiye
(TÜBİTAK).

Data availability statement The data and subject programs created throughout the investigation are publicly available on Google Drive: https://drive.google.com/drive/folders/1n8oXY2k1UPwipU5Z5PmqUAmbk8ku9XCa?usp=drive_link

Declarations
Conflict of interest The authors declare that throughout this research they did not receive any funds or grants. There are no financial or non-financial conflicts of interest that the authors need to report.

Ethical approval The data and information utilized in this study were created by the researchers themselves; they do not belong to any other individual or organization. Other researchers will have access to the research’s data. The authors from Iran are not employed by the Iranian government and do not have any governmental job or duty; they are preparing articles in their personal capacity.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permis-
sion directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/
licenses/by/4.0/.

References
1. Nouwou Mindom PS, Nikanjam A, Khomh F (2023) A comparison of reinforcement learning frameworks for software testing tasks. Empir Softw Eng 28:111. https://doi.org/10.1007/s10664-023-10363-2
2. Lin JC, Yeh PL (2001) Automatic test data generation for path testing using GAs. J Inform Sci 131(1):47–64
3. Khatun S, Rabbi KF, Yaakub CY, Klaib MFJ (2011) A random search based effective algorithm for pairwise test data generation. In: International Conference on Electrical, Control and Computer Engineering (InECCE), pp 293–297. https://doi.org/10.1109/INECCE.2011.5953894
4. Eler MM, Endo AT, Durelli VHS (2016) An empirical study to quantify the characteristics of Java programs that may influence symbolic execution from a unit testing perspective. J Syst Softw 121:281–297
5. Cadar C, Sen K (2013) Symbolic execution for software testing: three decades later. Commun ACM 56(2):82–90
6. Cohen MB, Colbourn CJ, Ling ACH (2003) Augmenting simulated annealing to build interaction test suites. In: Proceedings of the Fourteenth International Symposium on Software Reliability Engineering (ISSRE’03), pp 394–405
7. Sharma C, Sabharwal S, Sibal R (2014) A survey on software testing techniques using genetic algorithm. Int J Comput Sci 10(1):381–393
8. Esnaashari M, Damia AH (2021) Automation of software test data generation using genetic algorithm and reinforcement learning. Expert Syst Appl 183:115446
9. Mao C (2014) Generating test data for software structural testing based on particle swarm optimization. Arab J Sci Eng 39(6):4593–4607
10. Kaur A, Bhatt D (2011) Hybrid particle swarm optimization for regression testing. Int J Comput Sci Eng 3(5):1815–1824
11. Ahmed BS, Zamli KZ (2011) A variable strength interaction test suites generation strategy using particle swarm optimization. J Syst Softw 84:2171–2185
12. Ghaemi A, Arasteh B (2020) SFLA-based heuristic method to generate software structural test data. J Softw Evol Process 32(1)
13. Aghdam ZK, Arasteh B (2017) An efficient method to generate test data for software structural testing using artificial bee colony optimization algorithm. Int J Softw Eng Knowl Eng 27(6):951–966
14. Mao C, Xiao L, Yu X, Chen J (2015) Adapting ant colony optimization to generate test data for software structural testing. Swarm Evol Comput 20:23–36
15. Arasteh B, Hosseini SMJ (2022) Traxtor: an automatic software test suit generation method inspired by imperialist competitive optimization algorithms. J Electron Test 38:205–215. https://doi.org/10.1007/s10836-022-05999-9
16. Martou P, Mens K, Duhoux B, Legay A (2023) Test scenario generation for feature-based context-oriented software systems. J Syst Softw 197:111570. https://doi.org/10.1016/j.jss.2022.111570
17. Sulaiman RA, Jawawi DN, Halim SA (2023) Cost-effective test case generation with the hyper-heuristic for software product line testing. Adv Eng Softw 175:103335. https://doi.org/10.1016/j.advengsoft.2022.103335
18. Arasteh B, Arasteh K, Kiani F, Sefati SS, Fratu O, Halunga S, Tirkolaee EB (2024) A bioinspired test generation method using discretized and modified bat optimization algorithm. Mathematics 12:186. https://doi.org/10.3390/math12020186
19. MiarNaeimi F, Azizyan G, Rashki M (2021) Horse herd optimisation algorithm: a nature-inspired algorithm for high-dimensional optimisation problems. Knowl-Based Syst 213:106711. https://doi.org/10.1016/j.knosys.2020.106711
20. Hosseini MJ, Arasteh B, Isazadeh A, Mohsenzadeh M, Mirzarezaee M (2020) An error-propagation aware method to reduce the software mutation cost using genetic algorithm. Data Technol Appl 55(1):118–148. https://doi.org/10.1108/DTA-03-2020-0073
21. Shomali N, Arasteh B (2020) Mutation reduction in software mutation testing using firefly optimization algorithm. Data Technol Appl 54(4):461–480. https://doi.org/10.1108/DTA-08-2019-0140
22. Gharehchopogh FS, Abdollahzadeh B, Arasteh B (2023) An improved farmland fertility algorithm with hyper-heuristic approach for solving travelling salesman problem. CMES-Comput Model Eng Sci 135(3):1981–2006. https://doi.org/10.32604/cmes.2023.024172
23. Arasteh B, Sadegi R, Arasteh K (2021) Bölen: software module clustering method using the combination of shuffled frog leaping and genetic algorithm. Data Technol Appl 55(2):251–279. https://doi.org/10.1108/DTA-08-2019-0138
24. Arasteh B (2022) Clustered design-model generation from a program source code using chaos-based metaheuristic algorithms. Neural Comput Appl. https://doi.org/10.1007/s00521-022-07781-6
25. Arasteh B, Abdi M, Bouyer A (2022) Program source code comprehension by module clustering using combination of discretized gray wolf and genetic algorithms. Adv Eng Softw 173:103252. https://doi.org/10.1016/j.advengsoft.2022.103252
26. Arasteh B, Bouyer A, Pirahesh S (2015) An efficient vulnerability-driven method for hardening a program against soft-error using genetic algorithm. Comput Electr Eng 48:25–43. https://doi.org/10.1016/j.compeleceng.2015.09.020
27. Arasteh B, Arasteh K, Ghaffari A et al (2024) A new binary chaos-based metaheuristic algorithm for software defect prediction. Cluster Comput. https://doi.org/10.1007/s10586-024-04486-4
28. Khleel NAA, Nehéz K (2023) Software defect prediction using a bidirectional LSTM network combined with oversampling techniques. Cluster Comput. https://doi.org/10.1007/s10586-023-04170-z
29. Malhotra R, Bansal A, Kessentini M (2024) Deployment and performance monitoring of docker based federated learning framework for software defect prediction. Cluster Comput. https://doi.org/10.1007/s10586-024-04266-0

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.


Authors and Affiliations

Bahman Arasteh 1,2,3 · Keyvan Arasteh 1 · Ali Ghaffari 1,2

* Bahman Arasteh
  [email protected]

1 Department of Software Engineering, Faculty of Engineering and Natural Science, Istinye University, Istanbul, Türkiye
2 Department of Computer Science, Khazar University, Baku, Azerbaijan
3 Applied Science Research Center, Applied Science Private University, Amman, Jordan
