Unit Test Generation Using Machine Learning
Master Thesis
Laurence Saes
[email protected]
Test suite generators could help software engineers ensure software quality by detecting software
faults. These generators can be applied to software projects that do not have an initial test suite: a
test suite can be generated and then maintained and optimized by the developers. Testing helps to
check whether a program works and whether it continues to work after changes. This helps to prevent
software from failing and aids developers in applying changes while minimizing the possibility of
introducing errors in other (critical) parts of the software.
State-of-the-art test generators are still only able to capture a small portion of potential software
faults. The Search-Based Software Testing 2017 workshop compared four unit test generation tools.
These generators were only capable of achieving an average mutation coverage below 51%, which is
lower than the score of the initial unit test suite written by software engineers.
We propose a test suite generator driven by neural networks, which has the potential to detect
mutants that could otherwise only be detected by manually written unit tests. In this research, multiple
networks, trained on open-source projects, are evaluated on their ability to generate test suites.
The dataset contains unit tests and the code they test. The unit test method names are used to
link unit tests to the methods under test.
With our linking mechanism, we were able to link 27.41% (36,301 out of 132,449) of the tests. Our
machine learning model generated parsable code 86.69% of the time (241 out of 278 predictions). This
high rate of parsable code indicates that the neural network learned patterns between code and tests,
which suggests that neural networks are applicable to test generation.
ACKNOWLEDGMENTS
This thesis is written for my software engineering master project at the University of Amsterdam.
The research was conducted at Info Support B.V. in The Netherlands for the Business Unit Finance.
First, I would like to thank my supervisors Ana Oprescu of the University of Amsterdam and Joop
Snijder of Info Support. Ana Oprescu was always available to give me feedback and guidance, and helped
me a lot with finding optimal solutions to the problems I encountered. Joop Snijder gave me a lot
of advice and background in machine learning. He helped me to understand how machine learning
could be applied in the project and which methods have great potential.
I would also like to thank Terry van Walen and Clemens Grelck for their help and support during
this project. The brainstorm sessions with Terry were very helpful and gave many new insights into
solutions for the problems I encountered. I am also grateful for Clemens' help in preparing me for
the conferences. My presentation skills have improved a lot, and I really enjoyed the opportunity.
Finally, I would like to thank the University of Amsterdam and Info Support B.V. for all their
help and funding, which allowed me to present at CompSys in The Netherlands and SATToSE in Greece. This
was a great learning experience, and I am very grateful to have had these opportunities.
Contents

Abstract
Acknowledgments
1 Introduction
1.1 Types of testing
1.2 Neural networks
1.3 Research questions
1.4 Contribution
1.5 Outline
2 Background
2.1 Test generation
2.1.1 Test oracles
2.2 Code analysis
2.3 Machine learning techniques
4 Evaluation Setup
4.1 Evaluation
4.1.1 Metrics
4.1.2 Measurement
4.1.3 Comparing machine learning models
4.2 Baseline
5 Experimental Setup
5.1 Data collection
5.1.1 Additional project criteria
5.1.2 Collecting projects
5.1.3 Training data
5.2 Extraction of training examples
5.2.1 Building the queue
5.3 Training machine learning models
5.3.1 Tokenized view
5.3.2 Compression
5.3.3 BPE
5.3.4 Abstract syntax tree
5.4 Experiments
5.4.1 The ideal subset of training examples and basic network configuration
5.4.2 SBT data representation
5.4.3 BPE data representation
5.4.4 Compression (with various levels) data representation
5.4.5 Different network configurations
5.4.6 Compression timing
5.4.7 Compression accuracy
5.4.8 Finding differences between experiments
6 Results
6.1 Linking experiments
6.1.1 Removing redundant tests
6.1.2 Unit test support
6.1.3 Linking capability
6.1.4 Total links
6.1.5 Linking difference
6.2 Experiments for RQ1
6.2.1 Naive approach
6.2.2 Training data simplification
6.2.3 Training data simplification follow-up
6.2.4 Combination of simplifications
6.2.5 Different data representations
6.2.6 Different network configurations
6.2.7 Generated predictions
6.2.8 Experiment analysis
6.3 Experiments for RQ2
6.3.1 Compression timing
6.3.2 Compression accuracy
6.4 Applying SBT in the experiments
6.4.1 Output length
6.4.2 Training
6.4.3 First steps to a solution
7 Discussion
7.1 Summary of the results
7.2 RQ1: What neural network solutions can be applied to generate test suites in order to achieve a higher test suite effectiveness for software projects?
7.2.1 The parsable code metric
7.2.2 Training on a limited sequence size
7.2.3 Using training examples with common subsequences
7.2.4 BPE
7.2.5 Network configuration of related research
7.2.6 SBT
7.2.7 Comparing our models
7.3 RQ2: What is the impact of input and output sequence compression on the training time and accuracy?
7.3.1 Training time reduction
7.3.2 Increasing loss
7.4 Limitations
7.4.1 The used mutation testing tool
7.4.2 JUnit version
7.4.3 More links for AST analysis depends on data
7.4.4 False positive links
7.5 The dataset has an impact on the test generator's quality
7.5.1 Generating calls to non-existing methods
7.5.2 Testing a complete method
7.5.3 The machine learning model is unaware of implementations
7.5.4 Too little data to statistically prove our models
7.5.5 Replacing manual testing
8 Related work
9 Conclusion
10 Future work
10.1 SBT and BPE
10.2 Common subsequences
10.2.1 Filtering code complexity with other algorithms
10.3 Promising areas
10.3.1 Reducing the time required to train models
10.3.2 Optimized machine learning algorithm
Bibliography

List of Figures

List of Tables
Chapter 1
Introduction
Test suites are used to ensure software quality when a program’s code base evolves. The capability
of a test suite to produce the desired effect (its effectiveness) is often measured as the ability to
uncover faults in a program [ZM15]. Although intensively researched [AHF+ 17, KC17, CDE+ 08, FZ12,
REP+ 11], state-of-the-art test suite generators still lack the test coverage that can be achieved with manual
testing. Almasi et al. [AHF+ 17] described a category of faults that are not detectable by these test
suite generators. Such faults are usually surrounded by complex conditions and statements for which
complex objects have to be constructed and populated with specific values.
The SBST workshop of 2017 held a competition of Java unit test generators, in which the effectiveness
of generated test suites and of manually written test suites was evaluated. The effectiveness of a
test suite is measured by its ability to find faults, expressed with the mutation score metric. The
ability to find faults can be measured with the mutation score because mutations are a valid substitute
for software faults [JJI+ 14]. The mutation score of a test suite represents the test suite’s ability to
detect syntactic variations of the source code (mutants), and is computed using a mutation testing
framework. In the workshop, manually written test suites scored on average 53.8% mutation coverage,
while the highest score obtained by a generated test suite was 50.8% [FRCA17].
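For reference, the mutation score used in this comparison is, in its common textbook form, the fraction of generated mutants that the test suite detects (kills); how equivalent mutants are handled depends on the mutation testing tool:

\text{mutation score} = \frac{\text{number of killed mutants}}{\text{number of generated mutants}}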
However, even when all generated mutants are detected, it is impossible to conclude that all possible
mutants would be detected. It is more likely that the mutation testing tool missed mutants, since it
is impossible to know whether all non-redundant mutants were generated: this list is infinite. Some
methods can have an infinite number of output values, and for each of them a mutant could be
introduced that changes only that output value.
We need to combine the ability of generated test suites to automatically test many different execution
paths with the capability of manually written test suites to test complex situations. Therefore,
we propose a test suite generator that uses machine learning techniques.
A neural network is a machine learning algorithm that can learn complex tasks without being
programmed with explicit rules. The network learns from examples and captures the logic contained in them.
Thus, new rules can be taught to the neural network simply by showing examples of how the translation
is done.
Our solution uses neural networks and combines manual and automated test suites by learning
patterns between tests and code to generate test suites with higher effectiveness.
specification has to be evaluated, while white box testing is more efficient in testing hidden logic.
Our unit test generator can be categorized as white box testing since we use the source code to
generate tests.
The configuration of the neural network cells is the logic behind the predictions. The configuration
has to be taught to the network by giving training examples. For example, we can teach the neural
network the concept of a house by giving examples of houses and non-houses. The non-houses are
needed to teach the difference between a house and something else. The neurons are configured in a
way to classify based on the training data. This configuration can later be used to make predictions
on unseen data.
For our research, we have to translate a sequence of tokens into another sequence of tokens. This
problem requires a particular type of neural network [CVMG+ 14]. A combination of multiple networks
with a specific type of cells is required for the translation [CVMG+ 14]. These cells are designed so
that they can learn long-term dependencies. Thus, during prediction, the context of the input
sequence is also considered. Cho et al. [CVMG+ 14] have designed a network with these specifications.
They have used a network with one encoder and one decoder. The encoder network translates the
variable length input sequences to a fixed size buffer. The decoder network translates this buffer into
a variable size output. We can use this setup for our translations by using the variable sized input
and output as our input and output sequences.
RQ1 What neural network solutions can be applied to generate test suites in order to achieve a higher
test suite effectiveness for software projects?
RQ2 What is the impact of input and output sequence compression on the training time and accuracy?
1.4 Contribution
In this work, we contribute an algorithm to link unit tests to the method under test, a training set
for translating code to tests with more than 52,000 training examples, software to convert code to
different representations and also support the translation back, and a neural network configuration
with the ability to learn patterns between code and tests. Finally, we also contribute a pipeline that
takes as input GitHub repositories and has as output a machine learning model that can be used to
predict tests. As far as we know, we are the first to perform experiments in this area. Therefore, the
linking algorithm and the neural network configuration can be used as a baseline for future research.
The dataset can also be used with various other types of machine learning algorithms for the development
of a test generator.
1.5 Outline
We address the background of test generation, code analysis, and machine learning in Chapter 2. In
Chapter 3, we discuss how a test generator that uses machine learning could be designed in general.
In Chapter 4, we list projects that can be used as an evaluation baseline, and we introduce metrics to
measure the progress of developing the test suite generator and how well it performs compared to
other generators on that baseline. How we develop our test generator can be found in Chapter 5. Our
results are presented in Chapter 6 and discussed in Chapter 7. Related work is listed in Chapter 8.
We conclude our work in Chapter 9. Finally, future work is outlined in Chapter 10.
Chapter 2
Background
Multiple approaches address the challenge of achieving a high test suite effectiveness. Tests could be
generated based on the project’s source code by analyzing all possible execution paths. An alternative
is using test oracles, which can be trained to distinguish between correct and incorrect method output.
Additionally, many code analysis techniques can be used to gather training examples and many
machine learning algorithms can be used to translate from and/or to code.
generate the call graph and provides functionality that can be applied on the graph. With the analysis
of source code, the source code in the representation of an abstract syntax tree (AST) is analyzed. For
AST analysis, JavaParser 2 can be used to construct an AST, and the library provides functionality
to perform operations on the tree.
2 https://javaparser.org/
Chapter 3
Our solution focuses on generating test suites for Java projects that have no test suite at all. The
solution requires the project’s method bodies and the names of the classes to which they belong. The
test generator sends the method bodies in a textual representation to the neural network to transform
them into test method bodies. The test generator places these generated methods in test classes.
The collection of all the new test classes is the new test suite. This test suite can be used to test the
project’s source code on faults.
In an ideal situation, a model is already trained. When this is not the case, then additional actions
are required. Training projects are selected to train the network to generate usable tests. For instance,
all training projects should use the same unit test framework. A unit test linking algorithm is used
to extract training examples from these projects. The found methods and the unit test method are
given as training examples to the neural network. The model can then be created by training a neural
network on these training examples. A detailed example of a possible flow can be found in Figure 3.1.
Figure 3.1: Possible development flow for the test suite generator
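As an illustration of that last assembly step, the sketch below wraps one predicted test body in a JUnit 4 test class using JavaParser, the AST library also used elsewhere in this thesis. The class name, method name, and helper are our own hypothetical choices, not part of the actual pipeline:

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.Modifier;
import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;
import com.github.javaparser.ast.body.MethodDeclaration;

public class TestClassAssembler {

    // Wraps one predicted test method body (including its braces) in a JUnit 4 test class.
    static String assembleTestClass(String classUnderTest, String predictedBody) {
        CompilationUnit cu = new CompilationUnit();
        cu.addImport("org.junit.Test");
        cu.addImport("org.junit.Assert", true, true); // static import of the assertion methods

        ClassOrInterfaceDeclaration testClass = cu.addClass(classUnderTest + "Test");
        MethodDeclaration testMethod = testClass.addMethod("generatedTest", Modifier.Keyword.PUBLIC);
        testMethod.addMarkerAnnotation("Test");
        testMethod.setBody(StaticJavaParser.parseBlock(predictedBody));
        return cu.toString(); // the source of one generated test class
    }
}

Calling assembleTestClass("Calculator", "{ assertEquals(2, new Calculator().add(1, 1)); }") would, under these assumptions, yield a compilable CalculatorTest class.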
not be included because a mismatch in behavior could teach the neural network patterns that prevent
testing correctly. So, the tests have to be analyzed in order to filter out tests that fail. For the
filtering, we execute the unit tests from the projects, analyze the reports, and extract all the tests that
succeed.
valid to perform the operation that is tested, as now additional statements are included that have
nothing to do with the method under test.
Listing 3.1: Unit tests that will be incorrectly linked without statement elimination
1 push()
2 getList() [assign to AList]
3 ...
4 new ArrayList<>() [assign to tmp] || new LinkedList() [assign to tmp]
5 return tmp
6
7 AList.empty() [assign to tmp2]
8 assertEquals(tmp2)
Resolving concrete types is impossible with AST analysis. It is possible to list what concrete
classes implement the List interface. However, when this is used as the candidate list during the
linking process, it could result in false positive matches. The candidate list could contain classes that
were not used. This makes it impossible to support interfaces and abstract methods with the AST
analysis.
An advantage of the AST analysis is that it does not require the project’s bytecode, meaning that
the project does not have to be compilable. The code could also be partially processed because only
a single class and some class dependencies have to be analyzed. Partial processing reduces the chance
of unsupported classes since less has to be analyzed. With bytecode analysis, all dependencies and
every class required to build the call graph have to be present.
Chapter 4
Evaluation Setup
For the evaluation of our approach, we introduce three goal metrics that indicate how far we
are from generating working code and, ultimately, code that can find bugs. However, the results of these metrics could
be biased because we are using machine learning. The nodes of a neural network are initialized with
random values before training starts. From this point, the network is optimized so that it can predict
the training set as well as possible. Different initial values will result in different clusters within the
neural network, which impacts the prediction capabilities. This means that metric scores could be due
to chance. In this chapter, we address this issue.
We created a baseline with selected test projects to enable comparisons of our results with the
generated test suites of alternative test generators and with manually written test suites. This baseline
is only needed once unit tests can actually be generated; until then, these projects are not required and
any method can be evaluated, because the effectiveness of the tests cannot yet be calculated.
In our research, for RQ1 we perform multiple experiments with different configurations
(different datasets and different network configurations) and have to show that a change in the
configuration results in a higher metric score. For RQ2, we test whether compression increases the
accuracy and decreases the required training time.
4.1 Evaluation
The testing capability of the generated suite should be evaluated once the generator can produce test code. Nevertheless,
while the test generator is in a phase where it cannot yet produce valid tests, a simpler metric should
be applied that does not measure testing capability but does qualify how far we are from
generating executable code, because code that is not executable cannot test anything. This
set of metrics enables us to make comparisons over the whole development of the test
generator.
4.1.1 Metrics
The machine learning models can be compared on their ability to generate parsable code (parsable rate),
compilable code (compilable rate), and code that can detect faults (mutation score). The parsable
rate and compilable rate measure the test generator’s ability to write code. The difference between
these two is that compilable code is executable, while this is not necessarily true for parsable code.
The mutation score measures the test generator’s test quality.
These metrics should be used in different phases. The mutation score should be used when the
model can generate working unit tests to measure the test suite effectiveness. The compilable code
metric should be used when the machine learning model is aware of the language’s grammar to measure
the ability of writing working code. If the mutation score and compilable code metric cannot be used,
the parsable code metric should be applied. This measures how well the model knows the grammar
of the programming language.
4.1.2 Measurement
The parsable rate can be measured by calculating the percentage of the code that can be parsed
with the grammar of the programming language. The parsable rate can be calculated by dividing
the amount of code that parses by the total amount of analyzed code. The same calculation as for
parsable rate can be applied for the compilable rate. However, instead of parsing the code with the
grammar, the code should be compiled with a compiler.
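As an illustration, a minimal sketch of how the parsable rate could be computed with JavaParser; it assumes the predictions are method bodies, so each one is wrapped in braces before parsing (the wrapping is our assumption here):

import com.github.javaparser.ParseProblemException;
import com.github.javaparser.StaticJavaParser;
import java.util.List;

public class ParsableRate {

    // Returns the fraction of predicted test bodies that parse as Java blocks.
    static double parsableRate(List<String> predictedBodies) {
        if (predictedBodies.isEmpty()) {
            return 0.0;
        }
        int parsable = 0;
        for (String body : predictedBodies) {
            try {
                StaticJavaParser.parseBlock("{" + body + "}"); // throws if the grammar is violated
                parsable++;
            } catch (ParseProblemException e) {
                // Not parsable: the prediction does not follow the Java grammar.
            }
        }
        return (double) parsable / predictedBodies.size();
    }
}

The compilable rate follows the same pattern, with the parse call replaced by an invocation of the Java compiler.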
The mutation score is measured with a fork of PIT Mutation Testing1 . This fork is used because it
combines multiple mutation generation approaches, which leads to a more realistic mutation score [PMm17].
4.2 Baseline
The effectiveness of a project’s manually written test suite and of automatically generated test suites
are used as the baseline. Only test suite generators that implement search-based testing and random
testing are considered, because many open-source tools are available for these methods and they are
often used in related work. We use Evosuite2 for search-based testing and Randoop3 for random
testing, as these are the highest scoring open-source test suite generators in the 2017 SBST Java Unit
Testing Tool Competition in their respective categories [PM17].
Once tests can be generated, we will evaluate our approach on six projects. These projects are
selected based on a set of criteria. The state-of-the-art mutation testing tool (see Section 4.1) has to support
these projects, and the projects should have a varying mutation coverage. The varying mutation
coverage is needed to evaluate projects with a different test suite effectiveness. This way we could
determine how much test suite effectiveness we can contribute to projects with a low, medium, and
high test suite effectiveness. We divided projects with a test suite effectiveness around 30% into the
low category, projects around 55% into the medium category, and around 80% into the high category.
These percentages are based on the mutation coverage of the analyzed projects. The selected projects
can be found in Table 4.1.
1 https://github.com/pacbeckh/pitest
2 http://www.evosuite.org/
3 https://randoop.github.io/randoop/
Chapter 5
Experimental Setup
How a test suite generator can be developed in general was discussed in Chapter 3. The chapter
contains details on how training examples can be collected and explains how these examples can be
used in machine learning models. How the test generator can be evaluated is discussed in Chapter 4.
Several metrics are included, and a baseline is given. This chapter gives insight into how the test suite
generator is developed for this research. We included additional criteria to simplify the development of
the test generator. For instance, we only extract training examples from projects with a dependency
manager to relieve ourselves of manually building projects.
• As mentioned, we have to partition the projects to cope with the limit of at most 1,000
results per query. The partitioning is performed by querying Java projects starting from a
certain project size; each partition is retrieved with 10 requests that return the results in batches of
100. This step is repeated with an updated start size until all projects are obtained.
1 https://maven.apache.org/
2 https://gradle.org/
3 https://github.com/
• For each project, a call has to be made to determine if the project uses JUnit in at least one
of their unit tests. This can be done by searching for test files that use JUnit. The project is
excluded when it does not meet this criterion.
• Additional calls have to be made to determine if the project has a build.gradle (for Gradle
projects) with the JUnit4 dependency or a pom.xml (for Maven projects) with the JUnit 4 de-
pendency. An extra call is required for Maven projects to check if it has the JUnit 4 dependency.
The dependency name ends either with junit4, or is called JUnit and has version 4.* inside the
version tag.
The number of requests needed for each operation could be used to limit the total number of
requests required. This can improve the total time required for this process. In our case, we expect
that it is best to check first if a project is a Gradle project before checking if it is a Maven project,
because more requests are required for Maven projects.
In conclusion, to list the available projects, one request has to be made for every 100 projects. Each
project requires additional requests: one request to check if a project has tests, one additional request
for projects that use Gradle, and at most two extra requests for projects that use Maven.
So, to analyze n projects, at minimum n ∗ ((1/100) + 1)/30 and at maximum n ∗ ((1/100) + 4)/30
minutes are required.
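This estimate can be captured in a few lines of code; the divisor of 30 corresponds to GitHub's rate limit of roughly 30 authenticated search requests per minute, which is our reading of the formula above:

public class RequestEstimate {

    // GitHub's search API allows about 30 authenticated requests per minute,
    // which is the divisor in the formula above.
    private static final double REQUESTS_PER_MINUTE = 30.0;

    // Minimum: one listing request per 100 projects plus one test-check request per project.
    static double minMinutes(long n) {
        return n * ((1.0 / 100) + 1) / REQUESTS_PER_MINUTE;
    }

    // Maximum: listing plus the test check, the Gradle check, and two Maven checks per project.
    static double maxMinutes(long n) {
        return n * ((1.0 / 100) + 4) / REQUESTS_PER_MINUTE;
    }

    public static void main(String[] args) {
        long n = 10_000; // hypothetical number of projects to analyze
        System.out.printf("%.0f to %.0f minutes for %d projects%n",
                minMinutes(n), maxMinutes(n), n);
    }
}

For 10,000 projects this gives roughly 337 to 1,337 minutes, which is why ordering the cheaper checks first matters.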
The test class and test methods can be extracted based on the test report of each project. The test
report consists of test classes with the number of unit tests that succeeded and how many failed or
did not run for any other reason. From the report we cannot differentiate between test methods that
succeeded or failed. So, we only consider test classes for which all test methods succeeded. The test
methods from the test class can be extracted by listing all methods with a @Test annotation inside
the test class.
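A minimal JavaParser sketch of that last step; selecting only the test classes whose report shows no failures is left out here:

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.MethodDeclaration;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
import java.util.stream.Collectors;

public class TestMethodExtractor {

    // Lists all methods annotated with @Test in a test class source file.
    static List<MethodDeclaration> extractTestMethods(File testClassFile) throws FileNotFoundException {
        CompilationUnit cu = StaticJavaParser.parse(testClassFile);
        return cu.findAll(MethodDeclaration.class).stream()
                .filter(method -> method.getAnnotationByName("Test").isPresent())
                .collect(Collectors.toList());
    }
}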
5.3.2 Compression
Neural networks perform worse when making predictions over long input [BCS+ 15]. Compression
could improve the results by limiting the sequence length. For this view, we used the algorithm
proposed by Ling et al. [LGH+ 16]. They managed to reduce code size without impacting the quality.
On the contrary, it improved their results.
This view is an addition to the tokenized view, described in Section 5.3.1. The view compresses
the training data by replacing token combinations with a new token. An example is displayed in
Figure 5.2. The input code is converted to the tokenized view, and a new number replaces repeated
combinations of tokens. Additionally, in an ideal situation, compression could also improve the results,
for example when learning a pattern on a single combined token is easier than learning it on a group of
tokens.
Figure 5.2: Example of compression view
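A minimal sketch of one compression step as we read this view (picking the most frequent adjacent pair is our assumption; Ling et al. describe the full algorithm). The merge is recorded so that predictions can be expanded back to plain tokens:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairCompressor {

    // Replaces every occurrence of the most frequent adjacent token pair with a new token id.
    static List<Integer> compressOnce(List<Integer> tokens, int newToken, Map<Integer, int[]> merges) {
        // Count every adjacent pair; each pair is packed into one long key.
        Map<Long, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            long key = ((long) tokens.get(i) << 32) | (tokens.get(i + 1) & 0xffffffffL);
            counts.merge(key, 1, Integer::sum);
        }
        if (counts.isEmpty()) {
            return tokens;
        }
        long best = counts.entrySet().stream().max(Map.Entry.comparingByValue()).get().getKey();
        int first = (int) (best >> 32);
        int second = (int) best;
        merges.put(newToken, new int[] {first, second}); // needed to expand predictions again

        // Rewrite the sequence with the merged token.
        List<Integer> compressed = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (i + 1 < tokens.size() && tokens.get(i) == first && tokens.get(i + 1) == second) {
                compressed.add(newToken);
                i++; // skip the second element of the merged pair
            } else {
                compressed.add(tokens.get(i));
            }
        }
        return compressed;
    }
}

Applying compressOnce repeatedly, with a fresh newToken each time, corresponds to the compression levels used later in the experiments.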
5.3.3 BPE
The tokenization system mentioned in Section 5.3.1 splits on words. Nevertheless, some of these words
belong together. This information could be useful during prediction and can be given to the neural
network by using BPE. BPE introduces a new token ”@@ ” to connect subsequences. The network
learns patterns based on these subsequences, and they are also applied to words that have similar
subsequences. Figure 5.3 shows an example of this technique applied to source code. In this figure,
the sequence ”int taxTotal = taxRate * total” is converted into ”int tax@@ Total = tax@@ Rate *
total” so that the first ”tax” is connected with Total and the last ”tax” is connected with Rate.
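A sketch of the segmentation side of this idea, assuming the subword vocabulary has already been learned; greedy longest-match is used for brevity, which simplifies the actual merge-based BPE procedure:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class BpeSegmenter {

    // Splits a token into known subword units, joining all but the last with "@@ ".
    static String segment(String token, Set<String> subwordVocab) {
        if (subwordVocab.contains(token)) {
            return token; // already a known unit, nothing to split
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = token.length();
            // Greedy longest match against the learned subword vocabulary.
            while (end > start && !subwordVocab.contains(token.substring(start, end))) {
                end--;
            }
            if (end == start) {
                end = start + 1; // no known subword: fall back to a single character
            }
            parts.add(token.substring(start, end));
            start = end;
        }
        return String.join("@@ ", parts);
    }
}

With a vocabulary containing tax, Total, and Rate, segment("taxTotal", vocab) yields "tax@@ Total", matching the example above.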
Figure 5.4: Example of AST view
1 (ifStatement
2 (assign ( variable a)variable (number 1)number)assign
3 (equals (number 1)number (variable b)variable)equals
4 (assign ( variable a)variable (number 2)number)assign
5 )ifStatement
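A minimal sketch of how an SBT string of this shape can be produced from a small tree, including the backslash escaping of parentheses and underscores that the back-translation in Section 5.4.2 relies on; the Node type is a stand-in for the real JavaParser AST:

import java.util.Arrays;
import java.util.List;

public class SbtBuilder {

    static class Node {
        final String type;          // e.g. "ifStatement", "variable", "number"
        final String value;         // e.g. "a" or "1"; empty for non-leaf nodes
        final List<Node> children;

        Node(String type, String value, Node... children) {
            this.type = type;
            this.value = value;
            this.children = Arrays.asList(children);
        }
    }

    // Structure-based traversal: "(" + content + child SBTs + ")" + type.
    static String sbt(Node node) {
        String content = node.value.isEmpty() ? node.type : node.type + " " + node.value;
        StringBuilder sb = new StringBuilder("(").append(escape(content));
        for (Node child : node.children) {
            sb.append(sbt(child));
        }
        return sb.append(")").append(escape(node.type)).toString();
    }

    // Prepends a backslash to parentheses and underscores inside AST content,
    // so they cannot be confused with the structure tokens of the SBT itself.
    static String escape(String content) {
        return content.replaceAll("([()_])", "\\\\$1");
    }
}

For example, sbt(new Node("assign", "", new Node("variable", "a"), new Node("number", "1"))) produces (assign(variable a)variable(number 1)number)assign, matching the second line of the listing above.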
5.4 Experiments
In Section 5.3 we explained how the machine learning models can be trained. In this section, we are
going to list all the experiments that we are going to perform. First, an ideal set of training examples
and a basic network configuration are selected with which a metric score as high as possible can be
achieved. For RQ1, models with the different data representations mentioned in Section 5.3 (SBT, BPE, and
Compression) are trained. Additionally, a model that uses a network configuration from related research
is trained, and a model with an optimized configuration is also trained. For RQ2, models are trained
to measure the time required to train with various levels of compression and to evaluate
how the accuracy develops when compression is applied.
To evaluate which experiment has the best results, we have to compare them. In Section
4.1.3 we stated that we test this with statistics. In this section, we go into more detail.
5.4.1 The ideal subset of training examples and basic network configuration
We created a basic configuration for our experiments. This configuration is used as the basis for all
experiments. It is important that this configuration contains the best performing dataset. Otherwise,
it is unclear if bad predictions are due to the dataset or the newly used method.
For our experiments, we use the Google seq2seq project4 from April 17, 2017. We use attention
in our models, as attention enables a neural network to learn from long sentences [VSP+ 17]. With
attention, a soft search on the input sequence is done during prediction in order to add contextual
information. This should make it easier to make predictions on large sequence sizes. The network has
a single encoder layer and a single decoder layer of 512 LSTM nodes, has an input dropout of 20%, uses the Adam
optimizer with a learning rate of 0.0001, and has a sequence cut-off at 100 tokens.
4 https://github.com/google/seq2seq
5.4.2 SBT data representation
Hu et al. [HWLJ18] achieved better results when translating a structural representation of source code
(SBT) into text instead of translating source code into text directly. Applying SBT to our experiment
could likewise be beneficial because it could also improve our results. We use a seq2seq neural network
with LSTM nodes for this experiment because this setup is the most comparable to theirs.
However, when we train our model on how an SBT can be translated into another SBT, it will
output an SBT representation as the prediction. Thus, we need to build software that can convert
the SBT back into code.
In the research where SBT is proposed, code was converted into text [HWLJ18]. There was no
need to convert the SBT back. We had to make a small modification to support back-translation.
We extended the algorithm with an escape character. In the AST information, every parenthesis and
underscore is prepended with a backslash. This makes it possible to differentiate between the tokens
that are used to display the structure of the AST and the content of the AST.
For the translation from the SBT to the AST, we developed an ANTLR5 grammar that can interpret
the SBT and can convert it back to the original AST. For the conversion from the AST to code, we
did not have to develop anything. This was built into our AST library (JavaParser6). For the validation
of our software, we converted all our code pairs described in Section 6.1.3 back from the SBT to code.
is from predicting the truth. We can conclude that the model is not learning when the loss increases
from the start of the experiment. This would mean that compression does not work on our dataset.
We have only used the compression level 1 dataset for this experiment.
Chapter 6
Results
In Chapter 3 and Chapter 5, it is discussed how our experiments are performed in order to answer
RQ1 and RQ2. In this chapter, we report on the obtained results. In addition, we also report on the
techniques used to generate training sets. This does not directly answer a research question; however,
the training examples are used to train the models for both RQ1 and RQ2.
Figure 6.1: Supported unit test by bytecode analysis and
AST analysis
(a) Total link with bytecode analysis (b) Total link with AST analysis
Figure 6.3: Total link with AST analysis and bytecode analysis
In Figure 6.2 it is shown that bytecode analysis can create more links compared to AST analysis. In
Figure 6.3a and Figure 6.3b, it is shown that AST analysis could create more links as it had support
for more tests.
Concrete classes
In 83 of the cases, a concrete class was tested. Bytecode analysis, unlike AST analysis, could link
these because it knows that they were used. AST analysis tries to match the interfaces without having
knowledge of the real class under test and incorrectly links it to another class that has some matching
names by coincidence.
Additionally, in 37 other cases, a class was tested that overrides methods from another class. AST
analysis lacks the information about what class is used when a base class is used in the type definition.
So again, AST analysis fails to create a correct link, due to its lack of awareness of the real class under
test.
However, in 25 cases there were multiple concrete classes tested within one unit test class. These
were included in the test class, since they all have the same base type. Bytecode analysis will treat
every concrete class as a different class and will divide all the matches among all classes. With
bytecode analysis, an incorrect link was made to another class that had a similar interface. AST
analysis was not aware of the concrete classes and linked the tests to the base class. Additionally,
in 6 other cases, bytecode analysis failed to link to the correct concrete class that was tested. AST
analysis linked these tests to the base class.
Subclasses
For 24 unit tests, the method that was tested was within a subclass or was a lambda function.
Bytecode analysis could detect these, because these calls are included in the call graph. We did not
support this in the AST analysis.
Unfortunately, this also has disadvantages. In 14 other cases mocking was used. Bytecode analysis
knew that a mock object was used and linked some tests to the mock object. However, the mock
object is not tested. The unit tests validated, for example, that a specific method was called during
the unit test. Bytecode analysis incorrectly linked the test with the method that should be called and
not the method that made the call. AST analysis did not recognize the mock objects, and therefore
it could link the test to the method that was under test.
Unclear naming
The naming of a unit test does not always match the intent of the unit test. In 13 cases, multiple
methods were tested in the unit test. We only allowed our analysis tools to link to one method. Both
AST analysis and bytecode analysis were correct, but they selected a different method. In 19 cases
for AST analysis and 3 cases for bytecode analysis, an incorrect method was linked because of unclear
naming. In 8 cases, it was not clear what method performed the operation that was tested. Multiple
methods could do the operation. In 2 cases, it was not clear what was tested.
Figure 6.4: Roadmap of all experiments
To resolve this issue, examples that have the same method body were grouped and assigned to a
single dataset. This prevents code with the same implementation from being spread across the test,
validation, and training sets. However, because groups are assigned to sets instead of single tests, the
sets may not have exactly the intended size. Nevertheless, a few more or fewer training
examples in a set will not make a difference. Therefore, we performed our experiments with the sets
even if they were of slightly different sizes.
Experiment results
We expect that the low parsable code rate is caused by the high complexity of the training data.
During manual analysis, we found the following complexities:
• Training examples were cut-off to a maximum of 100 tokens, because of our configured sequence
length
• There are training examples that use more complex language features such as inheritance and
default methods
• The training set contains token sequences that are used only once
An overview of the current results is shown in Figure 6.5.
6.2.2 Training data simplification
In these experiments, we are trying to reduce the complexity that was encountered with the naive
approach. A list of complexities can be found in Section 6.2.1. We tried to train a model on less
complex training examples, training examples that have common subsequences, and training examples
with a limited size.
Limited size
In our default network configuration, we cut off sequences at a sequence length of 100 tokens. To
prevent the training examples from being cut off, we filtered the examples on a maximum size of 100 tokens.
With this criterion, we achieved 69.79% parsable code. More details on the experiment can be found
in Table 6.4.
Experiment results
As shown in Figure 6.6, only the experiment where we limited the sequence size results in a higher
percentage of parsable code.
accepted examples with a maximum sequence length of 200 tokens. We kept the same number of
training examples, so we can monitor what happens when we allow longer sequences. As shown in
Table 6.5, we achieved 31.78% parsable code with this configuration.
Table 6.5: Details on experiment with maximum sequence size 200 and limited number of training
examples
Table 6.6: Details on experiment with maximum sequence size 200 and more training examples
Table 6.7: Details on experiment with maximum sequence size 300 and limited number of training
examples
Experiment results
As shown in Figure 6.7, a sequence length of 100 led to the highest parsable rate. Reducing the size further
could improve the results. However, with these criteria, we ended up with only 11,699 training
examples out of 52,585 training examples. This is a reduction of 77.75%. We did not perform more
experiments with limiting the sequence length further, because this also limits the number of training
examples that we can use. Experimenting further with a small set and filtering can lead to unexpected
results. In conclusion, the experiments in this section did not contribute to a higher parsable score.
An overview of the experiments is shown in Figure 6.7.
Table 6.8: Details on experiment with maximum sequence size 100 and common subsequences
Table 6.9: Details on experiment with maximum sequence size 100 and no concrete classes and no
default methods
Experiment results
As shown in Figure 6.8, we were able to improve the percentage of parsable code from 69.79% to
70.50% by only using training examples with common subsequences. This is a small improvement
and could be caused randomly. We address this issue in Section 6.2.8.
Figure 6.8: Experiments in combination of optimizations
Compression
With compression, tokens that belong together could be merged. Therefore, the neural network does
not need to learn that a combination of multiple tokens has a specific effect. This effect only has to be
linked to one token. We applied compression on the last best-performing dataset. We trained models
with compression levels 1 to 10 (S10). The change in the maximum size of the training examples is
shown in Figure 6.9. When we analyze the models after training, we can observe that the model’s
loss increases immediately. This means that the model is unable to find patterns in the dataset and
that compression is not suitable for our dataset. This is further investigated in Section 6.3.2. More
details on this experiment can be found in Table 6.10.
Table 6.10: Details on experiment with maximum sequence size 100, common subsequences, and
compression
BPE
With BPE, we tell the neural network that two tokens belong together. We applied BPE to the last
best performing dataset. With these criteria, we achieved 57.55% parsable code. More details on this
experiment can be found in Table 6.11.
Table 6.11: Details on experiment with maximum sequence size 100, common subsequences, and
BPE
Experiment results
Both compression and BPE did not improve our results. An overview of the experiments can be found
in Figure 6.10.
Figure 6.10: Experiments with different data representations
Table 6.13: Details on experiment with maximum sequence size 100, common subsequences, and
different network configurations
SBT
We also performed tests with SBT. We were unable to generate parsable code with this technique.
More information about the results can be found in Section 6.4.
Experiment results
An overview of all the experiments can be found in Figure 6.11. It can be concluded that S20 performs
the best. This model uses two output layers and a dropout of 50%.
Figure 6.11: Experiments in different network configurations
1 @Test public void testGetSequence() {
2 assertEquals(1, m_SIZE.getBytes());
3 }
Figure 6.12: All experiment results
The percentages in the figure are not statistically checked. As mentioned in Section 5.4.8, we only
analyzed the experiments that improved our results. These are experiments S4, S8, S13, and S20. In
total, we trained five models for each experiment. The parsable scores are shown in Table 6.14. The
columns represent the experiments, and the rows are the different runs.
Table 6.14: Parsable code score for the most important experiments with different seeds
• H0: There is no significant difference between the samples
• H1: There is a significant difference between the samples
We apply the algorithms discussed in Section 5.4.8 to evaluate if H0 can be rejected. The results
of the experiments can be found in Table 6.15. We have a p-value lower than 0.05, so we can reject
H0. This means that there is a significant difference between the groups.
We apply the algorithms discussed in Section 5.4.8 to evaluate what the differences in results are.
The results of the experiments can be found in Table 6.16. We can reject H0 for the experiments
that have a p-value lower than 0.05. That means that there is a statistically significant difference
between S8 and S13, S8 and S20, and S13 and S20.
In Section 6.2.4, we discussed that common subsequences with a maximum sequence length of 100
(S8) improved the parsable score of a maximum sequence length of 100 (S4) by 0.71 percentage points.
When we calculate the averages of the follow-up experiments in Table 6.14, then S4 has an average
of 57.63%, while S8 has an average of 64.12%. We did not find a significant difference between these
experiments. Therefore, we cannot conclude which filter on the dataset is better.
As shown in Table 6.16, S20 should have better results than S8 and S13, and S13 should have better
results than S8. We came up with the following hypothesis for the directional t-test:
We apply the directional t-test discussed in Section 5.4.8 to evaluate the hypothesis. The results
are displayed in Table 6.17. For these experiments, we corrected the alpha to 0.05/3 to keep the same
overall chance of an error. In these tables, it is shown that the p-value is lower than 0.05/3 (≈ 0.0167). This
means that H0 can be rejected in all cases.
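To make the procedure concrete, the sketch below shows how such a comparison could be reproduced with Apache Commons Math: a one-way ANOVA as the omnibus test (as in Table 6.18) followed by pairwise t-tests with the Bonferroni-corrected alpha of 0.05/3. The library choice and the score values are illustrative, not the exact pipeline used for these tables:

import java.util.Arrays;
import java.util.List;
import org.apache.commons.math3.stat.inference.OneWayAnova;
import org.apache.commons.math3.stat.inference.TTest;

public class ExperimentComparison {

    public static void main(String[] args) {
        // Parsable-code scores of five runs per experiment (illustrative values only).
        double[] s8  = {63.5, 64.0, 64.2, 64.4, 64.5};
        double[] s13 = {65.8, 66.0, 66.1, 66.3, 66.6};
        double[] s20 = {67.9, 68.1, 68.2, 68.4, 68.6};

        // Omnibus test: is there any difference between the experiments?
        List<double[]> groups = Arrays.asList(s8, s13, s20);
        double anovaP = new OneWayAnova().anovaPValue(groups);
        System.out.println("ANOVA p-value: " + anovaP);

        // Pairwise, directional comparison with a Bonferroni-corrected alpha.
        double alpha = 0.05 / 3;
        TTest tTest = new TTest();
        // tTest() returns a two-sided p-value; it is halved here for a one-sided test,
        // which is valid only when the observed difference points in the hypothesized direction.
        double pS20vsS13 = tTest.tTest(s20, s13) / 2;
        System.out.println("S20 better than S13: " + (pS20vsS13 < alpha));
    }
}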
Table 6.17: Directional t-test of significantly different experiments
We hypothesized that compression will impact the time needed to train the models. This leads to
the following hypotheses:
• H0: There is no significant difference between the samples
• H1: There is a significant difference between the samples
We applied the algorithms discussed in Section 5.4.8 to evaluate if H0 can be rejected. The results
of the experiments can be found in Table 6.18. We have a p-value lower than 0.05. Thus, H0 can be
rejected. This means that there is a significant difference between the groups.
Table 6.18: ANOVA on compression timing
We applied the algorithms discussed in Section 5.4.8 to evaluate what the differences are in duration.
The results of the experiments can be found in Table 6.19. We found a p-value lower than 0.05 for
all groups, except for no compression paired with compression level 1. This means that there is
a significant difference between all groups, except for the group with no compression paired with
compression level 1.
To test whether the required time decreases when the compression level increases, we formulated
the following hypothesis:
We applied the directional t-test discussed in Section 5.4.8 to evaluate the hypothesis. An overview
of the p-values can be found in Table 6.20. H0 can be rejected for all tested compression levels.
Experiment P-value
No compression vs. Compression level 2 <0.0001
No compression vs. Compression level 10 <0.0001
Compression level 1 vs. Compression level 2 <0.0001
Compression level 1 vs. Compression level 10 <0.0001
Compression level 2 vs. Compression level 10 <0.0001
parsable code (based on S8 in Figure 6.8). We visualized the loss on the validation set in Figure 6.14.
The higher the loss, the worse the model performs.
The first step in this experiment is to show that our results are not due to chance. Therefore, we
want to determine whether the loss differs between epochs. We formulated the following hypotheses:
We applied the algorithms discussed in Section 5.4.8 to evaluate whether H0 can be rejected. The
results of the experiments can be found in Table 6.21. We have a p-value lower than 0.05, so we can
reject H0. This means that there is a significant difference between the groups.
We applied the algorithms discussed in Section 5.4.8 to evaluate what the differences in the loss
are. The results of the experiments can be found in Table 6.22. We found a p-value lower than 0.05
for all groups. This means that there is a significant difference between all the groups.
Table 6.22: Difference in loss groups
Experiment P-value
Epoch 200 vs. Epoch 100 <0.0001
Epoch 300 vs. Epoch 200 <0.0001
Epoch 400 vs. Epoch 300 <0.0001
Epoch 500 vs. Epoch 400 <0.0001
Epoch 600 vs. Epoch 500 <0.0001
Epoch 700 vs. Epoch 600 <0.0001
Epoch 800 vs. Epoch 700 <0.0001
Epoch 900 vs. Epoch 800 <0.0001
Epoch 1,000 vs. Epoch 900 <0.0001
6.4.2 Training
To support the sequence length of 714, we could only use a neural network with LSTM nodes of size
64 instead of 512. Using LSTM nodes of size 512 was impossible because of memory limitations. We
selected the smallest 9,998 training examples for method code and 9,995 training examples for SBT.
We trained a model for both sets with this network setup. We could not generate valid SBT trees. We
were able to generate parsable method code. However, the generated code did not make any sense:
it consisted of method bodies that asserted two empty strings. The results are inconclusive, because it
was impossible to use a reasonably sized neural network in combination with SBT.
Chapter 7
Discussion
In this chapter, we discuss our experiment results on i) what neural network solution performs best to
generate unit tests, and ii) how compression affects accuracy and performance. First, we start with a
summary of our most important results.
how well the code tests. It will give an idea if the machine learning model starts to understand the
programming language. This is a first step towards generating test code. However, more research on
other metrics could give a better insight into the progress towards a unit test generator using machine
learning.
7.2.4 BPE
Applying BPE on our datasets reduced the amount of parsable code that could be generated. We
used BPE to include relations between words and symbols instead of between word parts. Other research found
that in NLP adding the relation between word parts improves results [SHB15]. A reason why
this does not work on words in code could be that there are too many relations that make no
sense. For example, because of BPE, the word stop in both stopSign and stopWalking shares some level
of meaning. These two identifiers do not have much in common when translating to code. This could
be the reason why we observed a drop in the percentage of generated parsable code. However, we did
not perform a statistical check to validate these results.
7.2.6 SBT
We were unable to use SBT to generate code, since it outputs sequences which are too long. We were
unable to reduce the sequences to a length at which we could train a sufficiently large neural network. Also, we
could not compress the sequences too much, because the whole point is to add additional information.
When we compress it too much, we lose this information. In future research, other machine learning
solutions could be applied which are capable of handling this kind of data. For instance, instead of
translating based on words, translations could be done based on sections [HWZD17].
7.4 Limitations
The scope of this project is to find out if neural networks can be used for unit test generation. We
still left many questions and problems unanswered, of which some are listed in this chapter.
7.4.1 The used mutation testing tool
To calculate the mutation score of our baseline, we used an alternative version of PIT Mutation
Testing. During our study, this tool achieved the most reliable results for measuring test suite
effectiveness. However, new versions of the original tool have been released and could outperform this
version. The mutation generation ability of the original tool should be checked in future research, and
the original tool should be used when it performs better. Our research is not invalidated if the original
tool performs better, because we have not performed comparisons against the baseline yet. Even if we
had, it would not have made a difference; the results would only become more accurate when comparing
to real faults.
7.5.3 The machine learning model is unaware of implementations
During training, the machine learning model learns the usage of many methods. When the model
is applied for making predictions on unseen data, it will see new method calls which it has not
encountered before. The model could know what the call does based on its naming and context.
However, it would be easier for the machine learning algorithm to have more detailed information
such as the implementation of the method. Thus, the information in our training examples is limited.
Adding more detail could make predictions easier.
Chapter 8
Related work
In our research, we translate source code (method body) to source code (unit test). We are unaware of
research that does a similar translation. However, many related projects apply machine learning
to translate from or to source code. An overview of recent projects can be found in Table 8.1. The
difference between these projects and our work is that they do not perform a translation to test code.
The work of Devlin et al. [DUSK17] is closely related because they address the issue of finding software
faults. However, compared to our work, they developed a static analysis tool instead of a test suite
generator. They try to find locations in the code that could have bugs while we generate code that
tests whether the behavior of a piece of code is still the same as when the test was written.
Chapter 9
Conclusion
Incorporating machine learning in a test generator could lead to better test generation. A generator
like this can not only be used to generate a complete test suite; it can also be used to aid software
engineers in writing tests. Nevertheless, more work has to be done to get to this point.
To our knowledge, this research is the first study towards test generation with the application of
machine learning. We were able to create a training set for this problem, and managed to train
neural networks with this dataset. We extracted more than 52,000 training examples from 1,106
software projects. When training machine learning models, we noticed that the quality of our models
improved when configurations were tweaked and when the training data was filtered. Our first results
are promising, and we see many opportunities to continue the development of the test generator. We
believe that our results could be further improved when a machine learning algorithm specific for code
translation is used, if the neural network is trained on how small parts of the code are tested instead
of whole methods, and when the neural network has more detailed implementation information. Our
results can be used as a baseline for future research.
We also experimented with machine learning approaches that improved the results of other studies.
We applied compression, SBT, and BPE on our training set. A correlation was found that indicates
that the time needed to train a model is reduced when compression is applied. An improved com-
pression algorithm or a dataset optimized for the compression technique could be used to create a
model that can make good predictions in less time, which would make research in this field more efficient.
We also improved the application of SBT in translations to code. We developed an algorithm that
can translate an SBT representation back to code (Appendix A). This tool could already be used for
experiments when the output length is limited. BPE was also applied, to let our model use additional
information on how parts of the training data are connected. In our experiment, it did not have a
positive effect. However, optimizing the number of connections could work for our datasets when only
connections that add relevant information are included.
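To illustrate the idea behind such a conversion (this is a simplified, hypothetical sketch, not our actual converter), assume the SBT form "( label children ) label" and assume that the concrete code tokens are stored in the leaf labels. A leaf then appears as the pattern "( token ) token", so the code token stream can be recovered with a single scan:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Simplified sketch: recover the code token stream from an SBT sequence, assuming
    // code tokens occur only in leaf labels and that "(" and ")" occur only as SBT
    // structure markers. A full converter also has to rebuild the tree, handle node
    // type information, and cope with malformed sequences produced by the network.
    public class SbtLeafTokens {

        public static List<String> decode(List<String> sbt) {
            List<String> code = new ArrayList<>();
            for (int i = 0; i + 2 < sbt.size(); i++) {
                // A leaf node is encoded as:  ( token ) token
                if ("(".equals(sbt.get(i)) && ")".equals(sbt.get(i + 2))) {
                    code.add(sbt.get(i + 1));
                }
            }
            return code;
        }

        public static void main(String[] args) {
            // SBT of a tiny expression tree for "a + b"
            List<String> sbt = Arrays.asList(
                    "(", "InfixExpression",
                    "(", "a", ")", "a",
                    "(", "+", ")", "+",
                    "(", "b", ")", "b",
                    ")", "InfixExpression");
            System.out.println(decode(sbt)); // prints [a, +, b]
        }
    }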
All steps, from finding suitable GitHub projects to calculating the parsable score, are automated in scripts. A copy can be found in Appendix A. With these scripts, all results of our experiments can be reproduced. The first part of the scripts can be executed to retrieve training examples; alternatively, the provided backups with the training examples we used can be restored. If it is preferred to download all projects, the overview of all used repositories should be used; the GitHub links include the commits we used. The second part is to configure the dataset generator according to the parameters of our experiments. From this point, the remaining scripts can be used to execute the rest of the pipeline and to calculate the parsable score.
To summarize, with this research we contribute an algorithm that can be used to generate training examples (for method-to-test translations), the training sets used in our experiments, an SBT-to-code implementation, and a neural network configuration that can learn basic patterns between methods and test code. Finally, we also contribute software that takes GitHub repositories as input and produces as output a model that can be used to predict tests with machine learning.
Chapter 10
Future work
There are still additional experiments that could add value to the presented work. We did not manage to run experiments with SBT, and we did not investigate further whether the percentage of common subsequences has an impact on the results of the neural network. Additionally, we identified some promising directions for a machine learning based test generator that we did not address.
that can eliminate predictions while they are only partially generated. Predictions could be eliminated as soon as they no longer comply with the language's grammar. This could have a positive impact, since models are applied to the validation set during training.
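As a minimal, hypothetical sketch of this idea (the class name is illustrative; a real implementation would use an incremental parser for the full Java grammar rather than only bracket nesting), a partial prediction can be discarded as soon as its bracket structure can no longer be completed to valid code:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Minimal sketch: reject a predicted token prefix as soon as it is irreparably
    // malformed. Only bracket nesting is checked here; unclosed brackets are still
    // allowed because later tokens may close them.
    public class PartialPredictionFilter {

        private static final String OPEN = "({[";
        private static final String CLOSE = ")}]";

        public static boolean canStillBeValid(List<String> prefixTokens) {
            Deque<Character> stack = new ArrayDeque<>();
            for (String token : prefixTokens) {
                if (token.length() != 1) {
                    continue; // only single-character bracket tokens are checked
                }
                char c = token.charAt(0);
                if (OPEN.indexOf(c) >= 0) {
                    stack.push(c);
                } else if (CLOSE.indexOf(c) >= 0) {
                    // A closing bracket must match the most recently opened one.
                    if (stack.isEmpty() || OPEN.indexOf(stack.pop()) != CLOSE.indexOf(c)) {
                        return false;
                    }
                }
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(canStillBeValid(List.of("assertEquals", "(", "1", ",", "f", "(", ")"))); // true
            System.out.println(canStillBeValid(List.of("assertEquals", "(", "1", ")", ")")));           // false
        }
    }

If the decoder keeps several candidate outputs, such a check can be applied after every generated token, so that candidates that can no longer become parsable are dropped early instead of being fully generated and rejected afterwards.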
Bibliography
[AHF+17] M. Moein Almasi, Hadi Hemmati, Gordon Fraser, Andrea Arcuri, and Jānis Benefelds.
An industrial evaluation of unit test generation: Finding real faults in a financial ap-
plication. In Proceedings of the 39th International Conference on Software Engineering:
Software Engineering in Practice Track, 2017.
[APS16] Miltiadis Allamanis, Hao Peng, and Charles Sutton. A convolutional attention network
for extreme summarization of source code. In International Conference on Machine
Learning, 2016.
[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL: http://arxiv.org/abs/1409.0473, arXiv:1409.0473.
[BCS+15] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua
Bengio. End-to-end attention-based large vocabulary speech recognition. CoRR,
abs/1508.04395, 2015. URL: http://arxiv.org/abs/1508.04395, arXiv:1508.04395.
[BdCHP10] Jose Bernardo Barros, Daniela da Cruz, Pedro Rangel Henriques, and Jorge Sousa
Pinto. Assertion-based slicing and slice graphs. In Proceedings of the 2010 8th
IEEE International Conference on Software Engineering and Formal Methods, SEFM
’10, pages 93–102, Washington, DC, USA, 2010. IEEE Computer Society. URL:
http://dx.doi.org/10.1109/SEFM.2010.18, doi:10.1109/SEFM.2010.18.
[Bel17] Tony Beltramelli. pix2code: Generating code from a graphical user interface screenshot. CoRR, abs/1705.07962, 2017. URL: http://arxiv.org/abs/1705.07962, arXiv:1705.07962.
[CDE+08] Cristian Cadar, Daniel Dunbar, Dawson R. Engler, et al. KLEE: Unassisted and automatic
generation of high-coverage tests for complex systems programs. In OSDI, volume 8,
pages 209–224, 2008.
[CGCB14] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empir-
ical evaluation of gated recurrent neural networks on sequence modeling. CoRR,
abs/1412.3555, 2014. URL: http://arxiv.org/abs/1412.3555, arXiv:1412.3555.
[CVMG+14] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[DUSK17] Jacob Devlin, Jonathan Uesato, Rishabh Singh, and Pushmeet Kohli. Semantic code
repair using neuro-symbolic transformation networks. CoRR, abs/1710.11054, 2017.
arXiv:1710.11054.
[FRCA17] Gordon Fraser, José Miguel Rojas, José Campos, and Andrea Arcuri. EvoSuite at the SBST 2017 tool competition. In Proceedings of the 10th International Workshop on
Search-Based Software Testing, 2017.
[FZ12] Gordon Fraser and Andreas Zeller. Mutation-driven generation of unit tests and oracles.
IEEE Transactions on Software Engineering, 38(2), 2012.
[GAG+17] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017. URL: http://arxiv.org/abs/1705.03122, arXiv:1705.03122.
[Git] GitHub. REST API v3 search. https://developer.github.com/v3/search/. Accessed: 2018-06-04.
[HWLJ18] Xing Hu, Yuhan Wei, Ge Li, and Zhi Jin. Deep code comment generation. ICPC 2018,
2018.
[HWZD17] Po-Sen Huang, Chong Wang, Dengyong Zhou, and Li Deng. Neural phrase-based machine translation. CoRR, abs/1706.05565, 2017. URL: http://arxiv.org/abs/1706.05565, arXiv:1706.05565.
[JJI+14] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and
Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In
Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of
Software Engineering, 2014.
[Juna] JUnit. JUnit 4 download and install. https://github.com/junit-team/junit4/wiki/Download-and-Install. Accessed: 2018-06-04.
[Junb] JUnit. JUnit 4 release notes. https://sourceforge.net/projects/junit/files/junit/4.0/. Accessed: 2018-06-04.
[Junc] JUnit. JUnit 5 release notes. https://junit.org/junit5/docs/snapshot/release-notes/. Accessed: 2018-06-04.
[KC17] Timotej Kapus and Cristian Cadar. Automatic testing of symbolic execution engines
via program generation and differential testing. In Proceedings of the 32nd IEEE/ACM
International Conference on Automated Software Engineering, 2017.
[KRV14] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. Phrase-based statistical
translation of programming languages. In Proceedings of the 2014 ACM International
Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software,
2014.
[LGH+16] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, Andrew
Senior, Fumin Wang, and Phil Blunsom. Latent predictor networks for code gen-
eration. CoRR, abs/1603.06744, 2016. URL: http://arxiv.org/abs/1603.06744,
arXiv:1603.06744.
[LKT09] Hui Liu and Hee Beng Kuan Tan. Covering code behavior on input validation in functional testing. Inf. Softw. Technol., 51(2):546–553, February 2009. URL: http://dx.doi.org/10.1016/j.infsof.2008.07.001, doi:10.1016/j.infsof.2008.07.001.
[Maa97] Wolfgang Maass. Networks of spiking neurons: the third generation of neural network
models. Neural networks, 10(9):1659–1671, 1997.
[MB89] Keith E. Muller and Curtis N. Barton. Approximate power for repeated-measures ANOVA
lacking sphericity. Journal of the American Statistical Association, 84(406):549–555,
1989.
[ND12] Srinivas Nidhra and Jagruthi Dondeti. Black box and white box testing techniques-a
literature review. International Journal of Embedded Systems and Applications (IJESA),
2(2):29–50, 2012.
[Ost02a] Thomas Ostrand. Black-box testing. Encyclopedia of Software Engineering, 2002.
[Ost02b] Thomas Ostrand. White-box testing. Encyclopedia of Software Engineering, 2002.
[PM17] Annibale Panichella and Urko Rueda Molina. Java unit testing tool competition-fifth
round. In IEEE/ACM 10th International Workshop on Search-Based Software Testing
(SBST), 2017.
[PMm17] P. van Beckhoven, A. M. Oprescu, and M. Bruntink. Assessing test suite effectiveness using static metrics. 10th Seminar on Advanced Techniques and Tools for Software Evolution, 2017.
[Poi] Yolande Poirier. What are the most popular libraries Java developers use? Based on GitHub's top projects. https://blogs.oracle.com/java/top-java-libraries-on-github. Accessed: 2018-06-04.
[PV16] Terence Parr and Jurgen J. Vinju. Technical report: Towards a universal code formatter
through machine learning. CoRR, abs/1606.08866, 2016. arXiv:1606.08866.
[REP+11] Brian Robinson, Michael D. Ernst, Jeff H. Perkins, Vinay Augustine, and Nuo Li. Scaling
up automated test generation: Automatically generating maintainable regression unit
tests for programs. In Proceedings of the 26th IEEE/ACM International Conference on
Automated Software Engineering, 2011.
[SHB15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909, 2015. URL: http://arxiv.org/abs/1508.07909, arXiv:1508.07909.
[SSN12] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for lan-
guage modeling. In Thirteenth Annual Conference of the International Speech Commu-
nication Association, 2012.
[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, 2014.
[VSP+ 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in
Neural Information Processing Systems, pages 6000–6010, 2017.
[YKYS17a] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of
CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
[YKYS17b] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of
CNN and RNN for natural language processing. CoRR, abs/1702.01923, 2017. URL:
http://arxiv.org/abs/1702.01923, arXiv:1702.01923.
[YN17] Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code
generation. CoRR, abs/1704.01696, 2017. URL: http://arxiv.org/abs/1704.01696,
arXiv:1704.01696.
[ZM15] Yucheng Zhang and Ali Mesbah. Assertions are strongly correlated with test suite
effectiveness. In Proceedings of the 10th Joint Meeting on Foundations of Software
Engineering, 2015.
[ZZLW17] Wenhao Zheng, Hong-Yu Zhou, Ming Li, and Jianxin Wu. Code attention: Translating
code to comments by exploiting domain features. CoRR, abs/1709.07642, 2017. arXiv:1709.07642.
Appendix A
The source code, training examples, and everything else used to perform our experiments can be found on GitHub (https://github.com/laurenceSaes/Unit-test-generation-using-machine-learning) and on Stack (https://lauw.stackstorage.com/s/SdX5310wWLUvEQs). An overview of the GitHub repository is given in Table A.1, and an overview of the data on Stack is given in Table A.2.
Table A.1: Overview of the GitHub repository

Fetch projects: Contains code to crawl and filter Java projects hosted on GitHub.

Last version of the Java linking applications: Contains the compiled Java programs used in this research; their source can be found in the folder "Test extraction". Programs are included to fill a linking queue based on test reports, filter training examples, convert code to SBT, convert code to BPE, tokenize code, precache data to speed up the linking process, convert tokens back to code, link tests to methods, convert SBT back to code, export training data, and check whether code is parsable.

Linux script: Contains scripts that automate all experiments in this research. The subfolder "Compile and Test" includes the file pipeline.sh, which converts GitHub repository links into training examples. In trainingData.sh the output format can be specified (SBT/BPE/tokenized).

Machine learning: The script run.sh takes training sets as input and produces a machine learning model as output. The script predict.sh can be used to make predictions with a model.
Table A.2: Overview of the Stack directory

All files for linking server: Contains a zip archive that can be extracted on an Ubuntu 18.04 server. It contains the pipeline.sh script with all its dependencies in place.

Compression loss models: The data of the compression loss experiment.

Experiment S1-S20 random number 1234: Models S1 to S20, trained with the random value 1234.

Extra models for statistics S1-S20: Additional models for S1 to S20.

Other data: Database backups with all training examples, logs, and a flow diagram of the research.

Predictions best model: Predictions (with input) of the best model.

Raw data: All raw data used to generate the diagrams in this thesis.

SBT folders: All information for the SBT experiments.

Seq2Seq server: A pre-configured machine learning server. The folder can be extracted on an Ubuntu 18.04 server; run.sh can be used to train models and parsableTest.sh to generate a parsable score.

Slicing: Early work on a slicing tool that can be used with linking.

Suite validation: Mutation testing used to test the test suite effectiveness of our baseline.

Training: Contains training examples.

Unit test report generator: Scripts ranging from downloading GitHub repositories to generating test reports.