
Mining Software Repair Models for Reasoning on the Search Space of Automated Program Fixing

Matias Martinez, Martin Monperrus

To cite this version:

Matias Martinez, Martin Monperrus. Mining Software Repair Models for Reasoning on the Search Space of Automated Program Fixing. Empirical Software Engineering, 2015, 20 (1), pp.176-205. ⟨10.1007/s10664-013-9282-8⟩. ⟨hal-00903808⟩

HAL Id: hal-00903808
https://inria.hal.science/hal-00903808
Submitted on 13 Nov 2013

Mining Software Repair Models for Reasoning on the Search Space of Automated Program Fixing

Matias Martinez (University of Lille & INRIA), Martin Monperrus (University of Lille & INRIA)

Empirical Software Engineering, Springer, 2013 (accepted for publication on Sep. 11, 2013).

Abstract—This paper is about understanding the nature of bug fixing by analyzing thousands of bug fix transactions of software repositories. It then places this learned knowledge in the context of automated program repair. We give extensive empirical results on the nature of human bug fixes at a large scale and a fine granularity with abstract syntax tree differencing. We set up mathematical reasoning on the search space of automated repair and the time to navigate through it. By applying our method on 14 repositories of Java software and 89,993 versioning transactions, we show that not all probabilistic repair models are equivalent.

I. INTRODUCTION

Automated program fixing consists of generating source code in order to fix bugs in an automated manner [1], [2], [3], [4], [5]. The generated fix is often an incremental modification (a “patch” or “diff”) over the software version exhibiting the bug. Previous contributions in this new research field make different assumptions on what is required as input (e.g. good test suites [2], pre- and post-conditions [3], policy models [1]). The repair strategies also vary significantly; examples of radically different models include genetic algorithms [2] and satisfiability models (SAT) [6].

In this paper, we take a step back and look at the problem from an empirical perspective. What are real bug fixes made of? The kind of results we extensively discuss later are, for instance: in bug fixes of open-source software projects, the most common source code change consists of inserting a method invocation. Can we reuse this knowledge for reasoning on automated program repair? We propose a framework to do so, by reasoning on the kinds of bug fixes. This framework enables us to show that the granularity of the analysis of real commits (which we call “repair models”) has a big impact on the navigation into the search space of program repair. We further show that the heuristics used to build probability distributions on top of the repair models also make a significant difference: not all repair actions are equal!

Let us now make precise what we mean by repair actions and repair models. A software repair action is a kind of modification on source code that is made to fix bugs. We can cite as examples: changing the initialization of a variable; adding a condition in an “if” statement; adding a method call; etc. In this paper, we use the term “repair model” to refer to a set of repair actions. For instance, the repair model of Weimer et al. [2] has three repair actions: deleting a statement, inserting a statement taken from another part of the software, swapping two statements.

There is a key difference between a repair action and a repair: a repair action is a kind of repair, a repair is a concrete patch. In object-oriented terminology, a repair is an instance of a repair action. For instance, “adding a method call” is a repair action, “adding x.foo()” is a repair. A repair action is program- and domain-independent: it contains no domain-specific data such as variable names or literal values.

First, we present an approach to mine repair actions from patches written by developers. We find traces of human-based program fixing in software repositories (e.g. CVS, SVN or Git), where there are versioning transactions (a.k.a. commits) that only fix bugs. We use those “fix transactions” to mine AST-level repair actions such as adding a method call, changing the condition of an “if”, or deleting a catch block. Repair actions are extracted with the abstract differencing algorithm of Fluri et al. [7]. This results in repair models that are much bigger (41 and 173 repair actions) than those of related work, which considers at most a handful of repair actions.

Second, we propose to decorate the repair models with a probability distribution. Our intuition is that not all repair actions are equal and certain repair actions are more likely to fix bugs than others. We also take an empirical viewpoint to define those probability distributions: we learn them from software repositories. We show that those probability distributions are independent of the application domain.

Third, we demonstrate that our probabilistic repair models enable us to reason on the search space of automated program repair. The multinomial theorem [8, p.73] comes into play to analyze, from a theoretical viewpoint, the time to navigate into the search space of automated repair.

To sum up, our contributions are:
• An extensive analysis of the content of software versioning transactions: our analysis is novel both with respect to size (89,993 transactions of 14 open-source Java projects) and granularity (173 repair actions at the level of the AST).
• A probabilistic mathematical reasoning on automated repair showing that, depending on the viewpoint, one may quickly navigate – or not – into the search space of automated repair. Despite being theoretical, our results highlight an important property of the deep structure of this search space: the likely-correct repairs are highly concentrated in some parts of the search space, as stars are concentrated into galaxies in our universe.
This article is a revised version of a technical report [9]. It reads as follows. Section II describes how we map concrete versioning transactions to change actions. Section III discusses how to only select bug fix transactions. Section IV then shows that those change actions are actually repair actions under certain assumptions. Section V presents our theoretical analysis of the time to navigate in the search space of automated repair. Finally, we compare our results with the related work (in Section VII) and conclude.

II. DESCRIBING VERSIONING TRANSACTIONS WITH A CHANGE MODEL

In this section, we describe the contents of versioning transactions of 14 repositories of Java software. Previous empirical studies on versioning transactions [10], [11], [12], [13], [14] focus on metadata (e.g., authorship, commit text) or size metrics (number of changed files, number of hunks, etc.). On the contrary, we aim at describing versioning transactions in terms of contents: what kind of source code changes they contain: addition of method calls; modification of conditional statements; etc. There is previous work on the evolution of source code (e.g. [15], [16], [17]); however, to our knowledge, it is all at a coarser granularity compared to what we describe in this paper.

Note that other terms exist for referring to versioning transactions: “commits”, “changesets”, “revisions”. Those terms reflect the competition between versioning tools (e.g. Git uses “changeset” while SVN uses “revision”) and the difference between technical documentation and academic publications, which often use “transaction”. In this paper, we equate those terms and generally use the term “transaction”, as previous research does.

Software versioning repositories (managed by version control systems such as CVS, SVN or Git) store the source code changes made by developers during the software lifecycle. Version control systems (VCS) enable developers to query versioning transactions based on revision number, authorship, etc. For a given transaction, a VCS can produce a difference (“diff”) view, that is, a line-based difference view of source code. For instance, let us consider the following diff:

    while(i < MAX_VALUE){
        op.createPanel(i);
    -   i=i+1;
    +   i=i+2;
    }

The difference shows one line replaced by another one. However, one could also observe the changes at the abstract syntax tree (AST) level, rather than at the line level. In this case, the AST diff is an update of an assignment statement within a while loop. In this section, our research question is: what are versioning transactions made of at the abstract syntax tree level?

To answer this question, we have followed the following methodology. First, we have chosen an AST differencing algorithm from the literature. Then, we have constituted a dataset of software repositories to run the AST differencing algorithm on a large number of transactions. Finally, we have computed descriptive statistics on those AST-based differences. Let us first discuss the dataset.

A. Dataset

CVS-Vintage is a dataset of 14 repositories of open-source Java software [18]. The inclusion criterion of CVS-Vintage is that the repository mostly contains Java code and has been used in previously published academic work on mining software repositories and software evolution. This dataset covers different domains: desktop applications, server applications, libraries such as logging, compilation, etc. It includes the repositories of the following projects: ArgoUML, Columba, JBoss, JHotdraw, Log4j, org.eclipse.ui.workbench, Struts, Carol, Dnsjava, Jedit, Junit, org.eclipse.jdt.core, Scarab and Tomcat. In all, the dataset contains 89,993 versioning transactions, 62,179 of which have at least one modified Java file. Over time, 259,264 Java files have been revised (which makes a mean number of 4.2 Java files modified per transaction).

B. Abstract Syntax Tree Differencing

There are different propositions of AST differencing algorithms in the literature. Important ones include Raghavan et al.'s Dex [19], Neamtiu et al.'s AST matcher [20] and Fluri et al.'s ChangeDistiller [7]. For our empirical study on the contents of versioning transactions, we have selected the latter. ChangeDistiller [7] is a fine-grain AST differencing tool for Java. It expresses fine-granularity source code changes using a taxonomy of 41 source change types, such as “statement insertion” or “if conditional change”. ChangeDistiller handles changes that are specific to object-oriented elements, such as “field addition”. Fluri and colleagues have published an open-source, stable and reusable implementation of their algorithm for analyzing AST changes of Java code.

ChangeDistiller produces a set of “source code changes” for each pair of Java files from versioning transactions. For a source code change, the main output of ChangeDistiller is a “change type” (from the aforementioned taxonomy). However, for our analysis, we also consider another piece of information: we reformulate the output of ChangeDistiller so that each AST source code change is represented as a 2-value tuple scc = (ct, et), where ct is one of the 41 change types and et (for entity type) refers to the source code entity related to the change (for instance, a statement update may change a method call or an assignment). Since ChangeDistiller is an AST differencer, formatting transactions (such as changing the indentation) produce no AST-level change at all. The short listing above would be represented as one single AST change that is a statement update (ct) of an assignment (et).
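To make this representation concrete, here is a minimal sketch (hypothetical enum values and class names, not ChangeDistiller's actual API; the enums show only a tiny subset of the real taxonomies of 41 change types and 104 entity types):

    import java.util.HashMap;
    import java.util.Map;

    // A sketch of the 2-value tuple scc = (ct, et) described above.
    public class ChangeCounter {
        enum ChangeType { STATEMENT_INSERT, STATEMENT_UPDATE, STATEMENT_DELETE }
        enum EntityType { METHOD_INVOCATION, ASSIGNMENT, IF_STATEMENT }

        // An AST source code change is a pair of a change type and an entity type.
        record SourceCodeChange(ChangeType ct, EntityType et) {}

        public static void main(String[] args) {
            // The short diff listing above: one statement update of an assignment.
            SourceCodeChange scc =
                new SourceCodeChange(ChangeType.STATEMENT_UPDATE, EntityType.ASSIGNMENT);

            // Counting occurrences per tuple yields the per-action counts used later.
            Map<SourceCodeChange, Integer> counts = new HashMap<>();
            counts.merge(scc, 1, Integer::sum);
            System.out.println(counts);
        }
    }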
C. Change Models

All versioning transactions can be expressed within a “change model”. We define a change model as a set of “change actions”. For instance, the change model of standard Unix diff is composed of two change actions: line addition and line deletion. A change model represents a kind of feature space, and observations in that space can be valued. For instance, a standard Unix diff produces two integer values: the number of added lines and the number of deleted lines. ChangeDistiller enables us to define the following change models.

CT (Change Type) is composed of 41 features, the 41 change types of ChangeDistiller. For instance, one of these features is “Statement Insertion” (we may use the shortened name “Stmt_Insert”). CTET (Change Type Entity Type) is made of all valid combinations of the Cartesian product between change types and entity types. CTET is a refinement of CT: each repair action of CT is mapped to [1 . . . n] repair actions of CTET, hence the labels of the repair actions of CTET always contain the label of CT. There are 104 entity types and 41 change types, but many combinations are impossible by construction; as a result, CTET contains 173 features. For instance, since there is one entity type representing assignments, one feature of CTET is “statement insertion of an assignment”.

In the rest of this paper, we express versioning transactions within those two change models. There is no better change model per se: they describe versioning transactions at different granularities. We will see later that, depending on the perspective, both change models have pros and cons.

D. Measures for Change Actions

We define two measures for a change action i: αi is the absolute number of occurrences of change action i in a dataset; χi is the probability of observing change action i, as given by its frequency over all changes (χi = αi / Σj αj). For instance, let us consider feature space CT and the change action “statement insertion” (StmtIns). If there are αStmtIns = 12 source code changes related to statement insertion among 100, the probability of observing a statement insertion is χStmtIns = 12%.
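As a minimal sketch (with hypothetical action names, not the paper's tooling), the two measures can be computed from a list of observed change actions as follows:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Computes alpha_i (absolute counts) and chi_i (relative frequencies)
    // for a list of observed change actions, as defined in Section II-D.
    public class ChangeActionMeasures {
        public static void main(String[] args) {
            List<String> observed = List.of(
                "Stmt_Insert", "Stmt_Insert", "Stmt_Update", "Stmt_Delete", "Stmt_Insert");

            Map<String, Long> alpha = new HashMap<>();
            for (String action : observed) {
                alpha.merge(action, 1L, Long::sum);  // alpha_i: absolute count
            }

            double total = observed.size();
            alpha.forEach((action, count) ->
                // chi_i = alpha_i / sum_j alpha_j
                System.out.printf("%s: alpha=%d, chi=%.2f%n", action, count, count / total));
        }
    }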
E. Empirical Results

We have run ChangeDistiller over the 62,179 Java transactions of our dataset, resulting in 1,196,385 AST-level changes for both change models. For change model CT, which is rather coarse-grained, the three most common changes are “statement insert” (28% of all changes), “statement delete” (23% of all changes) and “statement update” (14% of all changes). Certain changes are rare; for instance, “addition of class derivability” (adding the keyword final to the class declaration) appears only 99 times (0.0008% of all changes). The complete results are given in the companion technical report [21].

Table I presents the top 20 change actions and the associated measures for change model CTET. The comprehensive table for all 173 change actions is given in the companion technical report [21]. In Table I, one sees that inserting method invocations as statements is the most common change, which makes sense for open-source object-oriented software that is growing.

Table I. The abundance of AST-level changes of change model CTET over 62,179 versioning transactions. The probability χi is the relative frequency over all changes (e.g. 6.9% of source code changes are insertions of method invocations).

    Change Action αi                                        αi          Prob. χi
    Statement insert of method invocation                   83,046      6.9%
    Statement insert of if statement                        79,166      6.6%
    Statement update of method invocation                   76,023      6.4%
    Statement delete of method invocation                   65,357      5.5%
    Statement delete of if statement                        59,336      5%
    Statement insert of variable declaration statement      54,951      4.6%
    Statement insert of assignment                          49,222      4.1%
    Additional functionality of method                      49,192      4.1%
    Statement delete of variable declaration statement      44,519      3.7%
    Statement update of variable declaration statement      41,838      3.5%
    Statement delete of assignment                          41,281      3.5%
    Condition expression change of if statement             40,415      3.4%
    Statement update of assignment                          34,802      2.9%
    Addition of attribute                                   29,328      2.5%
    Removal of method                                       26,172      2.2%
    Statement insert of return statement                    24,184      2%
    Statement parent change of method invocation            21,010      1.8%
    Statement delete of return statement                    20,880      1.7%
    Insert of else statement                                20,227      1.7%
    Deletion of else statement                              17,197      1.4%
    Total                                                   1,196,385

Let us now compare the results over change models CT and CTET. One can see that the statement insertions of CT mostly consist of inserting a method invocation (6.9%), inserting an “if” conditional (6.6%), and inserting a new variable declaration (4.6%). Since change model CTET is at a finer granularity, there are fewer observations per feature: both αi and χi are lower. The probability distribution (χi) over the change model is less sharp (smaller values) since the feature space is bigger. A high value of χi means that we have a change action that can frequently be found in real data: those change actions have a high “coverage” of the data. CTET features describe modifications of software at a finer granularity. The differences between those two change models illustrate the tension between high coverage and fine analysis granularity.

F. Project-independence of Change Models

An important question is whether the probability distribution (composed of all χi) of Table I is generalizable to Java software or not. That is, do developers evolve software in a similar manner over different projects? To answer this question, we have computed the metric values not for the whole dataset, but per project. In other words, we have computed the frequency of change actions in 14 software repositories. We would like to see that the values do not vary between projects, which would mean that the probability distributions over change actions are project-independent. Since our dataset covers many different domains, having high correlation values would be a strong point towards generalization.

As correlation metric, we use Spearman's ρ. We choose Spearman's ρ because it is non-parametric. In our case, what matters is to know whether the importance of change actions is similar (for instance, that “statement update” is more common than “condition expression change”). Contrary to parametric correlation metrics (e.g. Pearson), Spearman's ρ only focuses on the ordering between change actions, which is what we are interested in.
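For illustration, here is a small self-contained sketch of Spearman's ρ, computed as the Pearson correlation over ranks. The data is hypothetical, and tie handling (average ranks) is omitted for brevity:

    import java.util.Arrays;

    // A minimal sketch of Spearman's rho between two frequency vectors,
    // computed as the Pearson correlation of their ranks.
    public class SpearmanRho {
        static double[] ranks(double[] v) {
            Integer[] idx = new Integer[v.length];
            for (int i = 0; i < v.length; i++) idx[i] = i;
            Arrays.sort(idx, (a, b) -> Double.compare(v[a], v[b]));
            double[] r = new double[v.length];
            for (int rank = 0; rank < v.length; rank++) r[idx[rank]] = rank + 1;
            return r;
        }

        static double pearson(double[] x, double[] y) {
            double mx = Arrays.stream(x).average().orElse(0);
            double my = Arrays.stream(y).average().orElse(0);
            double cov = 0, vx = 0, vy = 0;
            for (int i = 0; i < x.length; i++) {
                cov += (x[i] - mx) * (y[i] - my);
                vx += (x[i] - mx) * (x[i] - mx);
                vy += (y[i] - my) * (y[i] - my);
            }
            return cov / Math.sqrt(vx * vy);
        }

        public static void main(String[] args) {
            // Hypothetical chi_i vectors of two projects over the same change actions.
            double[] projectA = {0.29, 0.23, 0.15, 0.06, 0.05};
            double[] projectB = {0.31, 0.20, 0.17, 0.07, 0.04};
            System.out.println(pearson(ranks(projectA), ranks(projectB))); // 1.0 here: identical orderings
        }
    }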
We compute the Spearman correlation values between the probability distributions of all pairs of projects of our dataset (i.e. 14×13/2 = 91 combinations). One correlation value takes as input two vectors representing the probability distributions (of size 41 for change model CT and 173 for change model CTET).

The critical value of Spearman's ρ depends on the size of the vectors being compared and on the required confidence level. At confidence level α = 0.01, the critical value is 0.364 for change model CT (41 features) and 0.301 for change model CTET¹ (values from statistical tables; we used [22]). If the correlation is higher than the critical value, the null hypothesis (a random distribution) is rejected.

For instance, in change model CT, the Spearman correlation between Columba and ArgoUML is 0.94, which is much higher than the critical value (0.364). This means that the correlation is statistically significant at the α = 0.01 confidence level. The high value shows that those two projects were evolved in a very similar manner. All values are given in the companion technical report [21]. Figure 1 gives the distribution of Spearman correlation values for change model CT. 75% of the pairs of projects have a Spearman correlation higher than 0.85². For all pairs of projects, in change model CT, Spearman's ρ is much higher than the critical value. This shows that the likelihood of observing a change action is globally independent of the project used for computing it.

Figure 1. Histogram of the Spearman Correlation between Change Action Frequencies of Change Model CT Mined on Different Projects. There is no outlier: all values are higher than 0.75, meaning that the importance of change actions is project-independent. (X-axis: Spearman correlation value, from 0.75 to 1.00; Y-axis: number of project pairs.)

To understand the meaning of those correlation values, let us now analyze in detail the lowest and highest correlation values. The highest correlation value is 0.98 and it corresponds to the project pair Eclipse-Workbench and Log4j. In this case, 33 out of 41 change actions have a rank difference between 0 and 3. The lowest correlation value is 0.80 and it corresponds to the Spearman correlation between projects Tomcat and Carol. In this case, the maximum rank change is 23 (for change action “Removing Method Overridability”, i.e. removing final for methods). In total, between Tomcat and Carol, there are six change actions for which the importance changes by at least 10 ranks. Those high rank changes trigger the 0.80 Spearman correlation. However, for common changes, it turns out that their ranks do not change at all (e.g. for “Statement Insert”, “Statement Update”, etc.).

We have also computed the correlation between projects within change model CTET (see the companion technical report [21]). The values are all above 0.301, the critical value for vectors of size 173 at the α = 0.01 confidence level, showing that in change model CTET the change action importance is project-independent as well, in a statistically significant manner. Despite being high, we note that the values are slightly lower than for change model CT; this is due to the fact that Spearman's ρ generally decreases with the vector size (as shown by the statistical table).

G. Recapitulation

To sum up, we provide the empirical importance of 173 source code change actions; we show that the importance of change actions is project-independent; and we show that the probability distribution of change actions is very unbalanced. Our results are based on the analysis of 62,179 transactions. To our knowledge, those results have never been published before, given this analysis granularity and the scale of the empirical study.

The threats to the validity of our results are of two kinds. From the internal validity viewpoint, a bug somewhere in the implementation may invalidate our results. From the external validity viewpoint, there is a risk that our dataset of 14 projects is not representative of Java software as a whole, even if the projects are written by different persons from different organizations in different application domains. Also, our results may not generalize to other programming languages.

III. SLICING TRANSACTIONS TO FOCUS ON BUG FIXES

In Section II, we have defined and discussed two measures per change action i: αi and χi. For instance, χStmtInsert gives the frequency of statement insertions. Those measures implicitly depend on a transaction bag to be computed. So far we have considered all versioning transactions of the repository. For defining a repair space, we need to apply those two measures on a transaction bag representative of software repair. How should we slice transactions to focus on bug fixes?

An intuitive method, which we will use as baseline, is to rely on the commit message (by slicing only those transactions that contain a given word or expression related to bug fixing). Before going further, let us clarify the goal of the classification: the goal is to have a good approximation of the probability distribution of change actions for software repair³. Later in the paper, we will define a mathematical criterion to tell whether one approximation is better than another.

¹ Most statistical tables of Spearman's ρ stop at N=60; however, since the critical value decreases with N, if ρ > 0.301 the null hypothesis is still rejected.
² Spearman correlation is based on ranks; a value of 0.85 means either that most change actions are ranked similarly or that a single change action has a really different rank.
³ Note that our goal is not to have a good classification in terms of precision or recall.
A. Slicing Based on the Commit Message

When committing source code changes, developers may write a comment/message explaining the changes they have made. For instance, when a transaction is related to a bug fix, they may write a comment referencing the bug report or describing the fix.

To identify transaction bags related to bug fixing, previous work focused on the content of the commit text: whether it contains a bug identifier, or whether it contains some keywords such as “fix” (see [23] for a discussion of those approaches). To identify bug fix patterns, Pan et al. [24] select transactions containing at least one occurrence of “bug”, “fix” or “patch”. We call this transaction bag BFP. We will compute αi and χi based on this definition.

Such a transaction bag makes a strong assumption on the development process and the developers' behavior: it assumes that developers generally put syntactic features in commit texts enabling one to recognize repair transactions, which is not really true in practice [23], [25], [26].

B. Slicing Based on the Change Size in Terms of Number of AST Changes

We may also define fixing transaction bags based on their “AST diffs”, i.e. based on the type and number of change actions that a versioning transaction contains. This transaction bag is called N-SC (for N Abstract Syntactic Changes); e.g. 5-SC represents the bag of transactions containing five AST-level source code changes.

In particular, we assume that small transactions are very likely to only contain a bug fix and unlikely to contain a new feature. Repair actions may be those that appear atomically in transactions (i.e. the transaction only contains one AST-level source code change). “1-SC” (composed of all transactions of one single AST change) is the transaction bag that embodies this assumption. Let us verify this assumption.
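As an illustration of this slicing, here is a minimal sketch (with a hypothetical Transaction type standing in for real ChangeDistiller output) that builds an N-SC bag:

    import java.util.List;
    import java.util.stream.Collectors;

    // A sketch of N-SC slicing: keep only the transactions whose AST diff
    // contains exactly n source code changes (n = 1 gives the 1-SC bag).
    public class TransactionSlicer {
        // Hypothetical representation: a transaction is just its list of AST changes.
        record Transaction(String id, List<String> astChanges) {}

        static List<Transaction> sliceNSC(List<Transaction> all, int n) {
            return all.stream()
                      .filter(t -> t.astChanges().size() == n)
                      .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Transaction> all = List.of(
                new Transaction("r101", List.of("Statement update of assignment")),
                new Transaction("r102", List.of("Statement insert of if statement",
                                                "Statement insert of method invocation")));
            System.out.println(sliceNSC(all, 1)); // only r101 is in the 1-SC bag
        }
    }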
C. Do Small Versioning Transactions Fix Bugs?

1) Experiment: We set up a study to determine whether small transactions correspond to bug fix changes. We define as small those transactions that introduce only one AST change.

2) Overview: The study consists of the manual inspection and evaluation of source code changes of versioning transactions. First, we randomly take a sample set of transactions from our dataset (see II-A). Then, we create an “evaluation item” for each pair of files of the sample set (the file before and after the revision). An evaluation item contains data to help the raters decide whether a transaction is a bug fix or not: the syntactic line-based differencing between the revision pair of the transaction (it helps to visualize the changes), the AST change between them (type and location – e.g. insertion of method invocation at line 42) and the commit message associated with the transaction.

3) Sampling Versioning Transactions: We use stratified sampling to randomly select 1-SC versioning transactions from the software history of 16 open-source projects (mostly from [18]). Recall that a “1-SC” versioning transaction only introduces one AST change. The stratification consists of picking 10 items (if 10 are found) per project. In total, the sample set contains 144 transactions sampled over the 6,953 1-SC transactions present in our dataset.

4) Evaluation Procedure: The 144 evaluation items were evaluated by three people, called the raters: the paper authors and a colleague, member of the faculty at the University of Bordeaux. During the evaluation, each item (see III-C2) is presented to a rater, one by one. The rater has to answer the question Is this a bug fix change?. The possible answers are a) Yes, the change is a bug fix, b) No, the change is not a bug fix and c) I don't know. Optionally, the rater can write a comment to explain his decision.

5) Experiment Results:

a) Level of Agreement: The three raters fully agreed that 74 of 144 (51.4%) transactions from the sample are bug fixes. If we consider the majority (at least 2 raters out of 3 agree), 95 of 144 transactions (66%) were considered as bug fix transactions. The complete rating data is given in the companion technical report [21].

Table II presents the number of agreements. The Full Agreement column shows the number of transactions for which all raters agreed. For example, the three raters agreed that there is a bug fix in 74/144 transactions. The Majority column shows the number of transactions for which two out of three raters agree. To sum up, small transactions predominantly consist of bug fixes.

Table II. The results of the manual inspection of 144 transactions by three raters.

                                    Full Agreement (3/3)    Majority (2/3)
    Transaction is a Bug Fix        74                      21
    Transaction is not a Bug Fix    22                      23
    I don't know                    0                       1

Among the transactions with full agreement on the absence of bug fix changes, the most common case found was the addition of a method. This change indeed consists of one single AST change (the addition of a “method” node). Interestingly, in some cases, adding a method was indeed a bug fix, when polymorphism is used: the new method fixes the bug by replacing the super implementation.

b) Statistics: Let us assume that pi measures the degree of agreement for a single item (in our case pi ∈ {1/3, 2/3, 3/3}). The overall agreement P̄ [27] is the average over the pi. We have P̄ = 0.77. Using the scale introduced by [28], this value means there is a Substantial overall agreement between the raters, close to an Almost perfect agreement.
The coefficient κ (Kappa) [27], [29] measures the confidence in the agreement level by removing the chance factor⁴. The κ degree of agreement in our study is 0.517, a value distant from the critical value (which is 0). The null hypothesis is rejected: the observed agreement is not due to chance.

6) Conclusion: The manual inspection of 144 versioning transactions shows that there is a relation between one-AST-change transactions and bug fixing. Consequently, we can use the 1-SC transaction bag to estimate the probability of change actions for software repair.

⁴ Some degree of agreement is expected even when the ratings are purely random [27], [29].

IV. FROM CHANGE MODELS TO REPAIR MODELS

This section presents how we can transform a “change model” into a “repair model” usable for automated software repair. As discussed in Section II, a change model describes all types of source code changes that occur during software evolution. In contrast, we define a “repair action” as a change action that often occurs for repairing software, i.e. that is often used for fixing bugs.

By construction, a repair model is equal to a subset of a change model in terms of features. But more than the number of features, our intuition is that the probability distribution over the feature space varies between change models and repair models. For instance, one might expect that changing the initialization of a variable has a higher probability in a repair model. Hence, the difference between a change model and a repair model is a matter of perspective. Since we are interested in automated program repair, we now concentrate on the “repair” perspective, hence we use the terms “repair model” and “repair action” in the rest of the paper.

A. Methodology

We have applied the same methodology as in Section II. We have computed the probability distributions of repair models CT and CTET based on different definitions of fix transactions, i.e. we have computed αi and χi based on the transaction bags discussed in Section III: ALL transactions, N-SC and BFP. For N-SC, we choose four values of N: 1-SC, 5-SC, 10-SC and 20-SC. Transactions larger than 20-SC have almost the same topology of changes as ALL, as we will show later (see Section IV-C2).

The main question we ask is whether those different definitions of “repair transactions” yield different topologies for repair models.

B. Empirical Results

Table III presents the top 10 change types of repair model CT associated with their probability χi for different versioning transaction bags. The complete table for all repair actions is given in the companion technical report [21]. Overall, the distribution of repair actions over real bug fix data is very unbalanced: the probability of observing a single repair action goes from more than 30% to 0.000x%. We observe a Pareto effect: the top 10 repair actions account for more than 92% of the cumulative probability distribution.

Furthermore, we have made the following observations from the experiment results.

First, the order of repair actions (i.e. their likelihood of contributing to bug repair) varies significantly depending on the transaction bag used for computing the probability distribution. For instance, a statement insertion is #1 when we consider all transactions (column ALL), but only #4 when considering transactions with a single AST change (column 1-SC). In this case, the probability of observing a statement insertion varies from 29% to 12%.

Second, even when the orders obtained from two different transaction bags resemble each other, such as for ALL and 20-SC, the probability distribution still varies: for instance, χStmt_Insert is 29% for transaction bag ALL, but jumps to 33% for transaction bag 20-SC.

Third, the probability distributions for transaction bags ALL and BFP are close: repair actions have similar probability values. As a consequence, transaction bag BFP may simply be a random subset of the ALL transactions. All those observations also hold for repair model CTET; the complete table is given in the companion technical report [21].

Those results are a first answer to our question: different definitions of “repair transactions” yield different probability distributions over a repair model.

C. Discussion

We have shown that one can base repair models on different methods to extract repair transaction bags. There are certain analytical arguments for or against those different repair space topologies. For instance, selecting transactions based on the commit text makes a very strong assumption on the quality of software repository data, but ensures that the selected transactions contain at least one actual repair. Alternatively, small transactions indicate that they focus on a single concern and are likely to be a repair. However, small transactions may only see the tip of the fix iceberg (large transactions may be bug fixing as well), resulting in a distorted probability distribution over the repair space. At the experimental level, the threats to validity are the same as for Section II.

1) Correlation between Transaction Bags: To what extent are the 6 transaction bags different? We have calculated the Spearman correlation values between the probabilities over repair actions for all pairs of distributions. In particular, we would like to know whether the heuristics yield significantly different results compared to all transactions (transaction bag ALL). Table IV presents these correlation values.

Table IV. The Spearman correlation values between repair actions of transaction bag “ALL” and those from the transaction bags built with 5 different heuristics.

            1-SC    5-SC    10-SC   20-SC   BFP
    ALL     0.68    0.95    0.97    0.98    0.99
Table III. Top 10 change types of change model CT and their probability χi for different transaction bags. The different heuristics used to compute the fix transaction bags have a significant impact on both the ranking and the probabilities.

    ALL                   BFP                   1-SC                   5-SC                 10-SC                 20-SC
    Stmt_Insert-29%       Stmt_Insert-32%       Stmt_Upd-38%           Stmt_Insert-28%      Stmt_Insert-31%       Stmt_Insert-33%
    Stmt_Del-23%          Stmt_Del-23%          Add_Funct-14%          Stmt_Upd-24%         Stmt_Upd-19%          Stmt_Del-16%
    Stmt_Upd-15%          Stmt_Upd-12%          Cond_Change-13%        Stmt_Del-11%         Stmt_Del-14%          Stmt_Upd-16%
    Param_Change-6%       Param_Change-7%       Stmt_Insert-12%        Add_Funct-10%        Add_Funct-8%          Param_Change-7%
    Order_Change-5%       Order_Change-6%       Stmt_Del-6%            Cond_Change-7%       Param_Change-7%       Add_Funct-7%
    Add_Funct-4%          Add_Funct-4%          Rem_Funct-5%           Param_Change-5%      Cond_Change-6%        Cond_Change-5%
    Cond_Change-4%        Cond_Change-3%        Add_Obj_St-3%          Add_Obj_St-3%        Add_Obj_St-3%         Add_Obj_St-3%
    Add_Obj_St-2%         Add_Obj_St-2%         Order_Change-2%        Rem_Funct-3%         Rem_Funct-2%          Order_Change-3%
    Rem_Funct-2%          Alt_Part_Insert-2%    Rem_Obj_St-2%          Order_Change-1%      Order_Change-2%       Rem_Funct-2%
    Alt_Part_Insert-2%    Rem_Funct-2%          Inc_Access_Change-1%   Rem_Obj_St-1%        Alt_Part_Insert-1%    Alt_Part_Insert-2%

For instance, the Spearman correlation value between ALL and 1-SC is 0.68. This value shows, as we have noted before, that there is not a strong correlation between the orders of the repair actions of both transaction bags. In other words, heuristic 1-SC indeed focuses on a specific kind of transactions.

On the contrary, the value between ALL and BFP is 0.99. This means the orders of the frequencies of repair actions are almost identical. Moreover, Table IV shows that the correlation values between N-SC (N = 1, 5, 10 and 20) and ALL tend to 1 (i.e. perfect alignment) when N grows. This validates the intuition that the size of transactions (in number of AST changes) is a good predictor for focusing on transactions that are different in nature from normal software evolution. Crossing this result with the results of our empirical study of 144 1-SC transactions, there is some evidence that by concentrating on small transactions, we probably have a good approximation of repair transactions.

2) Skewness of Probability Distributions: Figure 2 shows the probability of the most frequent repair actions of repair model CTET according to the transaction size (in number of AST changes). For instance, the probability of updating a method invocation decreases from 15% in 1-SC transactions to 7% in all transactions. In particular, we observe that: a) for transactions with 1 AST change, the change probabilities are more unbalanced (i.e. less uniform than for all transactions); there are 5 changes that are much more frequent than the rest; b) for transactions with more than 10 AST changes, the probabilities of the top changes are less dispersed and all smaller than 0.9%; c) the probabilities of those 5 most frequent changes decrease when the transaction size grows. This is a further piece of evidence that the N-SC heuristics provide a focus on transactions that are of a specific nature, different from the bulk of software evolution.

Figure 2. Probabilities of the 12 most frequent AST changes for 11 different transaction bags: 10 that include transactions with i AST changes, with i = 1...10, and the ALL transaction bag. (Y-axis: AST change probability; X-axis: transaction size in AST changes. Plotted changes: statement update of method invocation; additional functionality of method; condition change of if; statement update of variable declaration; statement insert of method invocation; statement update of assignment; statement update of return; removal of method; statement delete of method invocation; addition of object state (attribute); statement insert of assignment; removal of object state (attribute).)

3) Conclusion: Those results on repair actions are especially important for automated software repair: we think it would be fruitful to devise automated repair approaches that “imitate” how human developers fix programs. To us, using the probabilistic repair models described in this section is a first step in that direction.

V. AUTOMATED ANALYSIS OF THE TIME TO NAVIGATE INTO THE SEARCH SPACE OF AUTOMATED PROGRAM REPAIR

This section discusses the nature of the search space of automated program repair. We show that the two repair models defined in Section IV allow mathematical reasoning. We present a way of comparing repair models and their probability distributions based on data from software repositories.

A. Decomposing the Repair Search Space

The search space of automated program repair consists of all explorable bug fixes for a given program and a given bug (whether compilable, executable or correct). If one bounds the size of the repair (e.g. all patches of at most 40 lines), the search space size is finite. A naive search space is huge, because even in a bounded-size scenario, there is a myriad of elements to be added, removed or modified: statements, variables, operators, literals.

A key point of automated program repair research consists of decreasing the time to navigate the repair search space. There are many ways to decrease this time. For instance, fault localization enables the search to first focus on places where fixes are likely to be successful. This and other components of a repair process may participate in an efficient navigation. One of them is the “shaping” of fixes.
Informally, the shape of a bug fix is a kind of patch. For instance, the repair shape of adding an “if” throwing an exception for signaling an incorrect input consists of inserting an if and inserting a throw. The concept of “repair shape” is equivalent to what Wei et al. [3] call a “fix schema” and Weimer et al. [2] a “mutation operator”.

In this paper, we define a “repair shape” as an unordered tuple of repair actions (from a set of repair actions called R)⁵. In the if/throw example aforementioned, in repair space CTET, the repair shape of this bug fix consists of two repair actions: statement insertion of “if” and statement insertion of “throw”. The shaping space consists of all possible combinations of repair actions.

The instantiation of a repair shape is what we call fix synthesis. The complexity of the synthesis depends on the repair actions of the shaping space. For instance, the repair actions of Weimer et al. [2] (insertion, deletion, replacement) have an “easy” and bounded synthesis space (random picking in the code base).

To sum up, we consider that the repair search space can be viewed as the combination of the fault localization space (where the repair is likely to be successful), the shaping space (which kind of repair may be applied) and the synthesis space (assigning concrete statements and values to the chosen repair actions). The search space can then be loosely defined as the Cartesian product of those spaces, and its size then reads:

    |FAULT LOCALIZATION| × |SHAPE| × |SYNTHESIS|

In this paper, we concentrate on the shaping part of the space. If one can find efficient strategies to navigate through the shaping space, this would contribute to efficiently navigating through the repair search space as a whole, thanks to the combination.

B. Mathematical Analysis Over Repair Models

To analyze the shaping space, we now present a mathematical analysis of our probabilistic repair models. So far, we have two repair models CT and CTET (see Section IV) and different ways to parametrize them.

According to our probabilistic repair model, a good navigation strategy consists of concentrating on likely repairs first: the repair shape is more likely to be composed of frequent repair actions. That is, a repair shape of size n is predicted by drawing n repair actions according to the probability distribution over the repair model. Under the pessimistic assumption that repair actions are independent⁶, our repair model makes it possible to know the exact median number of attempts N that is needed to find a given repair shape R (demonstration given in the companion technical report [21]):

    N = \min k \quad \text{such that} \quad \sum_{i=1}^{k} p\,(1-p)^{i-1} \geq 0.5 \qquad (1)

    \text{with} \quad p = \frac{n!}{\prod_j e_j!} \times \prod_{r \in R} P(r)

where e_j is the number of occurrences of repair action r_j inside R and n is the size of R.

For instance, the repair of revision 1.2 of Eclipse's CheckedTreeSelectionDialog⁷ consists of two inserted statements. Equation 1 tells us that in repair model CT, we would need a median of 12 attempts to find the correct repair shape for this real bug.

Having only a repair shape is far from having a real fix. However, the concept of repair shape, associated with the mathematical formula analyzing the time to navigate the repair space, is key to comparing ways to build a probability distribution over repair models.

⁵ Since a bug fix may contain several instances of the same repair action (e.g. several statement insertions), the repair shape may contain the same repair action several times.
⁶ Equation (1) holds if and only if we consider them as independent. If they are not, it means that we under-estimate the deep structure of the repair space, hence we over-approximate the time to navigate the space to find the correct shape. In other words, even if the repair actions are not independent (which is likely for some of them), our conclusions are sound.
⁷ “Fix for 19346 integrating changes from Sebastian Davids” http://goo.gl/d4OSi
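To make Equation 1 concrete, here is a minimal sketch (with hypothetical probability values) that computes the shape probability p via the multinomial coefficient and derives the median number of attempts N in closed form:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A sketch of Equation 1: the probability p of drawing a given repair
    // shape, and the median number of attempts N needed to draw it (the
    // median of a geometric distribution with success probability p).
    public class ShapingTime {

        // p = n! / (prod_j e_j!) * prod_{r in R} P(r)  (multinomial probability)
        static double shapeProbability(List<String> shape, Map<String, Double> P) {
            Map<String, Integer> occurrences = new HashMap<>();
            double product = 1.0;
            for (String action : shape) {
                occurrences.merge(action, 1, Integer::sum);
                product *= P.getOrDefault(action, 0.0); // 0 for actions never seen in training
            }
            double coefficient = factorial(shape.size());
            for (int e : occurrences.values()) coefficient /= factorial(e);
            return coefficient * product;
        }

        static double factorial(int n) {
            double f = 1;
            for (int i = 2; i <= n; i++) f *= i;
            return f;
        }

        // N = min k such that sum_{i=1..k} p(1-p)^{i-1} >= 0.5,
        // i.e. the smallest k with 1-(1-p)^k >= 0.5: k = ceil(ln 0.5 / ln(1-p)).
        static long medianAttempts(double p) {
            return (long) Math.ceil(Math.log(0.5) / Math.log(1.0 - p));
        }

        public static void main(String[] args) {
            // A shape made of two statement insertions, with P(Stmt_Insert) = 0.28
            // (a hypothetical value in the spirit of Table III, column 5-SC).
            Map<String, Double> P = Map.of("Stmt_Insert", 0.28, "Stmt_Upd", 0.24);
            double p = shapeProbability(List.of("Stmt_Insert", "Stmt_Insert"), P);
            System.out.println(medianAttempts(p)); // 9 attempts for p = 1 * 0.28^2
        }
    }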
C. Comparing Probability Distributions Over Repair Actions From Versioning History

We have seen in Section V-B that the time for finding correct repair shapes depends on a probability distribution over repair actions. The probability distribution P is crucial for minimizing the search space traversal: a good distribution P results in concentrating on likely repairs first, i.e. the repair space is traversed in a guided way, by first exploring the parts of the space that are likely to be more fruitful. This poses two important questions: first, how to set up a probability distribution over repair actions; second, how to compare the efficiency of different probability distributions to find good repair shapes.

To compute a probability distribution over repair actions, we propose to learn it from software repositories. For instance, if many bug fixes are made of inserted method calls, the probability of applying such a repair action should be high. Despite our single method (learning the probability distributions from software repositories), we have shown in Section IV that there is no single way to compute them (they depend on different heuristics). To compare different distributions against each other, we set up the following process.

One first selects bug repair transactions in the versioning history. Then, for each bug repair transaction, one extracts its repair shape (as a set of repair actions of a repair model). Then one computes the time that a maximum likelihood approach would need to find this repair shape, using Equation 1.

Let us assume two probability distributions P1 and P2 over a repair model and four fixes (F1 . . . F4) consisting of two repair actions and observed in a repository. Let us assume that the times (in number of attempts) to find the exact shapes of F1 . . . F4 are (5, 26, 9, 12) according to P1 and (25, 137, 31, 45) according to P2. In this case, it is clear that the probability distribution P1 enables us to find the correct repair shapes faster (the shaping times for P1 are lower). Beyond this example, by applying the same process over real bug repairs found in a software repository, our process enables us to select the best probability distribution for a given repair model.

Since Equation 1 is parametrized by a number of repair actions, we instantiate this process for all bug repair transactions of a certain size (in terms of AST changes). This means that our process determines the best probability distribution for a given bug fix shape size.

D. Cross-Validation

We compute different probability distributions Px from transaction bags found in repositories. We evaluate the time to find the shape of real fixes that are also found in repositories, which may bias the results. To overcome this problem, we use cross-validation: we always use different sets of transactions to estimate P and to calculate the average number of attempts required to find a correct repair shape. Using cross-validation reduces the risk of overfitting.

Since we have a dataset of 14 independent software repositories, we use this dataset structure for cross-validation. We take one repository for extracting repair shapes and the remaining 13 projects to calibrate the repair model (i.e. to compute the probability distributions). We repeat the process 14 times, by testing each of the 14 projects separately. In other words, we try to predict real repair shapes found in one repository from data learned on the other software projects.

Figure 3 sums up this algorithm to compare fix shaping strategies. From a bag of transactions C, function split creates a set of training transactions and a set of evaluation transactions. Then, one trains a repair model (with function train_model); for repair models CT and CTET this means computing a probability distribution on a specific bag of transactions. Finally, for each repair of the evaluation data, one computes its “repairability” according to the repair model (with Equation 1). The algorithm returns the median repairability, i.e. the median number of attempts required to repair the test data.

    Input: C                             ⊲ A bag of transactions
    Output: The median number of attempts to find good repair shapes
    begin
      Ω ← {}                             ⊲ Result set
      T, E ← split(C)                    ⊲ Cross-validation: split C into Training and Evaluation data
      M ← train_model(T)                 ⊲ Train a repair model (e.g. compute a probability distribution over repair actions)
      for s ∈ E do                       ⊲ For all repairs observed in the repository
        n ← compute_repairability(s, M)  ⊲ How long to find this repair according to the repair model
        Ω ← Ω ∪ {n}                      ⊲ Store the “repairability” value of s
      return median(Ω)                   ⊲ Return the median number of attempts to find the repair shapes

Figure 3. An Algorithm to Compare Fix Shaping Strategies. There may be different flavors of the functions split, train_model and compute_repairability.
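Continuing the sketch above (and reusing the hypothetical ShapingTime class), the evaluation step of Figure 3 could look as follows; this is an illustrative simplification, not the authors' implementation:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A sketch of Figure 3's train/evaluate steps. The shape lists are
    // hypothetical simplifications of what is mined from the repositories.
    public class CrossValidation {
        // Learn chi_i (relative frequencies) from the training shapes.
        static Map<String, Double> trainDistribution(List<List<String>> trainingShapes) {
            Map<String, Double> counts = new HashMap<>();
            double total = 0;
            for (List<String> shape : trainingShapes)
                for (String action : shape) {
                    counts.merge(action, 1.0, Double::sum);
                    total++;
                }
            final double t = total;
            counts.replaceAll((action, c) -> c / t);
            return counts;
        }

        // Median number of attempts to find the evaluation shapes (Equation 1).
        static long medianRepairability(List<List<String>> evalShapes, Map<String, Double> P) {
            List<Long> attempts = new ArrayList<>();
            for (List<String> shape : evalShapes) {
                double p = ShapingTime.shapeProbability(shape, P);
                // p == 0 means the shape is unreachable: the "infinity" entries of Table V.
                attempts.add(p <= 0 ? Long.MAX_VALUE : ShapingTime.medianAttempts(p));
            }
            Collections.sort(attempts);
            return attempts.get(attempts.size() / 2);
        }
    }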

E. Empirical Results

We run our fix shaping process on our dataset of 14 repositories of Java software, considering two repair models: CT and CTET (see Section II-C). We remind the reader that CT consists of 41 repair actions and CTET of 173 repair actions. For both repair models, we have tested the different heuristics of Section IV-A to compute the median repair time: all transactions (ALL); one AST change (1-SC); 5 AST changes (5-SC); 10 AST changes (10-SC); 20 AST changes (20-SC); transactions with commit text containing “bug”, “fix” or “patch” (BFP); and a baseline of a uniform distribution over the repair model (EQP, for equally-distributed probability).
500
500
20000 12000020000 EQ EQ
EQ
1-SC 1-SC
1-SC
5-SC 5-SC
400

400
5-SC 10-SC10-SC

Median # repair attempts


10-SC 20-SC20-SC
Median # repair attempts

300
100000
300 20-SC 15000 BFP BFP
ALL ALL

Median # repair attempts


BFP 200
15000
ALL
200
80000 100

10000
100 0
1.0 2.0 3.0 4.0

10000 60000
0
1.0 2.0 3.0 4.0

5000

40000

5000
0
20000 0 1 2 3 4 5 6 7 8 9

Repair size (In abstract syntactic changes)


0 0
0 1 2 3 4 5 6 7 8 9 1 2 3 4

Repair size (in # AST changes) Repair size (in # AST changes)

Figure 4. The repairability of small transactions in repair model CT. Certain Figure 5. The repairability of small transactions in repair space CTET. There
probability distributions yield a median repair time that is much lower than is no way to find the repair shapes of transactions larger than 4 AST code
others. changes.

text containing “bug”, “fix”, “patch” (BFP); a baseline of a us confidence that one could apply our approach to any new
uniform distribution over the repair model (EQP for equally- project using the probability distributions mined in our dataset.
distributed probability). Furthermore, finding the correct repair shapes of larger
We extracted all bug fix transactions with less than 8 transactions (up to 8 AST changes) has an order of magnitude
AST changes from our dataset. For instance, the versioning of 104 and not more. Theoretically, for a given fix shape
repository of DNSJava contains 165 transactions of 1 repair of n AST changes, the size of the repair model is the
action, 139 transactions of size 2, 71 transactions of size 3, number of repair actions of the model at the power of n
etc. The biggest number of available repair tests are in jdt.core (e.g. |CT |n ). For CT and n = 4, this results in a space of
(1,605 fixes consist of one AST change), while Jhotdraw has 414 = 2,825,761 possible shapes (approx 106 ). In practice,
only 2 transactions of 8 AST changes. We then computed overall all projects, for small shapes (i.e. less or equal than 3
the median number of attempts to find the correct shape changes), a well-defined probability distribution can guide to
of those 23,048 fix transactions. Since this number highly the correct shape in a median time lower than 200 attempts.
depends on the probability distributions Px , we computed the This again show that the probability distribution over the repair
median repair time for all combinations of fix size transactions, model is so unbalanced that the likelihood of possible shapes
projects, and heuristics discussed above (8 × 14 × 6). is concentrated on less than 104 shapes (i.e. that the probability
Table V presents the results of this evaluation for repair density over |CT |n is really sparse).
space CT and transaction bag 5-SC. For each project, the Now, what is the best heuristic, with respect to shaping, to
bold values give the median repairability in terms of number train our probabilistic repair models? For each repair shape
of attempts required to find the correct repair shape with a size of Table V and heuristic, we computed the median
maximum likelihood approach. Then, the bracketed values repairability over all projects of the dataset (a median of
give the number of transactions per transaction size (size in median number of attempts). We also compute the median
number of AST changes) and per project. For instance, over repairability for a baseline of a uniform distribution (EQP)
996 fix transactions of size 1 in the ArgoUML repository, over the repair model (i.e. ∀i, P (ri ) = 1/|CT |)). Figure 4
it takes an average of 6 attempts to find the correct repair presents this data for repair model CT. It shows the median
shape. On the contrary, for the 51 transactions of size 8 in the number of attempts required to identify correct repair shapes as
Tomcat repository, it takes an average of 34,240 attempts to Y-axis. The X-axis is the number of repair actions in the repair
find the correct repair shape. Those results are encouraging: test (the size). Each line represents probability estimation
for small transactions, it takes a handful of attempts to find heuristics.
the correct repair shape. The probability distribution over the Figure 4 gives us important pieces of information. First, the
repair model seems to drive the search efficiently. The other heuristics yield different repair time. For instance, the repair
heuristics yield similar results – the complete results (6 tables time for heuristic 1-SC is generally higher than for 20-SC.
– one per heuristic) are given in [21]. Overall, there is a clear order between the repairability time:
Regarding cross-validation, one can see that the performance over the 14 runs (one per project) is similar (all columns of Table V contain numbers of the same order of magnitude). Given our cross-validation procedure, this means that for all projects, we are able to predict the correct shapes using only knowledge mined in the other projects. This gives us confidence in the generalizability of our results.

Moreover, the number of shapes needed to fix small transactions (up to 8 AST changes) has an order of magnitude of 10^4 and not more. Theoretically, for a given fix shape of n AST changes, the size of the search space of shapes is the number of repair actions of the model raised to the power n (e.g. |CT|^n). For CT and n = 4, this results in a space of 41^4 = 2,825,761 possible shapes (approx. 10^6). In practice, over all projects, for small shapes (i.e. of 3 changes or fewer), a well-defined probability distribution can guide the search to the correct shape in a median time lower than 200 attempts. This again shows that the probability distribution over the repair model is so unbalanced that the likelihood of possible shapes is concentrated on less than 10^4 shapes (i.e. the probability density over |CT|^n is really sparse).
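To make the shaping procedure concrete, here is a minimal sketch (ours, not the paper’s tooling) of maximum-likelihood shaping: candidate shapes are ranked by decreasing likelihood and the number of attempts is the 1-based rank of the correct shape. We assume a shape is an unordered multiset of repair actions whose likelihood is the product of the individual action probabilities; the action names and probabilities are illustrative placeholders, not the mined CT values.

import java.util.*;

// Minimal sketch of maximum-likelihood shaping: enumerate all shapes of a
// given size, rank them by likelihood, and report the rank of the correct
// shape as the number of attempts. Names and probabilities are illustrative.
public class ShapeRanking {

  // Likelihood of an (unordered) shape, assuming independent action draws.
  static double likelihood(List<String> shape, Map<String, Double> p) {
    double l = 1.0;
    for (String action : shape) l *= p.get(action);
    return l;
  }

  // Enumerate all multisets of size n over the given actions.
  static List<List<String>> shapes(List<String> actions, int n) {
    List<List<String>> result = new ArrayList<>();
    enumerate(actions, 0, n, new ArrayList<>(), result);
    return result;
  }

  static void enumerate(List<String> actions, int from, int left,
                        List<String> current, List<List<String>> out) {
    if (left == 0) { out.add(new ArrayList<>(current)); return; }
    for (int i = from; i < actions.size(); i++) {
      current.add(actions.get(i));
      enumerate(actions, i, left - 1, current, out);
      current.remove(current.size() - 1);
    }
  }

  public static void main(String[] args) {
    Map<String, Double> p = new LinkedHashMap<>();
    p.put("statement-insert", 0.40); // illustrative probabilities
    p.put("statement-update", 0.30);
    p.put("statement-delete", 0.20);
    p.put("statement-move",   0.10);

    List<List<String>> all = shapes(new ArrayList<>(p.keySet()), 2);
    all.sort((a, b) -> Double.compare(likelihood(b, p), likelihood(a, p)));

    // The correct shape of this hypothetical fix: one insert and one delete.
    List<String> correct = Arrays.asList("statement-insert", "statement-delete");
    // Attempts = 1-based rank of the correct shape in the sorted enumeration.
    System.out.println("attempts = " + (all.indexOf(correct) + 1));
  }
}

Under this multiset assumption, the CT model with 41 actions and shapes of size 3 yields only C(43, 3) = 12,341 candidate shapes, which is small enough to rank exhaustively.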
Now, what is the best heuristic, with respect to shaping, to train our probabilistic repair models? For each repair shape size of Table V and each heuristic, we computed the median repairability over all projects of the dataset (a median of median numbers of attempts). We also computed the median repairability for a baseline uniform distribution (EQP) over the repair model (i.e. ∀i, P(r_i) = 1/|CT|). Figure 4 presents this data for repair model CT: the Y-axis shows the median number of attempts required to identify correct repair shapes, the X-axis is the number of repair actions in the repair test (the size), and each line represents a probability estimation heuristic.

Figure 4 gives us important pieces of information. First, the heuristics yield different repair times. For instance, the repair time for heuristic 1-SC is generally higher than for 20-SC. Overall, there is a clear order between the repair times: for transactions with less than 5 repair actions, heuristic 5-SC gives the best results, while for bigger transactions 20-SC is the best. Interestingly, certain heuristics are inappropriate for maximum-likelihood shaping of real bug fixes: the resulting probability distributions yield a repair time that explodes even for small shapes (this is the case for the uniform distribution EQP, even for shapes of size 3). Also, all median repair times tend toward infinity for shapes of size larger than 9. Finally, although 1-SC is not good over many shape sizes, we note that it is the best for small shapes of size 1. This is explained by the empirical setup (where we also decompose transactions by shape size).
1) On The Best Heuristics for Computing Probability Distributions over Repair Actions: To sum up, for small repair shapes heuristic 1-SC is the best with respect to probabilistic repair shaping, but it is not efficient for shapes of more than two AST-level changes. Heuristics 5-SC and 20-SC are the best for changes of size greater than 2. An important point is that some probability distributions (in particular those built from heuristics EQP and 1-SC) are really suboptimal for quickly navigating the search space.

Do those findings hold for repair model CTET, which has a finer granularity?
2) On The Difference between Repair Models CT and CTET: We have also run the whole evaluation with the repair model CTET (see II-C). The empirical results are given in the companion technical report [21] (in the same form as Table V). Figure 5 is the sibling of Figure 4 for repair model CTET. The two figures look rather different. The main striking point is that with repair model CTET, we are able to find the correct repair shape only for fixes that are no larger than 4 AST changes. Beyond that, the arithmetic of very low probabilities results in a virtually infinite time to find the correct repair shape. On the contrary, in repair model CT, even for fixes of 7 changes, one could find the correct shape in a finite number of attempts. Finally, in this repair model the average time to find a correct repair shape is several times larger than in CT (in CT, the shape of fixes of size 3 can be found in approx. 200 attempts; in CTET, it is more around 6,000).
For a given repair shape, the synthesis consists of finding concrete instances of the repair actions. For instance, if the predicted repair action in CTET consists of inserting a method call, it remains to predict the target object, the method and its parameters. We can assume that the more precise the repair action, the smaller the “synthesis space”. For instance, in CTET, the synthesis space is smaller compared to CT, because CTET is only composed of enriched versions of the basic repair actions of repair model CT (for instance, inserting an “if” instead of inserting a generic statement).
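A toy computation (ours; every count is hypothetical) makes the contrast concrete: an abstract CT-style “insert statement” must consider every candidate statement at every place, whereas a more precise CTET-style “insert method invocation” only considers the invocations.

// Toy contrast between the synthesis spaces of an abstract CT-style action
// and a more precise CTET-style action. All counts are hypothetical.
public class SynthesisSpace {
  public static void main(String[] args) {
    int places = 13;              // candidate insertion places in the program
    int candidateStatements = 8;  // any statement can be inserted (CT)
    int candidateInvocations = 3; // only method invocations (CTET)

    // CT: "insert statement" -> any candidate statement at any place.
    System.out.println("CT:   " + (places * candidateStatements) + " candidates");
    // CTET: "insert method invocation" -> a strict subset of the above.
    System.out.println("CTET: " + (places * candidateInvocations) + " candidates");
  }
}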
Our results illustrate the tension between the richness of the repair model and the ease of fixing bugs automatically. When we consider CT, we find likely repair shapes quickly (less than 5,000 attempts), even for large repairs, but at the price of a larger synthesis space. In other words, there is a balance between finding correct repair actions and finding concrete repairs. When the repair actions are more abstract, the synthesis space is larger; when the repair actions are more concrete, it hampers the likelihood of concentrating on likely repair shapes first. We conjecture that the profile based on CT is better because of the following two points: it enables us to find bigger correct repair shapes (good) in a smaller amount of time (good).

Finally, we think that our results empirically explore some of the foundations of “repairing”: there is a difference between prescribing aspirin (it has a high likelihood of contributing to healing, but only partially) and prescribing a specific medicine (one may try many medicines before finding the perfect one).

VI. ACTIONABLE GUIDELINES FOR AUTOMATED SOFTWARE REPAIR

Our results blend empirical findings with theoretical insights. How can they be used within an approach for automated software repair? This section presents actionable guidelines arising from our results. We apply those guidelines in a case study that consists of reasoning on a simplified version of GenProg within our probabilistic framework.

A. Consider Using a Probability Distribution over Repair Actions

Automated software repair embeds a set of repair actions, either explicitly or implicitly. On two different repair models, we have shown that the importance of each repair action greatly varies. Furthermore, our mathematical analysis has shown that considering a uniform distribution over repair actions is extremely suboptimal.

Hence, from the viewpoint of the time to fix a bug, we recommend setting up a probability distribution over the considered repair actions. This probability distribution can be learned on past data, as we do in this paper, or simply tuned with an incremental evaluation process. For instance, Le Goues et al. [30] have done similar probabilistic tuning over their three repair actions. Overall, using a probability distribution over repair actions could significantly speed up the repair process.
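As an illustration of this guideline, the sketch below (ours, not extracted from any repair tool) draws repair actions in proportion to mined probabilities instead of uniformly; the action names and weights are invented for the example.

import java.util.*;

// Sketch of biasing a repair loop with a learned probability distribution
// over repair actions, instead of picking actions uniformly at random.
public class WeightedActionPicker {
  private final List<String> actions = new ArrayList<>();
  private final List<Double> cumulative = new ArrayList<>();
  private final Random random = new Random();

  WeightedActionPicker(Map<String, Double> distribution) {
    double sum = 0;
    for (Map.Entry<String, Double> e : distribution.entrySet()) {
      sum += e.getValue();
      actions.add(e.getKey());
      cumulative.add(sum); // cumulative distribution function
    }
  }

  // Draw one repair action, proportionally to its learned probability.
  String next() {
    double r = random.nextDouble() * cumulative.get(cumulative.size() - 1);
    for (int i = 0; i < cumulative.size(); i++)
      if (r < cumulative.get(i)) return actions.get(i);
    return actions.get(actions.size() - 1);
  }

  public static void main(String[] args) {
    Map<String, Double> mined = new LinkedHashMap<>();
    mined.put("statement-insert", 0.45); // illustrative mined frequencies
    mined.put("statement-update", 0.35);
    mined.put("statement-delete", 0.20);
    WeightedActionPicker picker = new WeightedActionPicker(mined);
    for (int i = 0; i < 5; i++) System.out.println(picker.next());
  }
}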
B. Be Aware of the Interplay between Shaping and Synthesis

We have shown that having more precise shapes has a real impact on shaping time. In repair model CT, for fix shapes of size 3, the logical shaping time is approximately 150 attempts. In repair model CTET, for shapes of the same size, the average logical time jumps to around 4,000, which represents more than a ten-fold increase. Our work quantitatively highlights the impact of considering more precise repair actions. By being aware of the interplay between shaping and synthesis, the research community will be able to create a disciplined catalog of repair actions and to identify where the biggest synthesis challenges lie.

C. Analyze the Repairability Depending on the Fix Size

We have shown that certain repair shapes are impossible to find because of their size. In repair model CT, the shapes of more than 10 repair actions are not found in a finite time. In repair model CTET, the repair shapes of more than 5 actions are not found either. Given that a repair shape is an abstraction over a concrete bug fix, if one cannot find the abstraction, there is no chance to find the concrete bug fix.

Our analysis for identifying this limit is agnostic of the repair actions. Hence one can use our methodology and equation to analyze the size of the “findable” fixes. Our probabilistic framework enables one to understand the theoretical limits of certain repair processes.
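Such an analysis can be run numerically. The sketch below (ours) assumes that every repair action of the correct shape has the same probability p (here an illustrative p = 0.10) and that attempts are independent draws; the shape probability p^n then makes the median number of attempts grow exponentially with the fix size n.

// Sketch of the "findable fix size" analysis: with per-action probability p,
// the correct shape of size n has probability p^n, and the median number of
// attempts (median of a geometric distribution) explodes as n grows.
public class FindableFixes {
  public static void main(String[] args) {
    double p = 0.10; // illustrative per-action probability
    for (int n = 1; n <= 8; n++) {
      double pShape = Math.pow(p, n);
      double attempts = Math.ceil(Math.log(0.5) / Math.log(1.0 - pShape));
      System.out.printf("size %d -> ~%.0f attempts%n", n, attempts);
    }
  }
}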
Let us now apply those three guidelines on a small case study.

D. Case Study: Reasoning on GenProg within our Probabilistic Framework

We now aim at showing that our model also enables us to reason on Weimer et al.’s [2] example program. This program, shown in Listing 1, implements Euclid’s greatest common divisor algorithm, but runs into an infinite loop if a = 0 and b > 0. The fix consists of adding a “return” statement on line 6.

1  // insert 1
2  if (a == 0) { // ast 1
3    // insert 2
4    System.out.println(b); // ast 2
5    // insert 3
6  }
7  // insert 4
8  while (b != 0) { // infinite loop // ast 3
9    // insert 5
10   if (a > b) { // ast 4
11     // insert 6
12     a = a - b; // ast 5
13     // insert 7
14   } else {
15     // insert 8
16     b = b - a; // ast 6
17     // insert 9
18   }
19   // insert 10
20 }
21 // insert 11
22 System.out.println(a); // ast 7
23 // insert 12
24 return; // ast 8
25 // insert 13
26 }

Listing 1. The infinite loop bug of Weimer et al. [2]. Code insertions can be made at 13 places; 8 AST subtrees can be deleted or copied.
a) Probability Distribution: In Weimer et al.’s repair approach, the repair model consists of three repair actions: inserting statements, deleting statements, and swapping statements (in more recent versions of GenProg, swapping has been replaced by “replacing”). By statements, they mean AST subtrees. With a uniform probability distribution, the logical time to find the correct shape is 4 (from Equation 1). If one favors insertion over deletion and swap, for instance by setting p_insert = 0.6, the median logical time to find the correct repair action becomes 2, which is twice as fast. Between 2 and 4, the difference seems negligible, but for larger repair models, the difference might be counted in days, as we show now.
b) Shaping and Synthesis: In the GCD program of Listing 1, there are n_place = 13 places where n_ast = 8 AST statements can be inserted. In this case, the size of the synthesis space can be formally approximated: the number of possible insertions is n_place × n_ast; the number of possible deletions is n_ast; the number of possible swaps is (n_ast)^2.
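As a worked instantiation of those counts for Listing 1:

n_place × n_ast + n_ast + (n_ast)^2 = 13 × 8 + 8 + 8^2 = 104 + 8 + 64 = 176 concrete repairs.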
This enables us to apply our probabilistic reasoning at the level of concrete fixes as follows. We define the concrete repair distribution as: p_insert(ast_i, place_k) = p_insert / (n_place × n_ast), p_delete(ast_j) = p_delete / n_ast, and p_swap(ast_i, ast_j) = p_swap / (n_ast)^2.

With a uniform distribution p_insert = p_delete = p_swap = 1/3, Formula 1 yields that the logical time to fix this particular bug (insertion of node #8 at place #3) is 219 attempts (note that this is not a shaping time anymore, but the real number of required runs). However, we observed over real bug fixes that p_insert > p_delete (see Table III). What if we distort the uniform distribution over the repair model to favor insertion? The following table gives the results for arbitrary distributions spanning different kinds of distribution:

p_insert   p_delete   p_swap   Logical time
 .33        .33        .33          219
 .39        .28        .33          185
 .45        .22        .33          160
 .40        .40        .20          180
 .50        .30        .20          144
 .60        .20        .20          120

This table shows that, as soon as we favor insertion over deletion of code, the logical time to find the repair actually decreases.
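The table’s numbers can be reproduced with a small computation (ours). We do not restate Formula 1 here; the sketch below assumes it gives the median number of independent draws needed to sample the correct concrete repair, i.e. the median of a geometric distribution (the smallest k such that 1 − (1 − p)^k ≥ 1/2). Under this assumption the six rows above come out as 219, 185, 160, 180, 144 and 120 attempts, and the fault-localization scenario of the next paragraph comes out as 118 runs.

// Sketch: logical time read as the median of a geometric distribution with
// success probability p. It reproduces the table above for Listing 1,
// where n_place = 13 and n_ast = 8.
public class LogicalTime {
  static long medianAttempts(double p) {
    // Smallest k such that 1 - (1 - p)^k >= 1/2.
    return (long) Math.ceil(Math.log(0.5) / Math.log(1.0 - p));
  }

  public static void main(String[] args) {
    int nPlace = 13, nAst = 8;
    double[] pInsert = {.33, .39, .45, .40, .50, .60};
    for (double pi : pInsert) {
      // The correct fix is one specific insertion among nPlace * nAst.
      double p = pi / (nPlace * nAst);
      System.out.printf("p_insert=%.2f -> %d attempts%n", pi, medianAttempts(p));
    } // prints 219, 185, 160, 180, 144, 120

    // Fault localization: halving the places (7 instead of 13) under the
    // uniform distribution drops the logical time from 219 to 118 runs.
    System.out.println(medianAttempts(.33 / (7 * nAst)));
  }
}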
Interestingly, the same kind of reasoning applies to fault localization. Let us assume that a fault localizer filters out half of the possible places where to modify code (i.e. n_place = 7). Under the uniform distribution and the concrete repair space, the logical time to find the fix decreases from 219 to 118 runs.

c) Repairability and Fix Size: We consider the same model but on larger programs with fault localization, for instance 100 AST nodes and 20 potential places for changes. Let us assume that the concrete fix consists of inserting node #33 at place #13. Under a uniform distribution, the corresponding repair time according to Formula 1 is ≥ 20,000 runs. Let us now assume that the concrete fix consists of two repair actions: inserting node #33 at place #13 and deleting node #12. Under a uniform distribution, the repair time becomes 636,000 runs, a 30-fold increase.

Obviously, for the sake of static typing and runtime semantics, the nodes cannot be inserted anywhere, resulting in a lower number of runs. However, we think that more than the logical time, what matters is the order of magnitude of the difference between the two scenarios. Our results indicate that it is very hard to find concrete fixes that combine different repair actions.

Let us now be rather speculative. Those simulation results contribute to the debate on whether past results on evolutionary repair are either evolutionary or guided random search [31]. According to our simulation results, it seems that the evolutionary part (combining different repair actions) is indeed extremely challenging. On the other hand, our simulation does not involve fitness functions; it is only guided random search, what we would call “Monte Carlo” repair. A good fitness function might counter-balance the combinatorial explosion of repair actions.

VII. RELATED WORK

d) Empirical Studies of Versioning Transactions: Purushothaman and Perry [14] studied small commits (in terms of number of lines of code) of proprietary software at Lucent Technology. They showed the impact of small commits with respect to introducing new bugs, and whether they are oriented toward corrective, perfective or adaptive maintenance. German [11] asked different research questions on what he calls “modification requests” (small improvements or bug fixes), in particular with respect to authorship and change coupling (files that are often changed together). Alali and colleagues [13] discussed the relations between different size metrics for commits (number of files, LOC and number of hunks), along the same lines as Hattori and Lanza [12], who also consider the relationship between commit keywords and engineering activities. Finally, Hindle et al. [10], [32] focus on large commits, to determine whether they reflect specific engineering activities such as license modifications. Compared to these studies on commits, which mostly focus on metadata (e.g. authorship, commit text) or size metrics (number of changed files, number of hunks, etc.), we discuss the content of commits and the kind of source code change they contain. Fluri et al. [33] and Vaucher et al. [34] studied the versioning history to find patterns of change, i.e. groups of similar versioning transactions.
Pan et al. [24] manually identified 27 bug fix patterns in Java software. Those patterns are precise enough to be automatically extractable from software repositories. They provide and discuss the frequencies of occurrence of those patterns in 7 open-source projects. This work is closely related to ours: we both identify automatically extractable repair actions of software. The main difference is that our repair actions are discovered fully automatically based on AST differencing (there is no prior manual analysis to find them). Furthermore, since our repair actions are meant to be used in an automated program repair setup, they are smaller and more atomic.
Kim et al. [35] use versioning history to mine project-specific bug fix patterns. Williams and Hollingsworth [36] also learn some repair knowledge from versioning history: they mine how to statically recognize where checks on return values should be inserted. Livshits and Zimmermann [15] mine co-changed method calls. The difference with those close pieces of research is that we enlarge the scope of mined knowledge: from project-specific knowledge [35] to domain-independent repair actions, and from one single repair action [36], [15] to 41 and 173 repair actions.
e) Abstract Syntax Tree Differencing: The evaluation of AST differencing tools often gives hints about common change actions of software. For instance, Raghavan et al. [19] showed the six most common types of changes for the Apache web server and the GCC compiler, the number one being “Altering existing function bodies”. This example clearly shows the difference with our work: we provide change and repair actions at a much finer granularity. Similarly, Neamtiu et al. [20] give interesting numerical findings about software evolution, such as the evolution of added functions and global variables of C code, but their analysis also remains at a granularity that is coarser than ours. Fluri et al. [7] give some frequency numbers of their change types in order to validate the accuracy and the runtime performance of their distilling algorithm; those numbers were not, and were not meant to be, representative of the overall abundance of change types. Giger et al. [37] discuss the relations between 7 categories of change types and not the detailed change actions as we do.

f) Automated Software Repair: We have already mentioned many pieces of work on automated software repair (incl. [1], [2], [3], [4], [5], [38]). We have discussed in detail the relationship of our work with GenProg. Let us now compare with the other closely related papers.

Wei et al. [3] presented AutoFix-E, an automated repair tool which works with contracts. In our perspective, AutoFix-E is based on two repair actions: adding sequences of state-changing statements (called “mutators”) and adding a precondition (in the form of an “if” conditional). Their fix schemas are combinations of those two elementary repair actions. In contrast, we have 173 basic repair actions and we are able to predict repair shapes that consist of combinations of 4 repair actions. However, our approach is more theoretical than theirs. Our probabilistic view on repair may speed up their repair approach: it is likely that not all “fix schemas” are equivalent. For instance, according to our experience, adding a precondition is a very common kind of fix in real bugs.

Debroy et al. [39] invented an approach to repair bugs using mutations inspired by the field of mutation testing. The approach uses a fault localization technique to obtain candidate faulty locations. For a given location, it applies mutations, producing mutants of the program. Eventually, a mutant is classified as “fixed” if it passes the test suite of the program. Their repair actions are composed of mutations of arithmetic, relational, logical, and assignment operators. Compared to our work, mutating a program is a special kind of fix synthesis where no explicit high-level repair shapes are manipulated. Also, in the light of our results, we assume that a mutation-based repair process would be faster using probabilities on top of the mutation operators.
Kim et al. [40] introduced PAR, an algorithm that generates program patches using a set of 10 manually written fix templates. Like GenProg, the approach leverages evolutionary computing techniques to generate program patches. We share with PAR the idea of extracting repair knowledge from human-written patches. Beyond this high-level point in common, there are three important differences. First, they do a manual extraction of fix patterns (by reading 62,656 patches), while we automatically mine them from past commits. Second, PAR patterns and our repair actions are expressed at a different granularity: PAR patterns contain a specification of the context that matches a piece of AST, a specification of analysis (e.g. to collect compatible expressions in the current scope), and a specification of change; our repair actions correspond to this last part. While their patterns are operational, their change specifications are ad hoc (due to the process of manually specifying templates). On the contrary, our specification of repair actions is systematic and automatically extracted, but our approach is more theoretical and we do not fix concrete bugs. This shows again that the foundations of their approach contain more manual work than ours: a PAR pattern is a manually identified repair schema where all the synthesis rules are manually encoded. Finally, we think it is possible to marry our approaches by decorating their templates with probability distributions (whether mined or not) so as to speed up the repair.

VIII. CONCLUSION

In this paper, we have presented the idea that one can mine repair actions from software repositories. In other words, one can learn from past bug fixes the main repair actions (e.g. adding a method call). Those repair actions are meant to be generic enough to be independent of the kinds of bugs and the software domains. We have discussed and applied a methodology to mine the repair actions of 62,179 versioning transactions extracted from 14 repositories of 14 open-source projects. We have largely discussed the rationale and consequences of adding a probability distribution on top of a repair model. We have shown that certain distributions over repair actions can result in an infinite time (on average) to find a repair shape, while other fine-tuned distributions enable us to find a repair shape in hundreds of repair attempts.

The main direction of future work consists of going beyond empirical results and theoretical analysis. We are now exploring how to use this learned knowledge (in the form of probabilistic repair models) to fix real bugs. In particular, we are planning to use probabilistic models to see whether one can repair the bugs of PAR's and GenProg's datasets faster. The latter involves having a Java implementation of GenProg and would advance our knowledge on whether GenProg's efficiency is really language-independent (segfaults and buffer overruns do not exist in Java).

REFERENCES

[1] W. Weimer, "Patches as better bug reports," in Proceedings of the International Conference on Generative Programming and Component Engineering, 2006.
[2] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, "Automatically finding patches using genetic programming," in Proceedings of the International Conference on Software Engineering, 2009.
[3] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller, "Automated fixing of programs with contracts," in Proceedings of the International Symposium on Software Testing and Analysis, ACM, 2010.
[4] V. Dallmeier, A. Zeller, and B. Meyer, "Generating fixes from object behavior anomalies," in Proceedings of the International Conference on Automated Software Engineering, 2009.
[5] A. Arcuri, "Evolutionary repair of faulty software," Applied Soft Computing, vol. 11, no. 4, pp. 3494–3514, 2011.
[6] D. Gopinath, M. Z. Malik, and S. Khurshid, "Specification-based program repair using SAT," in Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2011.
[7] B. Fluri, M. Wursch, M. Pinzger, and H. Gall, "Change distilling: Tree differencing for fine-grained source code change extraction," IEEE Transactions on Software Engineering, vol. 33, pp. 725–743, Nov. 2007.
[8] M. Bóna, A Walk Through Combinatorics: An Introduction to Enumeration and Graph Theory. World Scientific, 2011.
[9] M. Martinez and M. Monperrus, "Mining repair actions for guiding automated program fixing," Tech. Rep., INRIA, 2012.
[10] A. Hindle, D. M. German, and R. Holt, "What do large commits tell us? A taxonomical study of large commits," in Proceedings of the International Working Conference on Mining Software Repositories, 2008.
[11] D. M. German, "An empirical study of fine-grained software modifications," Empirical Software Engineering, vol. 11, no. 3, pp. 369–393, 2006.
[12] L. Hattori and M. Lanza, "On the nature of commits," in Proceedings of the 4th International ERCIM Workshop on Software Evolution and Evolvability (EVOL), pp. 63–71, 2008.
[13] A. Alali, H. Kagdi, and J. Maletic, "What's a typical commit? A characterization of open source software repositories," in Proceedings of the IEEE International Conference on Program Comprehension, 2008.
[14] R. Purushothaman and D. Perry, "Toward understanding the rhetoric of small source code changes," IEEE Transactions on Software Engineering, vol. 31, pp. 511–526, June 2005.
[15] B. Livshits and T. Zimmermann, "DynaMine: finding common error patterns by mining software revision histories," in Proceedings of the European Software Engineering Conference held jointly with the International Symposium on Foundations of Software Engineering, 2005.
[16] R. Robbes, Of Change and Software. PhD thesis, University of Lugano, 2008.
[17] E. Giger, M. Pinzger, and H. Gall, "Comparing fine-grained source code changes and code churn for bug prediction," in Working Conference on Mining Software Repositories, 2011.
[18] M. Monperrus and M. Martinez, "CVS-Vintage: A dataset of 14 CVS repositories of Java software," Tech. Rep. hal-00769121, INRIA, 2012.
[19] S. Raghavan, R. Rohana, D. Leon, A. Podgurski, and V. Augustine, "Dex: a semantic-graph differencing tool for studying changes in large code bases," in Proceedings of the 20th IEEE International Conference on Software Maintenance, 2004.
[20] I. Neamtiu, J. S. Foster, and M. Hicks, "Understanding source code evolution using abstract syntax tree matching," in Proceedings of the International Workshop on Mining Software Repositories, 2005.
[21] M. Martinez and M. Monperrus, "Appendix of 'On Mining Software Repair Models and their Relations to the Search Space of Automated Program Fixing'," Tech. Rep. hal-00903804, INRIA, 2013.
[22] Department of Mathematics, University of York, "Statistical tables," http://www.york.ac.uk/depts/maths/tables/, last visited April 9, 2013.
[23] A. Murgia, G. Concas, M. Marchesi, and R. Tonelli, "A machine learning approach for text categorization of fixing-issue commits on CVS," in Proceedings of the International Symposium on Empirical Software Engineering and Measurement, 2010.
[24] K. Pan, S. Kim, and E. J. Whitehead, "Toward an understanding of bug fix patterns," Empirical Software Engineering, vol. 14, no. 3, pp. 286–315, 2008.
[25] R. Wu, H. Zhang, S. Kim, and S.-C. Cheung, "ReLink: recovering links between bugs and changes," in Proceedings of the 2011 Foundations of Software Engineering Conference, pp. 15–25, 2011.
[26] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu, "Fair and balanced? Bias in bug-fix datasets," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE '09), pp. 121–130, ACM, 2009.
[27] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[28] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, vol. 33, no. 1, pp. 159–174, 1977.
[29] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
[30] C. Le Goues, W. Weimer, and S. Forrest, "Representations and operators for improving evolutionary software repair," in Proceedings of GECCO, pp. 959–966, 2012.
[31] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in Proceedings of the 33rd International Conference on Software Engineering, pp. 1–10, ACM, 2011.
[32] A. Hindle, D. German, M. Godfrey, and R. Holt, "Automatic classification of large changes into maintenance categories," in Proceedings of the International Conference on Program Comprehension, 2009.
[33] B. Fluri, E. Giger, and H. C. Gall, "Discovering patterns of change types," in Proceedings of the International Conference on Automated Software Engineering, 2008.
[34] S. Vaucher, H. Sahraoui, and J. Vaucher, "Discovering new change patterns in object-oriented systems," in Proceedings of the Working Conference on Reverse Engineering, 2008.
[35] S. Kim, K. Pan, and E. J. Whitehead, "Memories of bug fixes," in Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2006.
[36] C. C. Williams and J. K. Hollingsworth, "Automatic mining of source code repositories to improve bug finding techniques," IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 466–480, 2005.
[37] E. Giger, M. Pinzger, and H. C. Gall, "Can we predict types of code changes? An empirical analysis," in Proceedings of the Working Conference on Mining Software Repositories, pp. 217–226, 2012.
[38] A. Carzaniga, A. Gorla, N. Perino, and M. Pezzè, "Automatic workarounds for web applications," in Proceedings of the 2010 Foundations of Software Engineering Conference, pp. 237–246, ACM, 2010.
[39] V. Debroy and W. Wong, "Using mutation to automatically suggest fixes for faulty programs," in Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), pp. 65–74, IEEE, 2010.
[40] D. Kim, J. Nam, J. Song, and S. Kim, "Automatic patch generation learned from human-written patches," in Proceedings of the 2013 International Conference on Software Engineering, pp. 802–811, IEEE Press, 2013.