0% found this document useful (0 votes)
28 views17 pages

Statistical Debugging: A Hypothesis Testing-Based Approach

SOBER is a new statistical method for automatically localizing software faults without prior program knowledge. It models predicate evaluation patterns in correct and incorrect executions, and identifies predicates with significantly different patterns as fault-relevant. An evaluation study shows SOBER locates 68/130 faults in the Siemens suite by examining 10% of code, outperforming previous techniques. SOBER is also effective under imperfect testing, indicating robustness to inadequate or partially labeled test suites.

Uploaded by

vvid vvays
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views17 pages

Statistical Debugging: A Hypothesis Testing-Based Approach

SOBER is a new statistical method for automatically localizing software faults without prior program knowledge. It models predicate evaluation patterns in correct and incorrect executions, and identifies predicates with significantly different patterns as fault-relevant. An evaluation study shows SOBER locates 68/130 faults in the Siemens suite by examining 10% of code, outperforming previous techniques. SOBER is also effective under imperfect testing, indicating robustness to inadequate or partially labeled test suites.

Uploaded by

vvid vvays
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO.

10, OCTOBER 2006 1

Statistical Debugging: A Hypothesis


Testing-Based Approach
Chao Liu, Member, IEEE, Long Fei, Member, IEEE, Xifeng Yan, Member, IEEE,
Jiawei Han, Senior Member, IEEE, and Samuel P. Midkiff, Member, IEEE

Abstract—Manual debugging is tedious, as well as costly. The high cost has motivated the development of fault localization techniques,
which help developers search for fault locations. In this paper, we propose a new statistical method, called SOBER, which automatically
localizes software faults without any prior knowledge of the program semantics. Unlike existing statistical approaches that select
predicates correlated with program failures, SOBER models the predicate evaluation in both correct and incorrect executions and regards
a predicate as fault-relevant if its evaluation pattern in incorrect executions significantly diverges from that in correct ones. Featuring a
rationale similar to that of hypothesis testing, SOBER quantifies the fault relevance of each predicate in a principled way. We
systematically evaluate SOBER under the same setting as previous studies. The result clearly demonstrates the effectiveness: SOBER
could help developers locate 68 out of the 130 faults in the Siemens suite by examining no more than 10 percent of the code, whereas the
Cause Transition approach proposed by Holger et al. [6] and the statistical approach by Liblit et al. [12] locate 34 and 52 faults,
respectively. Moreover, the effectiveness of SOBER is also evaluated in an “imperfect world,” where the test suite is either inadequate or
only partially labeled. The experiments indicate that SOBER could achieve competitive quality under these harsh circumstances. Two
case studies with grep 2.2 and bc 1.06 are reported, which shed light on the applicability of SOBER on reasonably large programs.

Index Terms—Debugging aids, statistical methods, statistical debugging.

1 INTRODUCTION
checks whether accesses to the buffer ever exceed the upper
T HE last decade has witnessed great advances in fault
localization techniques [1], [2], [3], [4], [5], [6], [7], [8], [9].
These techniques aim to assist developers in finding fault
bound. Statistics on the evaluations of predicates are
collected over multiple executions at runtime and analyzed
locations, which is one of the most expensive debugging afterward.
activities [10]. Fault localization techniques can be roughly The method described in this paper shares the principle
classified as static or dynamic. A static analysis detects of predicate-based dynamic analysis. However, by explor-
program defects by checking the source codes with or ing detailed statistics about predicate evaluation, our
without referring to a well-specified program model [1], [2], method can detect more and subtler faults than the state-
[3]. A dynamic analysis, on the other hand, typically tries to of-the-art statistical debugging approach proposed by Liblit
locate defects by contrasting the runtime behavior of correct et al. [12]. For easy reference, we denote this method as
and incorrect executions. Dynamic techniques usually do LIBLIT05. For each predicate P in a program P, LIBLIT05
not assume any prior knowledge of program semantics estimates two conditional probabilities:
other than the labeling of each execution as either correct or
P r1 ¼ P rðP failsjP is ever observedÞ
incorrect. Previous studies deploy a variant of program
runtime behaviors for fault localization, such as program and
spectra [11], [4], memory graphs [5], [6], and program
predicate evaluation history [7], [12]. P r2 ¼ P rðP failsjP is ever observed as trueÞ:
Within dynamic analyses, techniques based on predicate
It then treats the probability difference P r2  P r1 as an
evaluations have been shown to be promising for fault
indicator of how relevant P is to the fault. Therefore,
localization [13], [14], [7], [12]. Programs are first instru-
LIBLIT05 essentially regards a predicate fault-relevant if its
mented with predicates such that the runtime behavior in
true evaluation correlates with program failures.
each execution is encoded through predicate evaluations. While LIBLIT05 succeeded in isolating faults in some
Consider the predicate “idx < LENGTH,” where the variable widely used software [12], it has a potential problem in its
idx is an index into a buffer of length LENGTH. This predicate ranking model. Because LIBLIT05 only considers whether a
predicate has ever been evaluated as true or not in each
. C. Liu, X. Yan, and J. Han are with the Department of Computer Science, execution, it loses its power to discriminate when a
University of Illinois at Urbana-Champaign, Urbana, IL 61801. predicate P is observed as true at least once in all
E-mail: {chaoliu, xyan, hanj}@cs.uiuc.edu. executions. In this case, P r1 is equal to P r2 , which suggests
. L. Fei and S.P. Midkiff are with the School of Electronic and Computer that the predicate P has no relevance to the fault. In
Engineering, Purdue University, West Lafayette, IN 47907. Section 2, we will present an example where the most fault-
E-mail: {lfei, smidkiff}@purdue.edu.
relevant predicate reveals only a small difference between
Manuscript received 12 Dec. 2005; revised 10 May 2006; accepted 4 Aug. P r1 and P r2 . We found that similar cases are not rare in
2006; published online DD Mmmm, YYYY.
Recommended for acceptance by R. Lutz. practice, as suggested by the experiments in Section 4.
For information on obtaining reprints of this article, please send e-mail to: The above issue motivates us to develop a new approach
[email protected], and reference IEEECS Log Number TSE-0327-1205. that can exploit multiple evaluations of a predicate within
0098-5589/06/$20.00 ß 2006 IEEE Published by the IEEE Computer Society
2 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

each execution. We start by treating the evaluations of a


predicate P as independent Bernoulli trials: Each evaluation
of P gives either true or false. We then estimate the
probability of P being true in each execution, which we call
the evaluation bias. While the evaluation bias of P may
fluctuate from one execution to another, its observed values
from multiple executions constitute a random sample from
a statistical model. Specifically, if we let X be the random
variable standing for the evaluation bias of predicate P ,
then there are two statistical models, fP ðXjCorrectÞ and
fP ðXjIncorrectÞ, which govern the evaluation bias observed
from correct and incorrect executions respectively. Intui-
tively, if the model fP ðXjIncorrectÞ is significantly different
Fig. 1. Faulty-code version 3 of replace.
from fP ðXjCorrectÞ, it is indicated that P ’s evaluation in
incorrect runs captures abnormal activity, and the predicate
P is likely relevant to the fault. Therefore, instead of same setting as previous studies. Seven existing fault
selecting predicates correlated with program failures as localization techniques are compared with SOBER in
done by LIBLIT05, our approach statistically models pre- this study, which demonstrates the superior accu-
dicate evaluations in both correct and incorrect runs, racy achieved by S OBER in fault localization.
respectively, and treats the model difference as a measure Furthermore, the effectiveness of SOBER is also
of the fault relevance. evaluated in an “imperfect world,” where the test
In quantifying the model difference between suite is either inadequate or partially labeled. The
fP ðXjCorrectÞ and fP ðXjIncorrectÞ, there are two major experimental results shows that SOBER is statistically
robust to these circumstances.
obstacles. First, we have no idea what family of distributions
the two models are in. Second, we are not authorized to 4. Finally, two case studies with grep 2.2 and bc 1.06
are reported, which illustrate the applicability of
impose model assumptions on fP ðXÞ because improper
SOBER on reasonably large programs. In particular, a
model assumptions can result in misleading inferences [15]. previously unreported fault is found in bc 1.06,
Therefore, without prior knowledge of the statistical based on the fault localization result from SOBER.
models, a direct measurement of the model divergence is
The rest of the paper is organized as follows: Section 2
difficult, if not fully impossible.
first presents a motivating example, which illustrates the
In this paper, we propose a hypothesis testing-based
advantages of modeling predicate evaluations within each
approach, which indirectly quantifies the model difference.
execution. We elaborate on the statistical model, ranking the
Aiming at the model difference, we first propose the null
hypothesis that the two models are identical. We then algorithm and its relationship with program invariants in
derive a test statistic that conforms to a normal distribution Sections 3. An extensive comparison between SOBER and
under the null hypothesis through the Central Limit existing techniques is presented in Section 4, followed by the
Theorem [15]. Finally, given observed evaluation biases evaluation of SOBER in an “imperfect world” in Section 5.
from multiple executions (both correct and incorrect), the The two case studies with grep 2.2 and bc 1.06 are reported
instantiated test statistic quantifies the likelihood that the in Section 6. With related work and threats to validity
evaluation biases observed from incorrect runs were discussed in Section 7, Section 8 concludes this study.
generated as if from fP ðXjCorrectÞ. Therefore, a smaller
likelihood suggests a larger discrepancy between the two 2 A MOTIVATING EXAMPLE
models, and, hence, a greater likelihood that the predicate P
is fault-relevant. Using this quantification, we can rank all In this section, we present a detailed example that illustrates
the instrumented predicates, getting a ranked list of the advantage of modeling predicates in a probabilistic
suspicious predicates. Developers can then examine the list way. This example inspires us to locate faults by quantify-
from the top down in debugging. ing the divergence between the models of correct and
In summary, we make the following contributions in this incorrect executions.
paper: The program in Fig. 1 is excerpted from the third faulty
version of the replace program in the Siemens suite. The
1. We propose a probabilistic treatment of program program replace has 507 lines of C code (LOC) and it
predicates that models how a predicate is evaluated performs regular expression matching and substitutions.
within each execution, which exploits more detailed The second subclause in line 7 was intentionally commen-
information than previous methods [7], [12]. In ted out by the Siemens researchers to simulate a type of
addition, this probabilistic treatment naturally en- fault that may sneak in if the developer fails to think fully
compasses the concept of program invariants [16] as about the if condition. Since this is essentially a logic error
a special case. that does not incur program crashes, even experienced
2. On top of the probabilistic treatment of predicates, developers would have to use a conventional debugger for
we develop a theoretically well-motivated ranking step-by-step tracing. Our question is: Can we guide developers
algorithm, SOBER, that ranks predicates according to the faulty location or its vicinity by contrasting the runtime
to how abnormally each predicate evaluates in behaviors between correct and incorrect executions?
incorrect executions. Intuitively, the more abnor- For clarity in what follows, we denote the program with
mal the evaluations, the more likely the predicate
is fault-relevant. the subclause ðlastm ! ¼ mÞ commented out as the incorrect
3. We systematically evaluate the effectiveness of (or faulty) program P, and the one with the subclause (i.e.,
SOBER on the Siemens suite [17], [18] under the ðlastm ! ¼ mÞ is not commented out) as the correct
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 3

b
Fig. 2. Branching actions in (a) P and (b) P.
Fig. 3. (a) A correct and (b) an incorrect execution in P.

program P. b Because Pb is certainly not available when P is


because neither A ¼ true nor A ¼ false is an invariant in
debugged, Pb is used here to illustrate how our method is correct or incorrect executions, invariant-based methods
motivated. As shown in Section 3, our method collects cannot detect that A is a suspicious predicate. LIBLIT05 does
statistics only from the faulty program P, not from P.b not regard A as suspicious either because it does not model
We declare two Boolean variables, A and B, as follows: the predicate evaluation within each execution (see Sec-
tion 3.7 for details).
A ¼ ðm >¼ 0Þ; The above example illustrates a simple but representa-
B ¼ ðlastm ! ¼ m Þ; tive case where a probabilistic treatment of predicates
captures detailed information about predicate evaluations.
Let us consider the four possible evaluation combinations of In the next section, we describe the statistical model and the
A and B, and their corresponding branching actions (either ranking algorithm that implement this intuition.
enter or skip the block from lines 8 through 11) in both P and P.b
b
Fig. 2 explicitly lists the actions in P (Fig. 2a) and P (Fig. 2b). 3 PREDICATE RANKING MODELS
The left panel shows the actual actions taken in the faulty 3.1 Problem Settings
program P, while the right one lists the expected actions in P.b Let T ¼ ft1 ; t2 ;    ; tn g be a test suite for program P. Each
Differences between the above two tables reveal that in test case ti ¼ ðdi ; oi Þ ð1  i  nÞ has an input di and the
the faulty program P, unexpected actions take place if and expected output oi . The execution of P on each test case ti
only if A ^ :B evaluates to true. Explicitly, when A ^ :B is gives the output oi0 ¼ Pðdi Þ. We say P passes the test case ti
true, the control flow actually enters the block, whereas it is (i.e., ti is a passing case) if and only if o0i is identical to oi ;
expected to skip the block if the logic was correct. This otherwise, P fails on ti (i.e., ti is a failing case). In this way,
incorrect control flow will likely lead to incorrect outputs. the test suite T is partitioned into two disjoint subsets Tp
Therefore, for the faulty program P, an execution is incorrect and Tf , corresponding to the passing and failing cases
if and only if there exist true evaluations of A ^ :B at line 7;
respectively:
otherwise, the execution is correct even though the program
contains a fault. Tp ¼ fti jo0i ¼ Pðdi Þ and o0i ¼ oi g;
While the predicate P : ðA ^ :BÞ ¼ true precisely char-
acterizes the scenario under which incorrect executions take Tf ¼ fti jo0i ¼ Pðdi Þ and o0i 6¼ oi g:
place, there is little chance for any fault locator to spot P as Since program P passes test case ti if and only if P executes
fault-relevant. The obvious reason is that while we are correctly, we use “correct” and “passing,” as well as
debugging P, Pb is not available. Therefore, we have no idea “incorrect” and “failing,” interchangeably in the following
of what B is, let alone its combination with A. On the other discussion.
hand, because the evaluation of A is observable in P, we are Given a faulty program P together with a test suite
interested in whether the evaluation of A can actually point
T ¼ Tp [ Tf , our task is to localize the suspicious fault region by
to the fault. Explicitly, if the evaluation of A in incorrect
executions significantly diverges from that in correct ones, contrasting P’s runtime behaviors on Tp and Tf .
the if statement at line 7 may be regarded as fault-relevant, 3.2 Probabilistic Treatment of Predicates
which exactly points to the fault location.
We, therefore, contrast how A is evaluated differently in In general, a program predicate is a proposition about any
correct and incorrect executions of P. Fig. 3 shows the program property, such as “idx < LENGTH,” “!emptyðlistÞ,”
number of true evaluations for the four combinations of A and “fooðÞ > 0.” As any instrumentation site can be touched
and B in one correct (Fig. 3a) and one incorrect (Fig. 3b) more than once due to program control flows, a predicate P
execution. The major difference between the two is that in a can be evaluated multiple times in one execution, and each
correct run, A ^ :B never evaluates true ðnAB ¼ 0Þ while evaluation produces either true or false. In order to model
n0AB must be nonzero for an execution to be incorrect. Since this within-execution behavior of P , we propose the concept
the true evaluation of A ^ :B implies A ¼ true, we of evaluation bias, which estimates the probability of the
expect that the probability for A to be true is different in
predicate P being evaluated as true.
correct and incorrect executions. In running 5,542 test cases,
the true evaluation probability is 0.2952 in a correct Definition 1 (Evaluation Bias). Let nt be the number of times
execution and 0.9024 in an incorrect execution, on average. that predicate P evaluates to true, and nf the number of
This divergence suggests that the fault location (i.e., line 7) times it evaluates to false, in one execution. ðP Þ ¼ nt nþn
t
is
does exhibit detectable abnormal behaviors in incorrect f
the observed evaluation bias of predicate P in this particular
executions. Our method, as described in Section 3, nicely
captures this divergence and ranks A ¼ true as the top execution.
fault-relevant predicate. This predicate readily leads the Intuitively, ðP Þ estimates the probability that P takes
developer to the fault location. Meanwhile, we note that the value true in each evaluation. If the instrumentation site
4 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

of P is touched at least once (i.e., nt þ nf 6¼ 0), ðP Þ varies in 3.4 Predicate Ranking
the range of [0, 1]: ðP Þ is equal to 1 if P always holds, to 0 if The lack of prior knowledge about fP ðXjÞ constitutes one
it never holds, and in between for all other sets of outcomes. of the major obstacles in calculating the similarity (or
If the predicate is never evaluated, ðP Þ has a singularity 0/ difference, equivalently) between fP ðXjp Þ and fP ðXjf Þ. If
0. In this case, since we have no evidence to favor either the closed form of fP ðXjp Þ and fP ðXjf Þ were given,
true or false, we set ðP Þ to 0.5 for fairness. Finally, if a measures used in information theory [19], such as the
predicate is never evaluated in any failing runs, it has relative entropy, would immediately apply. Meanwhile, we
nothing to do with program failures and is hence eliminated are not authorized to impose model assumptions, like
normality, on fP ðXÞ because improper assumptions can
from the predicate ranking.
lead to misleading inferences. Therefore, given the above
3.3 Methodology Overview difficulties in directly measuring the model difference, in
this paper we propose an indirect approach that measures
We formulate the main idea of our method in this section the difference between fP ðXjp Þ and fP ðXjf Þ without any
and then develop its details in Section 3.4. Following the model assumption.
convention in statistics, we use uppercase letters for Aiming at the model difference, we first propose the null
random variables and the corresponding lowercase letters hypothesis that H0 : fP ðXjp Þ ¼ fP ðXjf Þ, i.e., there is no
for their realizations. Moreover, fðXjÞ is a general notation difference between the two models. Letting X ¼
of the probability model for the random variable X that is ðX1 ; X2 ;    ; Xm Þ be a random sample from fP ðXjf Þ (i.e.,
indexed by the parameter . observed evaluation bias from m failing cases), we derive a
Let the entire test case space be T , which conceptually statistic Y , which, under the null hypothesis H0 , conforms to a
contains all the possible inputs and expected outputs. known distribution. If the realized statistic Y ðXÞ corresponds
According to the correctness of P on the test cases in T , T to an event that has a small likelihood of happening, the null
can be partitioned into two disjoint sets T p and T f for hypothesis H0 is likely invalid, which suggests that a
passing and failing cases. Therefore, the available test suite T nontrivial difference exists between fP ðXjp Þ and fP ðXjf Þ.
and its partitions Tp and Tf can be treated as a random We choose to characterize fP ðXjÞ through its population
sample from T , T p , and T f , respectively. Let X be the mean  and variance 2 , so that the null hypothesis H0 is
random variable for the evaluation bias of predicate P . We
then use fP ðXjp Þ and fP ðXjf Þ to denote the statistical p ¼ f and 2p ¼ 2f : ð3Þ
model for the evaluation bias of P in T p and T f , respectively. Let X ¼ ðX1 ; X2 ;    ; Xm Þ be an independent and identi-
Therefore, the evaluation bias from running a test case t can cally distributed (i.i.d.) random sample from fP ðXjf Þ.
be treated as an observation from fP ðXjÞ, where  is either Under the null hypothesis, we have EðXi Þ ¼ f ¼ p
p or f depending on whether t is passing or failing. Given and V arðXi Þ ¼ 2f ¼ 2p . Because Xi 2 ½0; 1, both EðXi Þ
the statistical models for both passing and failing runs, we and V arðXi Þ are finite. According to the Central Limit
then define the fault relevance of P as follows: Theorem [15], the following statistic
Definition 2 (Fault Relevance). A predicate P is relevant to Pm
Xi
the hidden fault if its underlying model fP ðXjf Þ diverges Y ¼ i¼1 ; ð4Þ
m
from fP ðXjp Þ, where X is the random variable for the
2
evaluation bias of P . asymptotically conforms to Nðp ; mp Þ, a normal distribution
2
The above definition relates fP ðXjÞ, the statistical model with mean p and variance mp .
for P ’s evaluation bias, to the hidden fault. Naturally, the Let fðY jp Þ be the probability density function of the
2
larger the difference between fP ðXjf Þ and fP ðXjp Þ, the normal distribution Nðp ; mp Þ. Then, the likelihood Lðqp jY Þ
more relevant P is to the fault. Let LðP Þ be an arbitrary of p given the observed Y is
similarity function,
Lðp jY Þ ¼ fðY jp Þ: ð5Þ
LðP Þ ¼ SimðfðXjp Þ; fðXjf ÞÞ: ð1Þ
A smaller likelihood implies that H0 is less likely to hold,
The ranking score sðP Þ can be defined as gðLðP ÞÞ, where which, in turn, indicates a larger difference between fP ðXjp Þ
gðxÞ can be any monotonically decreasing function. We and fP ðXjf Þ. Therefore, we can reasonably instantiate the
here choose gðxÞ ¼ logðxÞ because logðxÞ effectively similarity function in (1) with the likelihood function
measures the relative magnitude even when xs are closed
to 0 (certainly, x must be positive). Therefore, the fault LðP Þ ¼ Lðp jY Þ: ð6Þ
relevance score sðP Þ is defined as According to the property of normal distribution, the
sðP Þ ¼ logðLðP ÞÞ: ð2Þ normalized statistic

Using this fault relevance score, we can rank all the Y  p


Z¼ pffiffiffiffiffi ð7Þ
instrumented predicates, and the top-ranked ones are p = m
regarded more likely to be fault-relevant. Therefore, the
asymptotically conforms to the standard normal distribu-
fault localization problem boils down to the setting of the tion Nð0; 1Þ, and
similarity function, which, in turn, consists of two subpro-
blems: 1) What is a suitable similarity function LðP Þ, and pffiffiffiffiffi
m
2) how is LðP Þ computed when the closed form of fP ðXjÞ fðY jp Þ ¼ ’ðZÞ; ð8Þ
p
is unknown? In Sections 3.4 and 3.5, we examine the two
problems in detail. where ’ðZÞ is the probability density function of Nð0; 1Þ.
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 5

Combining (2), (6), (5), and (8), we finally get the fault- Without loss of generality, a predicate P is a program
relevance ranking score for predicate P as invariant on a test suite C if and only if it always evaluates
  true during the execution of C. In practice, the test suite C
p is usually chosen to be a set of passing cases so that the
sðP Þ ¼ logðLðP ÞÞ ¼ log pffiffiffiffiffi : ð9Þ
m’ðZÞ summarized invariants characterize the correct behavior of
the subject program [16]. During the failing executions,
3.5 Discussions on Score Computation these invariants are either conformed (i.e., still evaluate
First, in order to calculate sðP Þ using (9), we need to true) or violated (i.e., evaluate false at least once), and
estimate the population mean p and the standard error those violated invariants are regarded as hints for debug-
p of fP ðXjp Þ. Let X0 ¼ ðX10 ; X20 ;    ; Xn0 Þ be a random ging. In some special cases, the test suite C is chosen to be a
sample from fP ðXjp Þ (which corresponds to the observed time interval during which the execution is believed to be
evaluation bias from the n passing runs), then p and p correct. One typical example is that for software that runs
can be estimated as for a long time, such as Web servers, the execution is likely
Pn correct at the beginning of the execution [14].
X0 According to Definition 1, the evaluation bias of an
p ¼ X0 ¼ i¼1 i ð10Þ
n invariant is always 1. Taking the set of passing cases Tp as
and C, we know that, if the predicate P is an invariant, p ¼ 1
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi and p ¼ 0. Moreover, the following theorem proves that
1 X n the fault relevance score function of (9) naturally identifies
p ¼ SX 0 ¼ ðX 0  X 0 Þ2 : ð11Þ both invariant violations and conformations.
n  1 i¼1 i
pffiffiffiffiffi Theorem 1. Let P be any invariant summarized from a set of
Second, because the m in (9) does not affect the relative correct executions Tp . sðP Þ ¼ þ1 if P is violated in at least
order between predicates, it can be safely dropped in one faulty execution, and sðP Þ ¼ 1 if P is conformed in all
practice. However, as simple algebra would reveal, the m in faulty executions.
pffiffiffiffiffi
(4) and the m in (7) cannot be discarded, because they Proof. Let x ¼ ðx1 ; x2 ;    ; xm Þ be a realized random sample,
properly scale the statistics for standard normality as which corresponds to the observed evaluation biases
required by the Central Limit Theorem. from the mP failing runs. Once P is violated in at least one
Finally, we note that although the derivation of (9) is execution, m i¼1 xi 6¼ m. It then follows from (7) that
based on the asymptotic behavior, i.e., when m ! þ1,
Pm
statistical inference suggests that the asymptotic result is c xi  mp
still valid even when the sample size is nowhere near z ¼ ; where c ¼ i¼1 pffiffiffiffiffi 6¼ 0;
p m
infinity [15]. In the fault localization scenario, it is true that
we cannot have an infinite number of failing cases. But as and then
shown in experiments, (9) still works well in ranking rffiffiffiffiffiffi rffiffiffiffiffiffi c2 t2
abnormal predicates even when only a small number of p 2 p 2 e2
failing cases are available. lim pffiffiffiffiffi ¼ lim ¼ lim
p !0 m’ðzÞ m p !0 e12 ðc Þ2 m t!1 t
We now use a concrete example to conclude the rffiffiffiffiffiffi
p

discussion in this subsection. The example illustrates how 2 c2 t2


2
the fault relevance score of the predicate P ¼ ðA ¼ trueÞ is ¼c lim te 2 ¼ þ1:
m t!1
calculated for the program in Fig. 1.
First, by running the 130 failing and the 5,412 passing cases Thus, (9) gives sðP Þ ¼ þ1. This means that SOBER treats
(i.e., m ¼ 130 and n ¼ 5;412) on the instrumented program, violated invariants as the most abnormal predicates and
the numbers of true and false evaluations are recorded at ranks them at the top.
runtime for each execution. Then, the evaluation bias of P in On the other hand, if the invariant P is not violated in
each execution is calculated based on Definition 1. Next, the any failing execution, we have
statistic Y ¼ 0:9024 is directly obtained from the evaluation Pm
xi  mp 0
biases in failing cases according to (4). Similarly, from passing lim z ¼ lim i¼1pffiffiffiffiffi ¼ lim pffiffiffiffiffi ¼ 0;
cases, we get p ¼ 0:2952 and p ¼ 0:2827 according to (10) p !0 p !0 mp p !0 mp

and (11), respectively. Plugging the calculated Y , p , p and and, therefore,


m ¼ 130 into (7), we get Z ¼ 24:4894. Finally, from (9), the
fault relevance score for predicate P is 297.2. p p
lim pffiffiffiffiffi ¼ lim pffiffiffiffiffi ¼ 0;
Besides illustrating how sðP Þ is calculated, this example p !0m’ðzÞ p !0 m’ð0Þ
also shows the role played by the log operator in (9).
which immediately leads to sðP Þ ¼ 1. This suggests
Although the log operator does not influence the ranking of
that conformed invariants are regarded as the least
predicates, it helps scale down the calculated score, which
might otherwise overflow in numeric computation. abnormal, and are ranked at the bottom by our method. t
u
Theorem 1 indicates that, if a fault can be caught by
3.6 Generalizing Invariants invariant violations as implemented in the DIDUCE [14]
In this section, we demonstrate how the probabilistic project, SOBER can also detect it because the fault relevance
treatment of predicate evaluations encompasses program score for the violated invariant is þ1. Meanwhile, for
invariants [16] as a special case. Moreover, we also prove conformed invariants, SOBER simply discards them due to
that the fault relevance score in (9) readily identifies both the 1 score. Previous research suggests that invariant
invariant violations and conformations. violations by themselves can only locate a number of faults
6 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

TABLE 1
Characteristics of Subject Programs

in the Siemens suite [20]. As will be shown shortly, our P : }ðm >¼ 0Þ ¼ true};
method SOBER, being a superset of invariant-based meth- IncreaseðP Þ ¼ 0:0104, and the Increase value for predicate
ods, actually achieves the best fault localization results on
the Siemens suite. P 0 : }ðm >¼ 0Þ ¼ false}
3.7 Differences between SOBER and Liblit05 is 0:0245. According to [12], neither P nor P 0 is ranked on
Because both LIBLIT05 and SOBER are based on a statistical top since they are either negative or too small. Thus,
analysis of predicate evaluations, we now illustrate the LIBLIT05 fails to identify the fault. In comparison, SOBER
differences in this section. successfully ranks the predicate P as the most suspicious
In principle, LIBLIT05 contrasts the probability that one predicate. Intuitively, this is because the evaluation bias in
execution crashes if the predicate P is ever observed true, failing executions (0.9024) significantly diverges from that
and that if P is observed (either true or false) in the in passing ones (0.2952).
execution. Specifically, the authors define

ContextðP Þ ¼ P rðCrashjP observedÞ; ð12Þ 4 EMPIRICAL COMPARISON WITH EXISTING


F ailureðP Þ ¼ P rðCrashjP observed trueÞ; ð13Þ
TECHNIQUES
In this section, we empirically evaluate the effectiveness of
and take the probability difference SOBER in fault localization. We compare SOBER with seven
existing fault localization algorithms under the same setting
IncreaseðP Þ ¼ F ailureðP Þ  ContextðP Þ ð14Þ
as previous studies. Section 4.1 first describes the experi-
as one of the two key components of P ’s fault relevance mental setup, which includes the subject programs, the
score. The other component is the number of failing runs metric for localization quality, and the implementation
where P is ever observed as true. A harmonic average is details. We briefly explain the seven fault localization
then taken to combine these two components. algorithms in Section 4.2. Detailed comparison results are
A detailed examination reveals fundamental differences presented in Sections 4.3 and 4.4. Finally, Section 4.5
between LIBLIT05 and SOBER. First, from the methodologi- compares these algorithms from different perspectives
cal point of view, LIBLIT05 estimates how much more likely other than the localization accuracy.
an execution crashes if the predicate P is observed as true
in comparison with if P is observed as either true or false. 4.1 Experimental Setup
This indicates that LIBLIT05 places a greater value on In this study, we use the Siemens suite as the subject
predicates whose true evaluation correlates with program programs. The Siemens suite was originally prepared by
crashes. SOBER, on the other hand, models the evaluation Siemens Corp. Research in a study of test adequacy criteria
distribution of the predicate P in passing (i.e., fP ðXjp Þ) and [17]. It contains 132 faulty versions of seven subject
failing (i.e., fP ðXjf Þ) executions, respectively, and regards programs, where each faulty version contains one and only
predicates with large differences between fP ðXjf Þ and one manually injected fault. Table 1 lists the characteristics
fP ðXjp Þ as fault-relevant. Therefore, SOBER and LIBLIT05 of the seven subject programs. The medians of the failing
actually follow two fundamentally different approaches, and passing cases are taken over all the faulty versions of
although both of them rank predicates statistically. Sec- each subject program. Readers interested in more details
ondly, SOBER explores the multiple evaluations of pre- about the Siemens suite are referred to [17], [18].
dicates within one execution while LIBLIT05 overlooks this Previously, many researchers investigating fault locali-
information. For instance, if a predicate P evaluates as true zation have reported their results on the Siemens suite [11],
at least once in each execution, and has different likelihood [20], [4], [6]. Because no failures are observed for the 32nd
to be true in passing and failing executions, LIBLIT05 version of the replace program and the 10th version of the
simply overlooks P while SOBER can readily capture the schedule2 program on the accompanying test suites, these
evaluation divergence. two versions are excluded in previous studies [4], [6], [21],
Let us reexamine the program in Fig. 1 presented in as well as in this one.
Section 2. The faulty statement (line 7) is executed in almost In order to objectively quantify the localization accuracy,
every execution. Within each run, it evaluates multiple an evaluation framework based on program static depen-
times as either true or false. In this case, LIBLIT05 has little dencies is adopted in this study. This measure was
discrimination power. Specifically, for the predicate originally proposed by Renieris and Reiss [4], and was
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 7

later adopted by Cleve and Zeller in reporting the quality of approach is closer to practice. Moreover, the ranking-based
CT [6]. We briefly summarize this measure as follows: T -score is not as generally applicable as the PDG-based,
because it requires a ranking of all statements. For example,
1. Given a (faulty) program P, its program depen- none of the discussed algorithms in Section 4.2, except
dence graph (PDG) is written as G, where each TARANTULA, can be evaluated using the ranking-based
statement is a vertex and there is an edge between
approach, but TARANTULA can be evaluated by the PDG-
two vertices if two statements have data and/or
based T -score by taking the top k statements as a fault
control dependencies.
localization report.
2. The vertices corresponding to faulty statements are
In this study, we compare SOBER with seven existing
marked as defect vertices. The set of defect vertices is
written as Vdefect . fault localization algorithms (described in the next section).
3. Given a fault localization report R, which is a set of Among them, we implemented LIBLIT05 in Matlab and
suspicious statements, their corresponding vertices validated the correctness of the implementation with the
are called blamed vertices. The set of blamed vertices original authors. For the other six algorithms, the localiza-
is written as Vblamed . tion result on the Siemens suite is taken directly from their
4. A developer can start from Vblamed and perform a corresponding publications.
breadth-first search until he reaches one of the defect We instrumented the subject programs with two kinds of
vertices. The set of statements covered by the predicates: branches and function returns, which are
breadth-first search is written as Vexamined . described in detail in [7], [12]. In particular, we treat each
5. The T -score, defined as follows, measures the branch conditional as one inseparable instrumentation unit,
percentage of code that has been examined in order and do not consider each subclause separately. For better
to reach the fault: fault localization, one may be tempted to introduce more
predicates. But the introduction of more predicates is a
jVexamined j double-edged sword. On the positive side, an expanded set
T¼  100%; ð15Þ
jV j of predicates is more likely to cover the faulty code; but the
where jV j is the size of the program dependence superfluous predicates brought in can nontrivially compli-
graph G. In [4], [6], the authors used 1  T as an cate the predicate ranking. So far, no agreement has been
equivalent measure. reached on what are the “golden predicates.” At runtime,
The T -score estimates the percentage of code a devel- the evaluation of predicates is collected without sampling
oper needs to examine (along the static dependencies) for both LIBLIT05 and SOBER.
before the fault location is found, when a fault localization All experiments in this section were carried out on a
report is provided. A high quality fault localization is 3.2 GHz Intel Pentium-4 PC with 1 GB physical memory,
expected to be a small set of statements that are close to running Fedora Core 2. In calculating the T -scores, we used
(or contain) the fault location. The above definition of CODESURFER 1.9 with patch 3 to generate the program
T -score is immediately applicable to localizations that dependence graphs. Because PDGs generated by CODE-
consist of a set of “blamed” statements. For algorithms that SURFER may vary with different build options, the factory
generate a ranked list of all predicates, like LIBLIT05 and default (by enabling the factory-default switch) is
SOBER, the corresponding statements of the top k pre- used to allow reproducible results in the future. Moreover,
dicates are taken as a fault localization report. The optimal the Matlab source code of SOBER and the instrumented
k is the one that minimizes the average examined code Siemens suite are available online at http://www.ews.
over a set of faults under study, i.e., uiuc.edu/~chaoliu/sober.htm.
kopt ¼ arg min E½Tk : ð16Þ 4.2 Compared Fault Localization Algorithms
k
We now briefly explain the seven fault localization
where E½Tk  is the average T -score for the given set of faults algorithms we compare with SOBER. As LIBLIT05 is already
for any fixed k. discussed in Section 3.7, we only describe the other six
As the above defined T -score is calculated based on algorithms below:
PDGs, we call it PDG-based. Recently, another kind of
T -score was used by Jones and Harrold in reporting the . Set-Union. This algorithm is based on the program
localization results of TARANTULA [8]. The TARANTULA spectra difference between a failing case f and a set
tool produces a ranking of all executable statements, and the of passing cases P . Specifically, let SðtÞ be the
authors calculate the T -score directly from the ranking. program spectra of running the test case t. Then, the
Instead of taking the top k statements and calculating the set difference between SðfÞ and the union spectra of
T -score based on PDGs, the authors examine whether the cases in P is taken as the fault localization report R,
faulty statements are ranked high. Specifically, a developer i.e., R ¼ SðfÞ  [pi 2P Sðpi Þ. This algorithm is de-
is assumed to examine statement by statement from the top scribed in [4], and we denote it by UNION for
of the ranking until a faulty statement is touched. The brevity.
percentage of statements examined by then is taken as the . Set-Intersect. A complementary algorithm to UNION
T -score. We call the T -score calculated in this way is also described in [4]. It is based on the set
ranking-based. Apparently, the ranking-based T -score as- difference between the spectra of the failing case
sumes a different code examination strategy than that and the intersection spectra of passing cases, namely,
assumed by the PDG-based, i.e., along the ranking rather the localization report R ¼ \pi 2P Sðpi Þ  SðfÞ. We
than along the dependencies. Intuitively, the PDG-based denote this algorithm by INTERSECT.
8 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

Fig. 4. Located faults with regard to code examination. (a) Interval comparison. (b) Cumulative comparison.

. Nearest Neighbor. The nearest neighbor approach, calculated from f 0 and the erroneous output,
proposed by Renieris and Reiss in [4], contrasts the respectively. Finally, the intersection of F S and
failing case to the passing case that most “resembles” BS, i.e., the chop, is taken as the fault localization
the failing case. Namely, the localization report report, namely, R ¼ F S \ BS. We denote this
R ¼ SðfÞ  SðpÞ, where p is the nearest passing case algorithm by SLICECHOP.
to f as measured under certain distance metrics. The In previous studies, comparisons of some of the above
authors studied two distance metrics and found that algorithms are reported. Specifically, Renieris and Reiss
the nearest neighbor search based on the Ulam’s found that NN/PERM outperformed both UNION and
distance renders better fault localization. This algo- INTERSECT [4], whereas Cleve and Zeller later reported
rithm is denoted as NN/PERM by the original that a better result than NN/PERM was achieved by CT [6].
authors. These reported results are all based on the PDG-based
. Cause Transition. The Cause Transition algorithm T -score. As CT achieves the best localization result as
[6], denoted as CT, is an enhanced variant of Delta measured with the PDG-based T -score, we compare SOBER
Debugging [5]. Delta Debugging contrasts the with CT and LIBLIT05 in Section 4.3 using the same
memory graph [22] of one failing execution, ef , measure. Because TARANTULA and SLICECHOP results
against that of one passing execution, ep . By carefully are not reported with the PDG-based T -score, we compare
manipulating the two memory graphs, Delta Debug- SOBER with them separately in Section 4.4.
ging systematically narrows the difference between
ef and ep down to a small set of suspicious variables. 4.3 Comparison with LIBLIT05 and CT
CT enhances Delta Debugging by exploiting the In this section, we compare SOBER with CT and LIBLIT05.
notion of cause transition: “moments where new We subject both LIBLIT05 and SOBER to the 130 faults in the
relevant variables begin being failure causes” [6]. Siemens suite and measure their localization quality using
Therefore, CT essentially implements the concept of the PDG-based T -score (15). The result of CT is directly cited
“search in time” in addition to the original “search in from [6].
space” used in Delta Debugging. Fig. 4a depicts the number of faults that can be located
. Tarantula. The TARANTULA tool was originally when a certain percentage of code is examined by a
presented to visualize the test information for each developer. The x-axis is labeled with T -score. For LIBLIT05
statement in a subject program, and it was shown to and SOBER, we choose the top five predicates to form the set
be useful for fault localization [23]. In a recent study of blamed vertices. Because localization that still requires
[8], the authors took ð1  hueðsÞÞ as the fault developers to examine more than 20 percent of the code is
relevance score for the statement s, where hueðsÞ is generally useless, we treat only [0, 20] as the meaningful
the hue component of each statement in visualiza- T -score range. Under these circumstances, S OBER is
tion [23]. With the fault relevance score calculated apparently better than LIBLIT05, while both of them are
for each statement, TARANTULA produces a ranking consistently superior to CT.
of all executable statements. Developers are ex- For practical use, it is instructive to know how many (or
pected to examine the ranking from the top down what percentage of) faults can be identified when no more
to locate the fault. than  percent of the code is examined. We therefore plot
. Failure-Inducing Chops. Gupta et al. recently the cumulative comparison in Fig. 4b. It clearly suggests
propose a fault localization algorithm that inte- that both SOBER and LIBLIT05 are much better than CT and
grates delta debugging and dynamic slicing [9]. that SOBER outperforms LIBLIT05 consistently. Although
First, a minimal failure-inducing input f 0 is derived LIBLIT05 catches up when the T -score is 60 percent or
from the given failing case f using the algorithms higher, we regard this advantage as irrelevant because it
of Zeller and Hildebrandt [24]. Then, a forward hardly makes sense for a fault locator to require a developer
dynamic slice, F S, and a backward slice, BS, are to examine more than 60 percent of the code.
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 9

Fig. 5. Quality Comparison with regard to various top k Values. (a) Top one. (b) Top two. (c) Top three. (d) Top four. (e) Top five. (f) Top six. (g) Top
seven. (h) Top eight.

Fig. 4b shows that, for the 130 faults in the Siemens suite, Oððn þ mÞ  k þ k  logðkÞÞ. Similarly, LIBLIT05 also needs
when a developer examines at most 1 percent of the code, Oðn þ mÞ to score each predicate, and its time complexity is
CT catches 4.65 percent of the faults while LIBLIT05 and also Oððn þ mÞ  k þ k  logðkÞÞ. We experimented with the
SOBER capture 7.69 percent and 8.46 percent, respectively. 31 faulty versions of the replace program, and the average
Moreover, when 10 percent code examination is acceptable, time for unoptimized LIBLIT05 and SOBER to analyze each
CT and LIBLIT05 identify 34 (26.36 percent) and 52 version was 11.7775 seconds and 11.3844 seconds, respec-
(40.00 percent) of the 130 faults. SOBER is the best of the
tively. This is much faster than CT, as reported in [6].
three, locating 68 (52.31 percent) of the 130 faults, which is
16 faults more than the state-of-the-art approach LIBLIT05. If 4.4 Comparison with Tarantula and SLICECHOP
the developer is patient enough to examine 20 percent of the We now compare SOBER with TARANTULA and SLICE-
code, 73.85 percent of the faults (i.e., 96 of 130) can be CHOP. Recently, Jones and Harrold. [8] reported the result
located by SOBER. of TARANTULA on the Siemens suite with the ranking-
We also vary the parameter k in calculating the T -score
based T -score, and compared it with previous PDG-based
for both LIBLIT05 and SOBER. The quality comparison is
T -scores of CT, NN/PERM, INTERSECT, and UNION. As it is
plotted in Fig. 5 for k varying from 1 through 8. The
unclear to what extent these two kinds of T -score agree with
comparison is confined within the [0, 20] T -score range.
each other, we assume they are equivalent, as Jones and
Since detailed results about CT is not available in [6], CT is
Harrold did in [8]. More investigation, however, is needed
still depicted only at the 1, 10, and 20 ticks. Fig. 5 shows that
to clarify this issue in the future. Moreover, because the
LIBLIT05 is the best when k is equal to 1 or 2. When k ¼ 3,
SOBER catches up, and it consistently outperforms LIBLIT05
afterward. Because developers are always interested in
locating faults with minimal code checking, it is desirable to
select the optimal k that maximizes the localization quality.
We found that both LIBLIT05 and SOBER achieve their best
quality when k is equal to 5. In addition, Fig. 6 plots the
quality of SOBER with various k-values. It clearly indicates
that SOBER locates the largest number of faults when k is
equal to 5. Therefore, the setting of k ¼ 5 in Fig. 4 is
justified. Finally, Fig. 6 also suggests that too few predicates
(e.g., k ¼ 1) may not convey enough information for fault
localization, while too many predicates (e.g., k ¼ 9) are in
themselves a burden for developers to examine and, thus,
neither of them leads to the best result.
Besides being accurate in fault localization, SOBER is also
computationally efficient. Suppose we have n correct and
m incorrect executions. Then, the time complexity of scoring
each predicate is Oðn þ mÞ. If there are, in total, k predicates
instrumented, the entire time complexity of SOBER is Fig. 6. Quality of SOBER with regard to top-k values.
10 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

exclude the program tot_info because, at that time, their


framework could not handle floating point operations. For
the remaining five subject programs, which consist of 66
faulty versions in total, another 28 faulty versions were
excluded for various reasons, leaving 38 versions used in
the final evaluation. The authors reported that for 23 out of
the 38 versions, no more than 10.4 percent of the source
code needed to be examined. We checked the quality of
SOBER on the 66 versions of the five subject programs, and
found that the T -score is less than 10.4 percent on
43 versions. Moreover, within the 38 faulty versions
examined by SLICECHOP in [9], SOBER has a T -score of less
than 8.4 percent on 27 versions. Because the ratio of
examined code was not reported for each of the 38 versions
in [9], no further comparison is performed here between
SOBER and SLICECHOP.

4.5 Comparison from Other Perspectives


Fig. 7. Quality comparison between existing algorithms. A comprehensive comparison between fault localization
algorithms is hard, and many aspects must be considered
authors failed to compare TARANTULA with statistical for a fair comparison. For example, some important aspects
debugging in [8], this study fills the gap. are the runtime overhead, analysis complexity, localization
We differ from previous comparisons in choosing to accuracy, and the accessibility of final fault localization
compare algorithms in terms of the absolute number of reports. So far, we have been focusing on localization
faulty versions on which an algorithm renders a T -score of accuracy and have demonstrated that SOBER is one of the
no more than  percent. Previously, different subsets of the most accurate algorithms. However, when compared on
Siemens suite were used by different authors, and the other aspects, SOBER might be inferior to other techniques,
percentages based on the different subsets are put together at least in its current state.
for comparison [4], [6], [8], [9]. Specifically, the reported First, some techniques, like NN/P ERM , C T and
percentages for UNION, INTERSECT and NN/PERM are SLICECHOP, only need one failing and multiple passing
based on 109 faulty versions, and the percentage for CT is cases for fault localization, whereas SOBER, LIBLIT05 and
based on 129 versions. In the previous section, the TARANTULA, in principle, need to collect the statistics from
percentages for LIBLIT05 and SOBER are calculated on the multiple failing cases. Secondly, SOBER could be inferior to
whole-set 130 versions. In a recent study of TARANTULA LIBLIT05 in terms of the runtime overhead due to
and SLICECHOP [8], [9], 122 and 38 faulty versions are used instrumentation. Specifically, since LIBLIT05 is based on
by the original authors, respectively. the predicate coverage data, the instrumentation on a
Therefore, based on the reported percentage and the predicate can be disabled once the predicate has been
chosen subset of faulty versions, we recover how many evaluated (in a similar way to Jazz [25]). In contrast, SOBER
faults are located by each algorithm with a T -score no more needs to count the evaluation frequency throughout the
than  percent, and Fig. 7 shows the effectiveness compar- execution. Finally, some algorithms, including TARANTULA,
ison in terms of the absolute number of faulty versions. LIBLIT05 and Delta Debugging, have provided visual
Because the study of SLICECHOP excluded 91 faulty versions, interfaces to increase their accessibility. Currently, no visual
for fairness it is not plotted in Fig. 7. Instead, we compare interface is available for SOBER, but one could be added in
SOBER with SLICECHOP separately later. the future.
Fig. 7 clearly shows that the effectiveness of the seven
algorithms is at three different levels. The algorithms UNION
and INTERSECT are the least effective and NN/PERM and CT
5 SOBER IN AN IMPERFECT WORLD
are in the middle, with CT being better than NN/PERM. The Besides the probabilistic treatment of program predicates,
other three algorithms: LIBLIT05, TARANTULA and SOBER, there are two other factors that implicitly contribute to
apparently have the best result on the Siemens suite. SOBER’s effectiveness shown in Section 4. First, the test suite
We now compare TARANTULA with SOBER in detail. in the experiment is reasonably adequate given the program
They both locate 68 faults when the T -score is no more than code size: Each subject program of the Siemens suite is
10 percent. When the T -score is less than 1 percent, accompanied by a few thousand test cases.1 Intuitively,
TARANTULA and SOBER locate 17 and 11 faults, respec- more-reliable statistics can be collected from a more-
tively. On the other hand, with the T -score no more than adequate test suite and would enable SOBER to produce
better fault localizations. Second, by taking the fault-free
20 percent, SOBER can help locate 96 out of the 130 faults,
version as the test oracle, each execution is precisely labeled
whereas TARANTULA helps locate 75. Since the comparison
as either passing or failing. This provides SOBER with a
is based on the assumption of the equivalence between the
noise-free analysis environment, which likely benefits
PDG-based and ranking-based T -scores, we refrain from SOBER’s inference ability.
drawing conclusions about the relative superiority of either Although these two elements are highly desirable for
method. Ultimately, the effectiveness of all fault localization
quality localization, they are usually not available in
algorithms will be assessed by end-users in practice.
We now compare SOBER with SLICECHOP. In the study 1. In this paper, we take the number of test cases as a rough measure of
of SLICECHOP [9], the authors excluded the program tcas the test adequacy. More involved discussion about test adequacy is out of
from the Siemens suite due to its small size, and they the scope of this study.
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 11

Fig. 8. Quality degradation with regard to  percent-sampled test suite. (a) Quality of SOBER with regard to sampled test suite. (b) Quality of LIBLIT
with regard to sampled test suite.

practice due to the potential high cost. For example, because Specifically, for each faulty version of the Siemens suite,
the program specification varies from one component to we randomly sample a portion  ð0 <   1Þ of the original
another, exclusive test scripts for each component must be test suite T . Suppose T consists of N test cases. Then,
prepared by human testers. Although some tools can help dN  e cases are randomly taken, constituting a -sampled
expedite the generation of test cases [26], [27], [28], critical test suite, denoted as T . Because both SOBER and LIBLIT05
manual work is still unavoidable. Furthermore, besides the need at least one failing case, the above sampling is
difficulty of test case generation, the test oracle is even repeated until at least one failing case is included. Finally,
harder to construct. Again because of variations in program both SOBER and LIBLIT05 are run on the same T for each
functionality, it is usually humans developers who prepare faulty program.
the expected outputs or pass judgment about the correct- Fig. 8 plots how the quality varies with different
ness of outputs in practice. sampling rates for both SOBER and LIBLIT05. We set 
Therefore, considering the difficulty of obtaining an equal to 100 percent, 10 percent, 1 percent and 0.1 percent,
adequate test suite and a test oracle, we regard the respectively, so that T100% represents the entire test suite
environment that we experimented with in Section 4 as “a and each of the following is roughly one-tenth as small as
perfect world.” In order to shed some light on how SOBER the previous one. As  gets smaller, the localization quality
would work in practice, in this section we subject SOBER to of both SOBER and LIBLIT05 gradually degrades. For
an “imperfect world,” where adequate test suites and test example, in Fig. 8a, curves for smaller s are strictly below
oracles are not simultaneously available. Section 5.1 exam- those for higher sampling rates. A similar pattern for
ines SOBER’s robustness to test inadequacy, and Section 5.2 LIBLIT05 is also observed in Fig. 8b. These observations are
studies how SOBER handles partially labeled test suites. easily explainable. In statistical hypothesis testing, the
We regard, and hence believe, that the examination of confidence of either accepting or rejecting the null hypothesis
SOBER in an “imperfect world” is both necessary and is, in general, proportional to the number of observations.
interesting. To some extent, this examination bridges the Because SOBER bears a similar rationale to hypothesis
gap between the perfect-world experiments (i.e., Section 4) testing, its quality naturally improves as more and more test
and real-world practices that cannot be fully covered in any cases are observed. Because LIBLIT05 relies on the accurate
single research paper. We simulate the imperfect world estimation of the two conditional probabilities, its quality
with the 130 faulty versions of the Siemens suite. In parallel also improves with more labeled test cases due to the Law
with SOBER, LIBLIT05 is also subjected to the same of Large Numbers.
experiments for a comparative study, which illustrates In Fig. 8a, one can also notice that the curve for  ¼ 10%
how the two statistical debugging algorithms react to the is quite close to the highest. This suggests that SOBER
imperfect world. obtains competitive results even when the test suite is only
one-tenth of the original. Moreover, Fig. 8 also indicates that
5.1 Robustness to Inadequate Test Suites even when  is as low as 0.1 percent, both SOBER and
Because of the cost of an adequate test suite, people usually LIBLIT05 are still consistently better than CT. Based on the
settle for inadequate but nevertheless satisfactory suites in typical suite size from Table 1, T0:1% contains at most six test
practice. For instance, during the prototyping stage, one cases, at least one of which is failing. As one can see, even
may not bother much with an all-around testing, and a with such an insufficient test suite, both SOBER and
preliminary test suite usually suffices. We now simulate an LIBLIT05 still outperform CT. For example, without exam-
inadequate test suite by sampling (without replacement) the ining more than 20 percent of the code, SOBER and LIBLIT05
accompanying test suite of the Siemens suite. The sampled locate 53.08 percent and 51.54 percent of the 130 faults
test suite becomes more and more inadequate as the respectively, while CT works well with 38 percent of the
sampling rate gets smaller. versions. This could be attributed to the underlying
12 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

The outer ellipse represents the entire test suite T . The


vertical line divides T into the full failing set Tft on the left
and the full passing set Tpt on the right. Certainly, Tf  Tft
and Tp  Tpt . As seen in Fig. 8a, the best localization is
achieved by SOBER when Tf ¼ Tft and Tp ¼ Tpt , i.e., when
the given test suite T is fully labeled.
Now, given the partially labeled test suite T , the most
straightforward scheme for SOBER is to analyze labeled test
cases Te only. Because Te is fully labeled, SOBER can be
Fig. 9. Two schemes to work with partial-labeled test suite. (a) Scheme
with labeled cases only. (b) Scheme with both labeled and unlabeled
immediately applied with Te . In fact, this scheme is
cases. equivalent to running SOBER on a -sampled test suite,
where  jTjTejj and is usually quite small. As a conservative
mechanism of CT: It localizes faults by systematically estimation,  can be around 1 percent. In considering the
contrasting the memory graphs of one passing and one Siemens programs,  ¼ 1% means that the tester examines
failing execution. However, because the faults in the tens among thousands of test cases and identifies about five
Siemens suite are mainly logic errors that rarely cause
failing runs on the average. In our opinion, this can be a
memory abnormalities, CT has difficulties in identifying the
“delta” and further locating the fault. On the other hand, reasonable workload for the tester.
This scheme, although straightforward, does render
because predicates express logic relations, it is no surprise
reasonable localization results. As shown in Fig. 8a,
that predicate-based algorithms work better.
Beside varying the sampling rate , we also examined “SOBER 1% Sampled” is clearly better than CT. But it is
how the quality changed with respect to the absolute size of also seen that a considerable gap exists between “SOBER
0.1% Sampled” and “SOBER 100% Sampled.” For concise
the test suite. However, because the size of the accompany-
reference, we use SOBER_FULL to denote “SOBER 100%
ing test suite and the failing rate drastically vary from one
Sampled” in the following. Although the same quality as
faulty version to another, it makes little sense to set a
SOBER_FULL is not (unrealistically) expected when T is
uniform size for the test suite for quality examination. We
partially labeled, we nevertheless believe that Tu can be
therefore refrain from doing so, but choose instead to study
utilized for better quality than that with Te only.
how the number of failing cases could affect the localization
The above straightforward scheme apparently overlooks
quality, as described in the next section.
the information contained in Tu . Although Tu bears no
5.2 Handling Partially Labeled Test Suites labeling information, its runtime statistics, if used properly,
Although an adequate test suite is difficult to obtain, can assist SOBER in fault analysis. In this study, we restrict
preparing a test oracle that can automatically recognize our discussion to reasonably developed programs that pass
each execution as either passing or failing is even harder. In all but a few test cases. One can judge whether this
some situations, test case generation can be relatively easy. assumption holds by examining the percentage of failing
For example, one can simply feed random strings to a cases in Te . For example, if a program fails most cases in Te ,
program that consumes string inputs. However, these test the fault could be quite easy to find. For reasonably
cases are hardly useful until we know the expected outputs. developed programs, we can choose to label all the
In practice, except for programs that can be described by unexamined test cases Tu as passing and apply SOBER to
a program model, the expected outputs are usually the regarded failing and passing set Tf0 and Tp0 , where Tf0 ¼ Tf
prepared by human testers, either manually or assisted by and Tp0 ¼ Tp [ Tu . The difference between the two schemes
tools. It is usually unrealistic for a tester to examine is visualized in Fig. 9.
thousands of executions and label them. Instead, a tester Let Tm represent the set of unexamined failing cases, i.e.,
will likely stop testing and return the faulty program to Tm ¼ Tft  Tf . Then, all the cases in Tm are mislabeled as
developers for patches when a small number of failing cases passing in the above treatment. While this mislabeling
are encountered. At that time, the examined cases are labeled unavoidably introduces impurity into Tp0 , the effect it has on
and the rest are unlabeled. This describes a typical scenario SOBER is minimal: The p calculated with Tp0 deviates
which exemplifies how partially labeled test suites arise in negligibly from that with Tpt because Tp0 ¼ Tp [ Tu ¼ Tpt [ Tm
practice. In this section, we examine how well SOBER helps and jTm j  jTf j  jTpt j.
developers locate the underlying faults, when the test suite On the other hand, by mislabeling Tm , we utilize the
is partially labeled. runtime statistics of the cases in Tpt  Tp , which are
Formally, given a test suite T , suppose a tester has otherwise disregarded. In this way, p can be estimated
examined and labeled a subset suite Te ðTe  T Þ. Because more accurately with Tp0 than with Tp only. This could
manual labeling is usually expensive, it is common that subsequently bring better localization quality. Therefore,
jTe j  jT j. Let Tp and Tf denote the set of passing and this is essentially a trade-off between grabbing more
failing runs identified by the tester. Then, Te ¼ Tp [ Tf and passing runs and (unavoidably) mislabeling some failing
Tp \ Tf ¼ . We use Tu to denote the unexamined part of the runs. In our belief, the gain from including more passing
suite, i.e., Tu ¼ T  Te . T is partially labeled if and only if executions should surpass the loss from mislabeling. As
Tu 6¼ . The set relationship is further depicted in Fig. 9a. will be shown shortly, this scheme achieves much better
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 13

Fig. 10. Quality comparison with regard to the number ðmÞ of labeled failing cases. (a) m ¼ 1. (b) m ¼ 3. (c) m ¼ 5. (d) m ¼ 10.

results than the straightforward scheme, and sometimes it 6 EXPERIMENTAL EVALUATION WITH LARGE
even obtains comparable results to SOBER_FULL. PROGRAMS
We simulate partially labeled test suites using the
Siemens programs. For each faulty version, we randomly Although the 130 faulty versions of the Siemens programs
select m failing cases as Tf (i.e., the set of failing cases are appropriate for algorithm comparison, the effectiveness
identified by the tester). According to the above scheme, all of SOBER nevertheless needs to be assessed on large
the remaining cases are regarded as passing, i.e., Tp0 . We programs. In this section, we report on the experimental
then run both SOBER and LIBLIT05 with the same Tp0 and Tf0 evaluation of SOBER on two (reasonably) large programs,
(recall that Tf0 ¼ Tf ) for each of the 130 faulty versions. We grep 2.2 and bc 1.06. Moreover, as two faults are located in
experiment with m equal to 1, 3, 5, and 10, respectively, and each program, this evaluation also illustrates how SOBER
this represents the increasing effort that the tester puts into helps developers handle multifault cases. The detailed
test evaluation. If a faulty version does not have m failing experimental results with grep 2.2 and bc 1.06 are
cases, we take all the failing cases. In the Siemens suite, presented in Sections 6.1 and 6.2, respectively.
there are 0, 4, 14, and 19 versions that have less than 1, 3, 5,
and 10 failing cases. These versions were not excluded 6.1 A Controlled Experiment with grep 2.2
because they do represent real situations. We obtained a copy of the grep 2.2 subject program from
Fig. 10 plots the localization quality for both SOBER and the “Subject Infrastructure Repository” (SIR) [29]. The
LIBLIT05 with m equal to 1, 3, 5, and 10, respectively. original code of grep 2.2 has 11,826 lines of C code, as
Curves for CT and SOBER_FULL are also plotted as the counted by the tool SLOCCount [30], while the announced
baseline and ceiling quality in each subfigure. Among the size of the modified version at SIR is 15,633 LOC. A test
four subfigures, Fig. 10a represents the toughest situation, suite of 470 test cases is available at SIR for the program. We
where only one failing case is identified in each faulty tried out all the seeded faults provided by SIR, but found no
version. This simulates a typical scenario where a developer fault incurred failures on the accompanying test suite. We
starts debugging once a faulty execution is encountered. As therefore manually injected two faults in the source code, as
expected, the quality of SOBER degrades considerably from shown in Fig. 11 and Fig. 12, respectively.
SOBER_FULL, but it is still better than CT. The first fault (shown in Fig. 11) is an “off-by-one”
We note that the m ¼ 1 situation is at least as harsh as the error: an expression “+1” is appended to line 553 in the
situation with 0.1 percent-sampled test suites, as shown in grep.c file. This fault causes failures in 48 of the 470 test
Fig. 8a. Nevertheless, at least one failing run is in every cases. The second fault (in Fig. 12) is a “subclause-
0.1 percent-sampled test suite. In order to demonstrate the missing” error. The subclause ðlcp½i ¼¼ rcp ½iÞ is
effect of treating Tu as passing, we replot the curve of SOBER commented out at line 2270 in file dfa.c. The fault incurs
with  ¼ 0:1% in Fig. 10a with a dashed line. The another 88 failing cases.
remarkable gap between “SOBER” and “SOBER, 0.1%” Although these two faults are manually injected, they do
suggests the benefit of treating unlabeled cases as passing. mimic realistic logic errors. Logic errors like “off-by-one” or
The four subfigures of Fig. 10, viewed in a sequence, “subclause-missing” may sneak in when developers are
show that the quality of SOBER gradually improves as handling obscure corner conditions. Because logic errors,
additional failing cases are explicitly labeled. Intuitively, the like these two, do not generally incur program crashes, they
more failing cases that are identified, the more accurately are usually harder to debug than those causing program
the statistic Y (4) approaches to the true faulty behavior of crashes. In the following, we illustrate how SOBER helps
predicate P and, hence, the higher quality of the final developers find these two faults.
predicate ranking list. LIBLIT05 also improves for a similar We first instrument the source code. According to the
reason. instrumentation schema described in Section 4.1, grep 2.2 is
instrumented with 1,732 branch and 1,404 return predicates.
5.3 Summary The first run of SOBER with the 136 failing (due to the two
In this section, we empirically examined how SOBER works faults) and the remaining 334 passing cases produces a
in an imperfect world, where either the test suite is predicate ranking, whose top three predicates are listed in
inadequate or only a limited number of failing executions Table 2. For easy reference, the three predicates are also
are explicitly identified. The experiment demonstrates the marked at their instrumented locations in Fig. 11 and
robustness of SOBER under these harsh conditions. In Fig. 12.
addition, the scheme of tagging all unlabeled cases as As we can see, the predicates P1 and P2 point to the
passing is shown effective in leveraging SOBER’s quality. faulty function for the first fault. The predicate P1 is four
14 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

TABLE 2
Top Three Predicates from the First Run of SOBER

Fig. 13. First Fault in bc 1.06, in storage.c.

Similarly, we find that P2 is considerably evaluated as true


Fig. 11. Fault 1: An off-by-one error in grep.c. in failing cases, but mostly false in passing cases.
We notice that when P2 evaluates true, the variable
lastout is reset to 0, which immediately causes P1 to
evaluate as false in the next iteration. This explains why
predicates P1 and P2 are both ranked at the top. In order to
find why “beg ! ¼ lastout” tends to evaluate to true in
failing cases, a developer would pay attention to the
assignment to variables beg and lastout. Within the
for loop from lines 541 through 580, there are no other
assignments to lastout except lines 549 and 575. Then, the
developer would examine lines 553, 566, and 571, where
beg gets assigned. A developer familiar with the code will
then identify the fault.
After fixing the first fault, a second run of SOBER with
the 88 failing and 382 passing cases puts P3 at the top. A
developer paying more attention to line 2270 of the dfa.c file
would find the fault, as P3 points to the exact fault location.
Because SOBER is only for fault localization, it is the
developer’s responsibility to confirm the fault location and
fix it. To the best of our knowledge, no tools can
automatically suggest patches for logic errors without
assuming any specifications.
Fig. 12. Fault 2: A subclause-missing error in dfa.c.
6.2 A Case Study with bc 1.06
lines above the real fault location. The predicate P3 , on the In this section, we report a case study of SOBER with a real-
other hand, points directly to the exact location of the world program bc 1.06, on which SOBER identifies two
second fault. Now, let us explore how these top predicates buffer overflow faults, one of which has never been
reported before.
help developers locate the faults.
bc is a calculator program that accepts scripts written in
Given the top-ranked predicates, it is natural to ask why
the bc language, which supports arbitrary precision calcula-
they are ranked high. We find that the sample mean and tions. The 1.06 version of the bc program is shipped with
standard deviation of the evaluation bias of P1 (denoted by most recent UNIX/Linux distributions. It has 14,288 LOC,
ðP1 Þ) are 0.90 and 0.25 in the 136 failing cases but are 0.99 and a buffer overflow fault has been reported in [7], [12].
and 0.065 in the remaining 344 passing cases. This suggests This experiment was conducted on a 3.06 GHz Pentium-4
that P1 is mostly evaluated true in passing cases with a PC running Linux RedHat 9 with gcc 3.3.3. Inputs to bc
small variance but is mostly evaluated false in some failing 1.06 are 4,000 valid bc programs that are randomly
generated with various size and complexity. We generate
cases, as indicated by its much larger variance in failing
each input program in two steps: First, a random syntax tree
cases. By examining ðP1 Þ in the failing cases, we find that is generated in compliance with the bc language specifica-
ðP1 Þ is smaller than 0.1 in five failing cases. Therefore, we tion; second, a program is derived from the syntax tree.
know that P1 can be evaluated mostly as false in these With the aid of SOBER, we quickly identify two faults in bc
failing cases, whereas it is mostly true in passing cases. 1.06, including one that has not been reported. Among the
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 15

cases [20]. Readers interested in the details of invariants are


referred to the project DAIKON [16]. The DIDUCE project
[14] monitors a more restricted set of predicates and relaxes
them in a similar manner to DAIKON at runtime. After the
set of predicates becomes stable, the DIDUCE tool relates
future violations as indications of potential faults. This
approach is demonstrated to be effective on four large
software systems. However, as invariants are a special kind
of predicates that hold in all passing executions, they may
not be effective in locating subtle faults as suggested by
Pytlik et al. in [20]. In comparison, the probabilistic
treatment of predicates implemented by SOBER naturally
relaxes this requirement and is shown to achieve much
Fig. 14. Second fault in bc 1.06, in storage.c. better localization results on the Siemens suite.
Contrasts based on program slicing [34] and dicing [35]
4,000 input cases , the bc 1.06 program fails 521 of them. After are also shown effective for fault localization. For example,
running through these test cases, the analysis from SOBER Agrawal et al. [33] present a fault localization technique,
reports “indx < old count” as the most fault-relevant implemented as slice, which is based on the execution
traces of test cases. This technique displays and contrasts
predicate. This predicate points to the variable old_ count
the dices of one failing case to those of multiple passing
in line 137 of storage.c (shown in Fig. 13). A quick scan of the
cases. Jones et al. [23] describe a similar approach
code shows that old_count copies its value from v_count.
implemented as TARANTULA. Unlike -slice, TARANTULA
By putting a watch on v_count, we find that v_count is
collects the testing information from all passing and failing
overwritten when a buffer named genstr overflows (in bc.y,
cases and colors suspicious statements based on the
line 306). The buffer genstr is 80 bytes long and is used to
contrast. Later, Renieris and Reiss [4] find that the contrast
hold bytecode characters. An input containing complex and
renders better fault localization when the given failing case
relatively large functions can easily overflow it. To the best of
is contrasted with the most similar passing case (i.e., the
our knowledge, this fault has not been reported before. We
nearest neighbor). In comparison, SOBER collects the
manually examine the statistics of the top-ranked predicate evaluation frequency of instrumented predicates, a much
and find that its evaluation bias in correct and incorrect richer information base, and quantifies the model difference
executions is 0.0274 and 0.9423, respectively, which intui- through a statistical approach.
tively explains why SOBER works. LIBLIT05 also ranks the While all the fault localization algorithms examined in
same predicate at the top. this paper are designed for programming professionals,
After fixing the above fault, a second run of SOBER (3,303 recent years have also witnessed an emergence of fault
correct and 697 incorrect cases) generates a fault report with localization algorithms especially tuned to assist end users
the top predicate “a count < v count,” which points to in fault diagnosis. For example, Ayalew and Mittermeir
line 176 of storage.c (shown in Fig. 14). This is likely a copy- propose a technique to trace faults in spreadsheets based on
paste error where a_count should have been used in the “interval testing” and slicing [36]. Ruthruff et al. improve
position of v_count. This fault has been reported in this approach by allowing end-users to interactively adjust
previous studies [7], [12]. their feedbacks [37]. The Whyline prototype realizes a new
As a final note, predicates identified by SOBER for these debugging paradigm called “interrogative debugging,”
two faults are far from the actual crashing points. This which allows users to ask why did and why didn’t questions
suggests that SOBER picks up predicates that characterize about runtime failures [38].
the scenario under which faults are triggered, rather than The power of statistical analysis is demonstrated in
the crashing venues. program analysis and fault detection. Dickinson et al. find
program failures through clustering program execution
profiles [39]. Their subsequent work [40] first performs
7 DISCUSSION feature selection using logistic regression and then clusters
7.1 Related Work failure reports within the space of selected features. The
In this section, we briefly review previous work related to clustering results are shown to be useful in prioritizing
fault detection in general. Static analysis techniques have software faults. Early work of Liblit et al. on statistical
been used to verify program correctness against a well- debugging [7] also adopts logistic regression in sifting
specified program model [1], [31] and to check real codes predicates that are correlated with program crashes. In
directly for Java [2] and C/C++ programs [3]. Engler et al. addition, they impose L1 norm regularization during the
[32] further show that the correctness rules sometimes can regression so that predicates that are really correlated are
distinguished. In comparison, our method SOBER is a
be automatically inferred from source code, hence saving, to
statistical model-based approach, while the above statistical
some extent, the cost of preparing specifications. Comple-
methods follow the principle of discriminant analysis.
mentary to static analysis, dynamic analysis focuses more Specifically, SOBER features a hypothesis testing-based
on the runtime behavior and often assumes fewer specifica- approach, which has not been seen in the fault localization
tions. SOBER belongs to the category of dynamic analysis. literature.
Within dynamic analysis, most fault localization techni-
ques are based on the contrast between failing and passing 7.2 Threats to Validity
cases [4], [5], [6], [7], [8], [12], [20], [21], [33]. For example, Like any empirical study, threats to validity should be
invariants that are formed from passing cases can suggest considered in interpreting the experimental results pre-
potential fault locations if they are violated in any failing sented in this paper. Specifically, the results obtained
16 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 10, OCTOBER 2006

with the Siemens suite cannot be generalized to arbitrary Inc. offered the authors a free copy of CODESURFER. Last
programs. However, we expect that on larger programs but not the least, the authors deeply appreciate the
with greater separation of concerns, most fault localiza-
tion techniques will do better. This expectation is insightful questions, comments, and suggestions from
supported by existing studies with CT, LIBLIT05, and anonymous referees, which proved invaluable during the
TARANTULA [6], [8], [12], as well as the experiments in preparation of this paper.
Section 6 in this study.
Threats to construct validity concern the appropriateness
of the quality metric for fault localization results. In this REFERENCES
paper, we adopt the PDG-based T -score, which was proposed [1] E. Clarke, O. Grumberg, and D. Peled, Model Checking. MIT Press,
by Renieris and Reiss [4]. Although this evaluation frame- 1999.
work involves no subjective judgments, it is by no means a [2] W. Visser, K. Havelund, G. Brat, and S. Park, “Model Checking
Programs,” Proc. 15th IEEE Int’l Conf. Automated Software Eng.
comprehensively fair metric. For instance, this measure does (ASE ’00), pp. 3-12, 2000.
not take into account how easily a developer can make sense [3] M. Musuvathi, D. Park, A. Chou, D. Engler, and D. Dill, “CMC: A
of the fault localization report. Recent work [6] also identifies Pragmatic Approach to Model Checking Real Code,” Proc. Fifth
some other limitations of this measurement. In previous Symp. Operating System Design and Implementation (OSDI ’02),
pp. 75-88, 2002.
work, a ranking-based T -score is used to evaluate the [4] M. Renieris and S. Reiss, “Fault Localization with Nearest
effectiveness of TARANTULA. Although both forms of Neighbor Queries,” Proc. 18th IEEE Int’l Conf. Automated Software
T -score estimate the human efforts needed to locate the fault, Eng. (ASE ’03), pp. 30-39, 2003.
it is yet unclear whether they agree. The comparison of [5] A. Zeller, “Isolating Cause-Effect Chains from Computer Pro-
grams,” Proc. ACM Int’l Symp. Foundations of Software Eng.
TARANTULA with other algorithms in Section 4.4 assumes (FSE ’02), pp. 1-10, 2002.
the equivalence between the two forms. More extensive [6] H. Cleve and A. Zeller, “Locating Causes of Program Failures,”
studies are needed to clarify this issue. Proc. 27th Int’l Conf. Software Eng. (ICSE ’05), pp. 342-351, 2005.
Finally, threats to internal validity concern the experi- [7] B. Liblit, A. Aiken, A. Zheng, and M. Jordan, “Bug Isolation via
ments of SOBER with the programs grep 2.2 and bc 1.06, Remote Program Sampling,” Proc. ACM SIGPLAN 2003 Int’l Conf.
Programming Language Design and Implementation (PLDI ’03),
discussed in Section 6. Specifically, the two logic errors in grep pp. 141-154, 2003.
2.2 are injected by us. However, because these two logic [8] J. Jones and M. Harrold, “Empirical Evaluation of the Tarantula
errors do not incur segmentation faults, they are generally Automatic Fault-Localization Technique,” Proc. 20th IEEE/ACM
harder to debug, even for human developers. In contrast, case Int’l Conf. Automated Software Eng. (ASE ’05), pp. 273-282, 2005.
[9] N. Gupta, H. He, X. Zhang, and R. Gupta, “Locating Faulty Code
studies in previous work target crashing faults [5], [6], [7], Using Failure-Inducing Chops,” Proc. 20th IEEE/ACM Int’l Conf.
[12]. Therefore, the experiment with grep 2.2 demonstrates Automated Software Eng. (ASE ’05), pp. 263-272, 2005.
the effectiveness of SOBER on large programs with logic [10] I. Vessey, “Expertise in Debugging Computer Programs,” Int’l J.
errors. In order to minimize the threats to external validity Man-Machine Studies: A Process Analysis, vol. 23, no. 5, pp. 459-494,
1985.
about experiments with large programs, a case study with bc
[11] M. Harrold, G. Rothermel, K. Sayre, R. Wu, and L. Yi, “An
1.06 is also presented, which illustrates the effectiveness of Empirical Investigation of the Relationship between Spectra
SOBER on real faults. However, two experiments are still Differences and Regression Faults,” Software Testing, Verification,
insufficient to make claims about the general effectiveness of and Reliability, vol. 10, no. 3, pp. 171-194, 2000.
SOBER on large programs. Ultimately, all fault localization [12] B. Liblit, M. Naik, A. Zheng, A. Aiken, and M. Jordan, “Scalable
Statistical Bug Isolation,” Proc. ACM SIGPLAN 2005 Int’l Conf.
algorithms should be subjected to real practice, and evaluated Programming Language Design and Implementation (PLDI ’05),
by end users. pp. 15-26, 2005.
[13] Y. Brun and M. Ernst, “Finding Latent Code Errors via Machine
Learning over Program Executions,” Proc. 26th Int’l Conf. Software
8 CONCLUSIONS Eng. (ICSE ’04), pp. 480-490, 2004.
[14] S. Hangal and M. Lam, “Tracking down Software Bugs Using
In this paper, we propose a statistical approach to localize Automatic Anomaly Detection,” Proc. 24th Int. Conf. Software Eng.
software faults without prior knowledge of program (ICSE ’02), pp. 291-301, 2002.
semantics. This approach tackles the limitations of previous [15] G. Casella and R. Berger, Statistical Inference, second ed., Duxbury,
2001.
methods in modeling the divergence of predicate evalua- [16] M. Ernst, J. Cockrell, W. Griswold, and D. Notkin, “Dynamically
tions between correct and incorrect executions. A systema- Discovering Likely Program Invariants to Support Program
tic evaluation with the Siemens suite, together with two Evolution,” IEEE Trans. Software Eng., vol. 27, no. 2, pp. 1-25,
Feb. 2001.
case studies with grep 2.2 and bc 1.06, clearly demonstrates [17] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand, “Experiments
the advantages of our method in fault localization. We also of the Effectiveness of Dataflow- and Controlflow-Based Test
simulate an “imperfect world” to investigate SOBER’s Adequacy Criteria,” Proc. 16th Int’l Conf. Software Eng. (ICSE’94),
robustness to the harsh scenarios that may be encountered pp. 191-200, 1994.
[18] G. Rothermel and M. Harrold, “Empirical Studies of a Safe
in practice. The experimental result favorably supports Regression Test Selection Technique,” IEEE Trans. Software Eng.,
SOBER’s applicability. vol. 24, no. 6, pp. 401-419, June 1998.
[19] T. Cover and J. Thomas, Elements of Information Theory, first ed.
Wiley-Interscience, 1991.
ACKNOWLEDGMENTS [20] B. Pytlik, M. Renieris, S. Krishnamurthi, and S. Reiss, “Automated
Fault Localization Using Potential Invariants,” Proc. Fifth Int’l
The authors would like to thank Gregg Rothermel for Workshop Automated and Algorithmic Debugging (AADEBUG ’03),
making the Siemens program suite available. Darko pp. 273-276, 2003.
Marinov provided the authors with insightful suggestions. [21] C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff, “Sober: Statistical
Model-Based Bug Localization,” Proc. 10th European Software Eng.
Andreas Zeller, Holger Cleve and Manos Reneris gener- Conf./13th ACM SIGSOFT Int’l Symp. Foundations of Software Eng.
ously shared their evaluation frameworks. GrammaTech (ESEC/FSE ’05), pp. 286-295, 2005.
LIU ET AL.: STATISTICAL DEBUGGING: A HYPOTHESIS TESTING-BASED APPROACH 17

[22] T. Zimmermann and A. Zeller, “Visualizing Memory Graphs,” Long Fei received the BS degree in computer
Revised Lectures on Software Visualization, Int’l Seminar, pp. 191-204, Science from Fudan University, China, and the
2002. MS degree in electrical and computer engineer-
[23] J. Jones, M. Harrold, and J. Stasko, “Visualization of Test ing from Purdue University. He is currently a
Information to Assist Fault Localization,” Proc. 24th Int’l Conf. PhD student in the School of Electrical and
Software Eng. (ICSE ’02), pp. 467-477, 2002. Computer Engineering at Purdue University. His
[24] A. Zeller and R. Hildebrandt, “Simplifying and Isolating Failure- research interests are compilers and using
Inducing Input,” IEEE Trans. Software Eng., vol. 28, no. 2, pp. 183- compiler techniques for software debugging.
200, Feb. 2002. He is a member of the IEEE.
[25] J. Misurda, J. Clause, J. Reed, B. Childers, and M. Soffa, “Jazz: A
Tool for Demand-Driven Structural Testing,” Proc. 14th Int’l Conf.
Compiler Construction (CC ’05), pp. 242-245, 2005.
[26] C. Pacheco and M. Ernst, “Eclat: Automatic Generation and Xifeng Yan received the BE degree from the
Classification of Test Inputs,” Proc. 19th European Conf. Object- Computer Engineering Department of Zhejiang
Oriented Programming (ECOOP ’05), pp. 504-527, 2005. University, China, in 1997, the MSc degree in
[27] C. Boyapati, S. Khurshid, and D. Marinov, “Korat: Automated computer science from the University of New
Testing Based on Java Predicates,” Proc. ACM/SIGSOFT Int’l Symp. York at Stony Brook in 2001, and the PhD
Software Testing and Analysis (ISSTA ’02), pp. 123-133, 2002. degree in computer science from the University
[28] C. Csallner and Y. Smaragdakis, “JCrasher: An Automatic of Illinois at Urbana-Champaign in 2006. He is a
Robustness Tester for Java,” Software—Practice and Experience, research staff member at the IBM T.J. Watson
vol. 34, no. 11, pp. 1025-1050, 2004. Research Center. His area of expertise is data
[29] H. Do, S. Elbaum, and G. Rothermel, “Supporting Controlled mining, with an emphasis on mining and search
Experimentation with Testing Techniques: An Infrastructure and of graph and network data. His current research is focused on data
its Potential Impact,” Empirical Software Eng.: An Int’l J., vol. 10, mining foundations, pattern post analysis, social, biological and Web
no. 4, pp. 405-435, 2005. data mining, and data mining in software engineering and computer
[30] D. Wheeler, SLOCCount: A Set of Tools for Counting Physical systems. He has published more than 30 papers in reputed journals and
Source Lines of Code, http://www.dwheeler.com/sloccount/, conferences, such as the ACM Transactions on Database Systems, the
2006. ACM SIGMOD Conference on Management of Databases, the ACM
[31] K. Apt and E. Olderog, Verification of Sequential and Concurrent SIGKDD International Conference on Knowledge Discovery and Data
Programs, second ed. Springer-Verlag, 1997. Mining, the Very Large Database Conference, the Conference on
[32] D. Engler, D. Chen, and A. Chou, “Bugs as Inconsistent Behavior: Intelligent Systems for Molecular Biology, the International Conference
A General Approach to Inferring Errors in Systems Code,” Symp. on Data Engineering, and the Foundations of Software Engineering
Operating Systems Principles, pp. 57-72, 2001. Conference. He is a member of the IEEE.
[33] H. Agrawal, J. Horgan, S. London, and W. Wong, “Fault
Localization Using Execution Slices and Dataflow Tests,” Proc. Jiawei Han is a professor in the Department of
Sixth Int’l Symp. Software Reliability Eng., pp. 143-151, 1995.
Computer Science at the University of Illinois at
[34] F. Tip, “A Survey of Program Slicing Techniques,” J. Programming
Urbana-Champaign. He has been working on
Languages, vol. 3, pp. 121-189, 1995. research into data mining, data warehousing,
[35] J. Lyle and M. Weiser, “Automatic Program Bug Location by stream data mining, spatiotemporal and multi-
Program Slicing,” Proc. Second Int’l Conf. Computers and Applica- media data mining, biological data mining, social
tions, pp. 877-882, 1987. network analysis, text and Web mining, and
[36] Y. Ayalew and R. Mittermeir, “Spreadsheet Debugging,” Proc. software bug mining, with over 300 conference
European Spreadsheet Risks Interest Group Ann. Conf., 2003.
and journal publications. He has chaired or
[37] J. Ruthruff, M. Burnett, and G. Rothermel, “An Empirical Study of
served on many program committees of inter-
Fault Localization for End-User Programmers,” Proc. 27th Int’l national conferences and workshops. He also served or is serving on the
Conf. Software Eng. (ICSE ’05), pp. 352-361, 2005. editorial boards for Data Mining and Knowledge Discovery, the IEEE
[38] A. Ko and B. Myers, “Designing the Whyline: A Debugging Transactions on Knowledge and Data Engineering, the Journal of
Interface for Asking Questions about Program Behavior,” Proc. Computer Science and Technology, and the Journal of Intelligent
SIGCHI Conf. Human Factors in Computing Systems (CHI ’04), pp.
Information Systems. He is currently serving as founding editor-in-chief
151-158, 2004.
of the ACM Transactions on Knowledge Discovery from Data and on the
[39] W. Dickinson, D. Leon, and A. Podgurski, “Finding Failures by board of directors for the executive committee of the ACM Special
Cluster Analysis of Execution Profiles,” Proc. 23rd Int’l Conf. Interest Group on Knowledge Discovery and Data Mining (SIGKDD).
Software Eng. (ICSE ’01), pp. 339-348, 2001. Jiawei is an ACM fellow and an IEEE senior member. He has received
[40] A. Podgurski, D. Leon, P. Francis, W. Masri, M. Minch, J. Sun, and many awards and recognitions, including the ACM SIGKDD Innovation
B. Wang, “Automated Support for Classifying Software Failure
Award (2004) and the IEEE Computer Society Technical Achievement
Reports,” Proc. 25th Int’l Conf. Software Eng. (ICSE ’03), pp. 465-475,
Award (2005).
2003.

Chao Liu received the BS degree in computer Samuel P. Midkiff received the PhD degree in
science from Peking University, China, in 2003, 1992 from the University of Illinois at Urbana-
and the MS degree in computer science from the Champaign, where he was a member of the
University of Illinois at Urbana-Champaign in Cedar project. In 1991, he became a research
2005. He is currently a PhD student in the staff member at the IBM T.J. Watson Research
Department of Computer Science at the Uni- Center, where he was a key member of the
versity of Illinois at Urbana-Champaign. His xlHPF compiler team and the Ninja project. He
research focus is on developing statistical data has been an associate professor of computer
mining algorithms to improve software reliability, and electrical engineering at Purdue University
with an emphasis on statistical debugging and since 2002. His research has focused on
automated program failure diagnosis. Since 2003, he has more than 10 parallelism, high performance computing, and, in particular, software
publications in refereed conferences and journals, such as the ACM support for the development of correct and efficient programs. To this
SIGKDD International Conference on Knowledge Discovery and Data end, his research has covered dependence analysis and automatic
Mining, the International World Wide Web Conference, the European synchronization of explicitly parallel programs, compilation under
Software Engineering Conference, the ACM SIGSOFT Symposium on different memory models, automatic parallelization, high performance
the Foundations of Software Engineering, and the IEEE Transactions on computing in Java and other high-level languages, and tools to help in
Software Engineering. He is a member of the IEEE. the detection and localization of program errors. Professor Midkiff has
over 50 refereed publications. He is a member of the IEEE.

You might also like