Statistical Debugging: A Hypothesis Testing-Based Approach
Abstract—Manual debugging is tedious, as well as costly. The high cost has motivated the development of fault localization techniques,
which help developers search for fault locations. In this paper, we propose a new statistical method, called SOBER, which automatically
localizes software faults without any prior knowledge of the program semantics. Unlike existing statistical approaches that select
predicates correlated with program failures, SOBER models the predicate evaluation in both correct and incorrect executions and regards
a predicate as fault-relevant if its evaluation pattern in incorrect executions significantly diverges from that in correct ones. Featuring a
rationale similar to that of hypothesis testing, SOBER quantifies the fault relevance of each predicate in a principled way. We
systematically evaluate SOBER under the same setting as previous studies. The results clearly demonstrate its effectiveness: SOBER
helps developers locate 68 of the 130 faults in the Siemens suite when no more than 10 percent of the code is examined, whereas the
Cause Transition approach proposed by Cleve and Zeller [6] and the statistical approach by Liblit et al. [12] locate 34 and 52 faults,
respectively. Moreover, the effectiveness of SOBER is also evaluated in an “imperfect world,” where the test suite is either inadequate or
only partially labeled. The experiments indicate that SOBER could achieve competitive quality under these harsh circumstances. Two
case studies with grep 2.2 and bc 1.06 are reported, which shed light on the applicability of SOBER on reasonably large programs.
• C. Liu, X. Yan, and J. Han are with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: {chaoliu, xyan, hanj}@cs.uiuc.edu.
• L. Fei and S.P. Midkiff are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907. E-mail: {lfei, smidkiff}@purdue.edu.

1 INTRODUCTION

THE last decade has witnessed great advances in fault localization techniques [1], [2], [3], [4], [5], [6], [7], [8], [9]. These techniques aim to assist developers in finding fault locations, which is one of the most expensive debugging activities [10]. Fault localization techniques can be roughly classified as static or dynamic. A static analysis detects program defects by checking the source code with or without referring to a well-specified program model [1], [2], [3]. A dynamic analysis, on the other hand, typically tries to locate defects by contrasting the runtime behavior of correct and incorrect executions. Dynamic techniques usually do not assume any prior knowledge of program semantics other than the labeling of each execution as either correct or incorrect. Previous studies deploy a variety of program runtime behaviors for fault localization, such as program spectra [11], [4], memory graphs [5], [6], and program predicate evaluation history [7], [12].

Within dynamic analyses, techniques based on predicate evaluations have been shown to be promising for fault localization [13], [14], [7], [12]. Programs are first instrumented with predicates such that the runtime behavior in each execution is encoded through predicate evaluations. Consider the predicate "idx < LENGTH," where the variable idx is an index into a buffer of length LENGTH. This predicate checks whether accesses to the buffer ever exceed the upper bound. Statistics on the evaluations of predicates are collected over multiple executions at runtime and analyzed afterward.

The method described in this paper shares the principle of predicate-based dynamic analysis. However, by exploring detailed statistics about predicate evaluation, our method can detect more and subtler faults than the state-of-the-art statistical debugging approach proposed by Liblit et al. [12]. For easy reference, we denote this method as LIBLIT05. For each predicate P in a program, LIBLIT05 estimates two conditional probabilities:

  Pr_1 = Pr(the program fails | P is ever observed)

and

  Pr_2 = Pr(the program fails | P is ever observed as true).

It then treats the probability difference Pr_2 − Pr_1 as an indicator of how relevant P is to the fault. Therefore, LIBLIT05 essentially regards a predicate as fault-relevant if its true evaluation correlates with program failures.

While LIBLIT05 succeeded in isolating faults in some widely used software [12], it has a potential problem in its ranking model. Because LIBLIT05 only considers whether a predicate has ever been evaluated as true or not in each execution, it loses its power to discriminate when a predicate P is observed as true at least once in all executions. In this case, Pr_1 is equal to Pr_2, which suggests that the predicate P has no relevance to the fault. In Section 2, we will present an example where the most fault-relevant predicate reveals only a small difference between Pr_1 and Pr_2. We found that similar cases are not rare in practice, as suggested by the experiments in Section 4.

The above issue motivates us to develop a new approach that can exploit multiple evaluations of a predicate within one execution.
Fig. 2. Branching actions in (a) P and (b) P̂.
Fig. 3. (a) A correct and (b) an incorrect execution in P.
… of P is touched at least once (i.e., n_t + n_f ≠ 0), π(P) varies in the range [0, 1]: π(P) is equal to 1 if P always holds, to 0 if it never holds, and in between for all other sets of outcomes. If the predicate is never evaluated, π(P) has a singularity 0/0. In this case, since we have no evidence to favor either true or false, we set π(P) to 0.5 for fairness. Finally, if a predicate is never evaluated in any failing runs, it has nothing to do with program failures and is hence eliminated from the predicate ranking.

3.3 Methodology Overview

We formulate the main idea of our method in this section and then develop its details in Section 3.4. Following the convention in statistics, we use uppercase letters for random variables and the corresponding lowercase letters for their realizations. Moreover, f(X|θ) is a general notation of the probability model for the random variable X that is indexed by the parameter θ.

Let the entire test case space be 𝒯, which conceptually contains all the possible inputs and expected outputs. According to the correctness of the program on the test cases in 𝒯, 𝒯 can be partitioned into two disjoint sets 𝒯_p and 𝒯_f for passing and failing cases. Therefore, the available test suite T and its partitions T_p and T_f can be treated as a random sample from 𝒯, 𝒯_p, and 𝒯_f, respectively. Let X be the random variable for the evaluation bias of predicate P. We then use f_P(X|θ_p) and f_P(X|θ_f) to denote the statistical model for the evaluation bias of P in 𝒯_p and 𝒯_f, respectively. Therefore, the evaluation bias from running a test case t can be treated as an observation from f_P(X|θ), where θ is either θ_p or θ_f depending on whether t is passing or failing. Given the statistical models for both passing and failing runs, we then define the fault relevance of P as follows:

Definition 2 (Fault Relevance). A predicate P is relevant to the hidden fault if its underlying model f_P(X|θ_f) diverges from f_P(X|θ_p), where X is the random variable for the evaluation bias of P.

The above definition relates f_P(X|θ), the statistical model for P's evaluation bias, to the hidden fault. Naturally, the larger the difference between f_P(X|θ_f) and f_P(X|θ_p), the more relevant P is to the fault. Let L(P) be an arbitrary similarity function,

  L(P) = Sim(f(X|θ_p), f(X|θ_f)).    (1)

The ranking score s(P) can be defined as g(L(P)), where g(x) can be any monotonically decreasing function. We here choose g(x) = −log(x) because −log(x) effectively measures the relative magnitude even when the values of x are close to 0 (certainly, x must be positive). Therefore, the fault relevance score s(P) is defined as

  s(P) = −log(L(P)).    (2)

3.4 Predicate Ranking

The lack of prior knowledge about f_P(X|θ) constitutes one of the major obstacles in calculating the similarity (or difference, equivalently) between f_P(X|θ_p) and f_P(X|θ_f). If the closed form of f_P(X|θ_p) and f_P(X|θ_f) were given, measures used in information theory [19], such as the relative entropy, would immediately apply. Meanwhile, we are not authorized to impose model assumptions, like normality, on f_P(X) because improper assumptions can lead to misleading inferences. Therefore, given the above difficulties in directly measuring the model difference, in this paper we propose an indirect approach that measures the difference between f_P(X|θ_p) and f_P(X|θ_f) without any model assumption.

Aiming at the model difference, we first propose the null hypothesis H_0: f_P(X|θ_p) = f_P(X|θ_f), i.e., there is no difference between the two models. Letting X = (X_1, X_2, …, X_m) be a random sample from f_P(X|θ_f) (i.e., the observed evaluation bias from m failing cases), we derive a statistic Y, which, under the null hypothesis H_0, conforms to a known distribution. If the realized statistic Y(X) corresponds to an event that has a small likelihood of happening, the null hypothesis H_0 is likely invalid, which suggests that a nontrivial difference exists between f_P(X|θ_p) and f_P(X|θ_f).

We choose to characterize f_P(X|θ) through its population mean μ and variance σ², so that the null hypothesis H_0 is

  μ_p = μ_f and σ_p² = σ_f².    (3)

Let X = (X_1, X_2, …, X_m) be an independent and identically distributed (i.i.d.) random sample from f_P(X|θ_f). Under the null hypothesis, we have E(X_i) = μ_f = μ_p and Var(X_i) = σ_f² = σ_p². Because X_i ∈ [0, 1], both E(X_i) and Var(X_i) are finite. According to the Central Limit Theorem [15], the statistic

  Y = ( Σ_{i=1}^{m} X_i ) / m    (4)

asymptotically conforms to N(μ_p, σ_p²/m), a normal distribution with mean μ_p and variance σ_p²/m.

Let f(Y|μ_p) be the probability density function of the normal distribution N(μ_p, σ_p²/m). Then, the likelihood L(μ_p|Y) of μ_p given the observed Y is

  L(μ_p|Y) = f(Y|μ_p).    (5)

A smaller likelihood implies that H_0 is less likely to hold, which, in turn, indicates a larger difference between f_P(X|θ_p) and f_P(X|θ_f). Therefore, we can reasonably instantiate the similarity function in (1) with the likelihood function

  L(P) = L(μ_p|Y).    (6)

According to the property of the normal distribution, the normalized statistic

  Z = (Y − μ_p) / (σ_p/√m) = √m (Y − μ_p) / σ_p    (7)

asymptotically follows the standard normal distribution N(0, 1), so that the density in (5) can be written as

  f(Y|μ_p) = (√m / σ_p) · φ(Z),    (8)

where φ(·) is the probability density function of N(0, 1).
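Stepping back to the raw data, the observations X_i above are per-run evaluation biases. The following is a sketch of ours (names hypothetical) of how one such observation could be computed from the true/false counts of a predicate in a single execution, including the 0.5 convention for never-evaluated predicates.

```python
def evaluation_bias(n_t: int, n_f: int) -> float:
    """Evaluation bias pi(P) of one predicate in one execution.

    n_t -- number of times P evaluated to true in this run
    n_f -- number of times P evaluated to false in this run
    """
    if n_t + n_f == 0:
        # P was never evaluated: no evidence for true or false, use 0.5
        return 0.5
    return n_t / (n_t + n_f)

# One observation of X per execution; passing runs sample f_P(X|theta_p),
# failing runs sample f_P(X|theta_f).
passing_biases = [evaluation_bias(3, 7), evaluation_bias(2, 8)]   # ~0.3, ~0.2
failing_biases = [evaluation_bias(9, 1), evaluation_bias(10, 0)]  # ~0.9, 1.0
```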
Combining (2), (6), (5), and (8), we finally get the fault-relevance ranking score for predicate P as

  s(P) = −log(L(P)) = log( σ_p / (√m · φ(Z)) ).    (9)

3.5 Discussions on Score Computation

First, in order to calculate s(P) using (9), we need to estimate the population mean μ_p and the standard error σ_p of f_P(X|θ_p). Let X′ = (X′_1, X′_2, …, X′_n) be a random sample from f_P(X|θ_p) (which corresponds to the observed evaluation bias from the n passing runs); then μ_p and σ_p can be estimated as

  μ̂_p = X̄′ = (1/n) Σ_{i=1}^{n} X′_i    (10)

and

  σ̂_p = S_{X′} = sqrt( (1/(n−1)) Σ_{i=1}^{n} (X′_i − X̄′)² ).    (11)

Second, because the √m in (9) does not affect the relative order between predicates, it can be safely dropped in practice. However, as simple algebra would reveal, the m in (4) and the √m in (7) cannot be discarded, because they properly scale the statistics for standard normality as required by the Central Limit Theorem.

Finally, we note that although the derivation of (9) is based on asymptotic behavior, i.e., when m → +∞, statistical inference suggests that the asymptotic result is still valid even when the sample size is nowhere near infinity [15]. In the fault localization scenario, it is true that we cannot have an infinite number of failing cases. But as shown in the experiments, (9) still works well in ranking abnormal predicates even when only a small number of failing cases are available. We now use a concrete example to conclude the …

Without loss of generality, a predicate P is a program invariant on a test suite C if and only if it always evaluates true during the execution of C. In practice, the test suite C is usually chosen to be a set of passing cases so that the summarized invariants characterize the correct behavior of the subject program [16]. During the failing executions, these invariants are either conformed (i.e., still evaluate true) or violated (i.e., evaluate false at least once), and those violated invariants are regarded as hints for debugging. In some special cases, the test suite C is chosen to be a time interval during which the execution is believed to be correct. One typical example is that, for software that runs for a long time, such as Web servers, the execution is likely correct at the beginning of the execution [14].

According to Definition 1, the evaluation bias of an invariant is always 1. Taking the set of passing cases T_p as C, we know that, if the predicate P is an invariant, μ_p = 1 and σ_p = 0. Moreover, the following theorem proves that the fault relevance score function of (9) naturally identifies both invariant violations and conformations.

Theorem 1. Let P be any invariant summarized from a set of correct executions T_p. s(P) = +∞ if P is violated in at least one faulty execution, and s(P) = −∞ if P is conformed in all faulty executions.

Proof. Let x = (x_1, x_2, …, x_m) be a realized random sample, which corresponds to the observed evaluation biases from the m failing runs. Once P is violated in at least one execution, Σ_{i=1}^{m} x_i ≠ m. It then follows from (7) that

  z = c / σ_p,  where  c = ( Σ_{i=1}^{m} x_i − m·μ_p ) / √m ≠ 0,

and then

  lim_{σ_p → 0} σ_p / (√m · φ(z)) = lim_{σ_p → 0} √(2π/m) · σ_p / e^{−(1/2)(c/σ_p)²} = lim_{t → +∞} √(2π/m) · e^{c²t²/2} / t = +∞,

so, in the limit σ_p → 0, s(P) = log( σ_p / (√m · φ(z)) ) = +∞.
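Putting (9), (10), and (11) together, a minimal sketch of the scoring step might look as follows. This is our illustration, not the authors' Matlab code, and the function names are hypothetical; it takes the evaluation biases observed in passing and failing runs for one predicate and returns s(P), with larger scores meaning stronger divergence.

```python
import math
from typing import List

def sober_score(passing_bias: List[float], failing_bias: List[float]) -> float:
    """Fault-relevance score s(P) = log(sigma_p / (sqrt(m)*phi(Z))), eq. (9)."""
    n, m = len(passing_bias), len(failing_bias)
    mu_p = sum(passing_bias) / n                                    # eq. (10)
    sigma_p = math.sqrt(sum((x - mu_p) ** 2 for x in passing_bias)
                        / (n - 1))                                  # eq. (11)
    y = sum(failing_bias) / m                                       # eq. (4)
    if sigma_p == 0.0:
        # invariant case of Theorem 1: violated => +inf, conformed => -inf
        return math.inf if y != mu_p else -math.inf
    z = (y - mu_p) * math.sqrt(m) / sigma_p                         # eq. (7)
    # log(sigma_p) - log(sqrt(m)) - log(phi(z)), with phi the N(0,1) density,
    # expanded in log space so large |z| does not underflow.
    return (math.log(sigma_p) - 0.5 * math.log(m)
            + 0.5 * z * z + 0.5 * math.log(2.0 * math.pi))

# Toy usage: evaluation bias diverges between passing (~0.3) and failing (~0.9) runs.
print(sober_score([0.30, 0.28, 0.31, 0.29], [0.92, 0.88, 0.91]))
```

In a full pipeline, every instrumented predicate would be scored this way and the predicates reported in descending order of s(P).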
TABLE 1
Characteristics of Subject Programs

… in the Siemens suite [20]. As will be shown shortly, our method SOBER, being a superset of invariant-based methods, actually achieves the best fault localization results on the Siemens suite.

3.7 Differences between SOBER and LIBLIT05

Because both LIBLIT05 and SOBER are based on a statistical analysis of predicate evaluations, we now illustrate the differences in this section.

In principle, LIBLIT05 contrasts the probability that one execution crashes if the predicate P is ever observed true, and that if P is observed (either true or false) in the execution. Specifically, the authors define …

… For the predicate P: "(m >= 0) == true", Increase(P) = 0.0104, and the Increase value for the predicate P′: "(m >= 0) == false" is −0.0245. According to [12], neither P nor P′ is ranked on top since their Increase values are either negative or too small. Thus, LIBLIT05 fails to identify the fault. In comparison, SOBER successfully ranks the predicate P as the most suspicious predicate. Intuitively, this is because the evaluation bias in failing executions (0.9024) significantly diverges from that in passing ones (0.2952).
… later adopted by Cleve and Zeller in reporting the quality of CT [6]. We briefly summarize this measure as follows:

1. Given a (faulty) program, its program dependence graph (PDG) is written as G, where each statement is a vertex and there is an edge between two vertices if the two statements have data and/or control dependencies.

2. The vertices corresponding to faulty statements are marked as defect vertices. The set of defect vertices is written as V_defect.

3. Given a fault localization report R, which is a set of suspicious statements, their corresponding vertices are called blamed vertices. The set of blamed vertices is written as V_blamed.

4. A developer can start from V_blamed and perform a breadth-first search until he reaches one of the defect vertices. The set of statements covered by the breadth-first search is written as V_examined.

5. The T-score, defined as follows, measures the percentage of code that has been examined in order to reach the fault:

  T = ( |V_examined| / |V| ) × 100%,    (15)

where |V| is the size of the program dependence graph G. In [4], [6], the authors used 1 − T as an equivalent measure.

The T-score estimates the percentage of code a developer needs to examine (along the static dependencies) before the fault location is found, when a fault localization report is provided. A high quality fault localization is expected to be a small set of statements that are close to (or contain) the fault location. The above definition of T-score is immediately applicable to localizations that consist of a set of "blamed" statements. For algorithms that generate a ranked list of all predicates, like LIBLIT05 and SOBER, the corresponding statements of the top k predicates are taken as a fault localization report. The optimal k is the one that minimizes the average examined code over a set of faults under study, i.e.,

  k_opt = arg min_k E[T_k],    (16)

where E[T_k] is the average T-score for the given set of faults for any fixed k.

As the above defined T-score is calculated based on PDGs, we call it PDG-based. Recently, another kind of T-score was used by Jones and Harrold in reporting the localization results of TARANTULA [8]. The TARANTULA tool produces a ranking of all executable statements, and the authors calculate the T-score directly from the ranking. Instead of taking the top k statements and calculating the T-score based on PDGs, the authors examine whether the faulty statements are ranked high. Specifically, a developer is assumed to examine statement by statement from the top of the ranking until a faulty statement is touched. The percentage of statements examined by then is taken as the T-score. We call the T-score calculated in this way ranking-based. Apparently, the ranking-based T-score assumes a different code examination strategy than that assumed by the PDG-based one, i.e., along the ranking rather than along the dependencies. Intuitively, the PDG-based approach is closer to practice. Moreover, the ranking-based T-score is not as generally applicable as the PDG-based one, because it requires a ranking of all statements. For example, none of the discussed algorithms in Section 4.2, except TARANTULA, can be evaluated using the ranking-based approach, but TARANTULA can be evaluated by the PDG-based T-score by taking the top k statements as a fault localization report.

In this study, we compare SOBER with seven existing fault localization algorithms (described in the next section). Among them, we implemented LIBLIT05 in Matlab and validated the correctness of the implementation with the original authors. For the other six algorithms, the localization result on the Siemens suite is taken directly from their corresponding publications.

We instrumented the subject programs with two kinds of predicates: branches and function returns, which are described in detail in [7], [12]. In particular, we treat each branch conditional as one inseparable instrumentation unit, and do not consider each subclause separately. For better fault localization, one may be tempted to introduce more predicates. But the introduction of more predicates is a double-edged sword. On the positive side, an expanded set of predicates is more likely to cover the faulty code; but the superfluous predicates brought in can nontrivially complicate the predicate ranking. So far, no agreement has been reached on what the "golden predicates" are. At runtime, the evaluation of predicates is collected without sampling for both LIBLIT05 and SOBER.

All experiments in this section were carried out on a 3.2 GHz Intel Pentium 4 PC with 1 GB physical memory, running Fedora Core 2. In calculating the T-scores, we used CODESURFER 1.9 with patch 3 to generate the program dependence graphs. Because PDGs generated by CODESURFER may vary with different build options, the factory default (by enabling the factory-default switch) is used to allow reproducible results in the future. Moreover, the Matlab source code of SOBER and the instrumented Siemens suite are available online at http://www.ews.uiuc.edu/~chaoliu/sober.htm.

4.2 Compared Fault Localization Algorithms

We now briefly explain the seven fault localization algorithms we compare with SOBER. As LIBLIT05 is already discussed in Section 3.7, we only describe the other six algorithms below:

• Set-Union. This algorithm is based on the program spectra difference between a failing case f and a set of passing cases P. Specifically, let S(t) be the program spectrum of running the test case t. Then, the set difference between S(f) and the union spectra of the cases in P is taken as the fault localization report R, i.e., R = S(f) − ∪_{p_i ∈ P} S(p_i). This algorithm is described in [4], and we denote it by UNION for brevity.

• Set-Intersect. A complementary algorithm to UNION is also described in [4]. It is based on the set difference between the spectra of the failing case and the intersection spectra of the passing cases, namely, the localization report R = ∩_{p_i ∈ P} S(p_i) − S(f). We denote this algorithm by INTERSECT.
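A minimal sketch of the PDG-based T-score computation might look like the following (our illustration; the PDG here is just an adjacency map, and all names are hypothetical). It performs the breadth-first search of step 4 from the blamed vertices and applies (15).

```python
from collections import deque
from typing import Dict, List, Set

def pdg_t_score(pdg: Dict[int, List[int]], blamed: Set[int], defect: Set[int]) -> float:
    """PDG-based T-score (eq. 15): percentage of vertices examined by a
    breadth-first search that starts from the blamed vertices and stops
    once a defect vertex has been examined."""
    examined: Set[int] = set(blamed)
    queue = deque(blamed)
    while queue and not (examined & defect):
        v = queue.popleft()
        for w in pdg.get(v, []):            # follow data/control dependencies
            if w not in examined:
                examined.add(w)
                queue.append(w)
    return 100.0 * len(examined) / len(pdg)

# Toy PDG with 5 statements; the report blames vertex 0, the fault is vertex 3.
pdg = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(pdg_t_score(pdg, blamed={0}, defect={3}))  # 80.0
```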
Fig. 4. Located faults with regard to code examination. (a) Interval comparison. (b) Cumulative comparison.
• Nearest Neighbor. The nearest neighbor approach, proposed by Renieris and Reiss in [4], contrasts the failing case to the passing case that most "resembles" the failing case. Namely, the localization report R = S(f) − S(p), where p is the nearest passing case to f as measured under certain distance metrics. The authors studied two distance metrics and found that the nearest neighbor search based on the Ulam distance renders better fault localization. This algorithm is denoted as NN/PERM by the original authors.

• Cause Transition. The Cause Transition algorithm [6], denoted as CT, is an enhanced variant of Delta Debugging [5]. Delta Debugging contrasts the memory graph [22] of one failing execution, e_f, against that of one passing execution, e_p. By carefully manipulating the two memory graphs, Delta Debugging systematically narrows the difference between e_f and e_p down to a small set of suspicious variables. CT enhances Delta Debugging by exploiting the notion of cause transition: "moments where new relevant variables begin being failure causes" [6]. Therefore, CT essentially implements the concept of "search in time" in addition to the original "search in space" used in Delta Debugging.

• Tarantula. The TARANTULA tool was originally presented to visualize the test information for each statement in a subject program, and it was shown to be useful for fault localization [23]. In a recent study [8], the authors took (1 − hue(s)) as the fault relevance score for the statement s, where hue(s) is the hue component of each statement in the visualization [23]. With the fault relevance score calculated for each statement, TARANTULA produces a ranking of all executable statements. Developers are expected to examine the ranking from the top down to locate the fault.

• Failure-Inducing Chops. Gupta et al. recently propose a fault localization algorithm that integrates delta debugging and dynamic slicing [9]. First, a minimal failure-inducing input f′ is derived from the given failing case f using the algorithms of Zeller and Hildebrandt [24]. Then, a forward dynamic slice, FS, and a backward slice, BS, are calculated from f′ and the erroneous output, respectively. Finally, the intersection of FS and BS, i.e., the chop, is taken as the fault localization report, namely, R = FS ∩ BS. We denote this algorithm by SLICECHOP.

In previous studies, comparisons of some of the above algorithms are reported. Specifically, Renieris and Reiss found that NN/PERM outperformed both UNION and INTERSECT [4], whereas Cleve and Zeller later reported that a better result than NN/PERM was achieved by CT [6]. These reported results are all based on the PDG-based T-score. As CT achieves the best localization result as measured with the PDG-based T-score, we compare SOBER with CT and LIBLIT05 in Section 4.3 using the same measure. Because TARANTULA and SLICECHOP results are not reported with the PDG-based T-score, we compare SOBER with them separately in Section 4.4.

4.3 Comparison with LIBLIT05 and CT

In this section, we compare SOBER with CT and LIBLIT05. We subject both LIBLIT05 and SOBER to the 130 faults in the Siemens suite and measure their localization quality using the PDG-based T-score (15). The result of CT is directly cited from [6].

Fig. 4a depicts the number of faults that can be located when a certain percentage of code is examined by a developer. The x-axis is labeled with the T-score. For LIBLIT05 and SOBER, we choose the top five predicates to form the set of blamed vertices. Because localization that still requires developers to examine more than 20 percent of the code is generally useless, we treat only [0, 20] as the meaningful T-score range. Under these circumstances, SOBER is apparently better than LIBLIT05, while both of them are consistently superior to CT.

For practical use, it is instructive to know how many (or what percentage of) faults can be identified when no more than a given percentage of the code is examined. We therefore plot the cumulative comparison in Fig. 4b. It clearly suggests that both SOBER and LIBLIT05 are much better than CT and that SOBER outperforms LIBLIT05 consistently. Although LIBLIT05 catches up when the T-score is 60 percent or higher, we regard this advantage as irrelevant because it hardly makes sense for a fault locator to require a developer to examine more than 60 percent of the code.
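The spectra-based reports above (UNION, INTERSECT, and the nearest-neighbor report) reduce to simple set operations. A small sketch of ours, with statement-coverage spectra as Python sets and hypothetical names:

```python
from typing import List, Set

Spectrum = Set[int]  # set of statement ids covered by one execution

def union_report(fail: Spectrum, passes: List[Spectrum]) -> Spectrum:
    """UNION: statements covered by the failing run but by no passing run."""
    return fail - set().union(*passes)

def intersect_report(fail: Spectrum, passes: List[Spectrum]) -> Spectrum:
    """INTERSECT: statements covered by every passing run but not by the failing run."""
    return set.intersection(*passes) - fail

def nearest_neighbor_report(fail: Spectrum, nearest_pass: Spectrum) -> Spectrum:
    """NN/PERM-style report: difference to the most similar passing run."""
    return fail - nearest_pass

f = {1, 2, 3, 7}
p = [{1, 2, 3}, {1, 2, 4}]
print(union_report(f, p))      # {7}
print(intersect_report(f, p))  # set()
```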
Fig. 5. Quality comparison with regard to various top-k values. (a) Top one. (b) Top two. (c) Top three. (d) Top four. (e) Top five. (f) Top six. (g) Top seven. (h) Top eight.
Fig. 4b shows that, for the 130 faults in the Siemens suite, when a developer examines at most 1 percent of the code, CT catches 4.65 percent of the faults while LIBLIT05 and SOBER capture 7.69 percent and 8.46 percent, respectively. Moreover, when 10 percent code examination is acceptable, CT and LIBLIT05 identify 34 (26.36 percent) and 52 (40.00 percent) of the 130 faults. SOBER is the best of the three, locating 68 (52.31 percent) of the 130 faults, which is 16 faults more than the state-of-the-art approach LIBLIT05. If the developer is patient enough to examine 20 percent of the code, 73.85 percent of the faults (i.e., 96 of 130) can be located by SOBER.

We also vary the parameter k in calculating the T-score for both LIBLIT05 and SOBER. The quality comparison is plotted in Fig. 5 for k varying from 1 through 8. The comparison is confined within the [0, 20] T-score range. Since detailed results about CT are not available in [6], CT is still depicted only at the 1, 10, and 20 ticks. Fig. 5 shows that LIBLIT05 is the best when k is equal to 1 or 2. When k = 3, SOBER catches up, and it consistently outperforms LIBLIT05 afterward. Because developers are always interested in locating faults with minimal code checking, it is desirable to select the optimal k that maximizes the localization quality. We found that both LIBLIT05 and SOBER achieve their best quality when k is equal to 5. In addition, Fig. 6 plots the quality of SOBER with various k values. It clearly indicates that SOBER locates the largest number of faults when k is equal to 5. Therefore, the setting of k = 5 in Fig. 4 is justified. Finally, Fig. 6 also suggests that too few predicates (e.g., k = 1) may not convey enough information for fault localization, while too many predicates (e.g., k = 9) are in themselves a burden for developers to examine and, thus, neither of them leads to the best result.

Fig. 6. Quality of SOBER with regard to top-k values.

Besides being accurate in fault localization, SOBER is also computationally efficient. Suppose we have n correct and m incorrect executions. Then, the time complexity of scoring each predicate is O(n + m). If there are, in total, k predicates instrumented, the entire time complexity of SOBER is O((n + m)·k + k·log(k)). Similarly, LIBLIT05 also needs O(n + m) to score each predicate, and its time complexity is also O((n + m)·k + k·log(k)). We experimented with the 31 faulty versions of the replace program, and the average time for unoptimized LIBLIT05 and SOBER to analyze each version was 11.7775 seconds and 11.3844 seconds, respectively. This is much faster than CT, as reported in [6].

4.4 Comparison with TARANTULA and SLICECHOP

We now compare SOBER with TARANTULA and SLICECHOP. Recently, Jones and Harrold [8] reported the result of TARANTULA on the Siemens suite with the ranking-based T-score, and compared it with previous PDG-based T-scores of CT, NN/PERM, INTERSECT, and UNION. As it is unclear to what extent these two kinds of T-score agree with each other, we assume they are equivalent, as Jones and Harrold did in [8]. More investigation, however, is needed to clarify this issue in the future. Moreover, because the …
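Since Section 4.4 rests on the ranking-based T-score, here is a minimal sketch of how that measure could be computed (our illustration; the names are hypothetical).

```python
from typing import List, Set

def ranking_t_score(ranked_stmts: List[int], faulty: Set[int]) -> float:
    """Ranking-based T-score: percentage of statements a developer examines,
    walking the ranking top-down, before the first faulty statement is met."""
    for examined, stmt in enumerate(ranked_stmts, start=1):
        if stmt in faulty:
            return 100.0 * examined / len(ranked_stmts)
    return 100.0  # fault not in the ranking: the whole list is examined

print(ranking_t_score([17, 4, 42, 9, 23], faulty={42}))  # 60.0
```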
Fig. 8. Quality degradation with regard to percent-sampled test suites. (a) Quality of SOBER with regard to the sampled test suite. (b) Quality of LIBLIT05 with regard to the sampled test suite.
… practice due to the potentially high cost. For example, because the program specification varies from one component to another, exclusive test scripts for each component must be prepared by human testers. Although some tools can help expedite the generation of test cases [26], [27], [28], critical manual work is still unavoidable. Furthermore, besides the difficulty of test case generation, the test oracle is even harder to construct. Again because of variations in program functionality, it is usually human developers who prepare the expected outputs or pass judgment about the correctness of outputs in practice.

Therefore, considering the difficulty of obtaining an adequate test suite and a test oracle, we regard the environment that we experimented with in Section 4 as "a perfect world." In order to shed some light on how SOBER would work in practice, in this section we subject SOBER to an "imperfect world," where adequate test suites and test oracles are not simultaneously available. Section 5.1 examines SOBER's robustness to test inadequacy, and Section 5.2 studies how SOBER handles partially labeled test suites.

We regard, and hence believe, that the examination of SOBER in an "imperfect world" is both necessary and interesting. To some extent, this examination bridges the gap between the perfect-world experiments (i.e., Section 4) and real-world practices that cannot be fully covered in any single research paper. We simulate the imperfect world with the 130 faulty versions of the Siemens suite. In parallel with SOBER, LIBLIT05 is also subjected to the same experiments for a comparative study, which illustrates how the two statistical debugging algorithms react to the imperfect world.

5.1 Robustness to Inadequate Test Suites

Because of the cost of an adequate test suite, people usually settle for inadequate but nevertheless satisfactory suites in practice. For instance, during the prototyping stage, one may not bother much with all-around testing, and a preliminary test suite usually suffices. We now simulate an inadequate test suite by sampling (without replacement) the accompanying test suite of the Siemens suite. The sampled test suite becomes more and more inadequate as the sampling rate gets smaller.

Specifically, for each faulty version of the Siemens suite, we randomly sample a portion α (0 < α ≤ 1) of the original test suite T. Suppose T consists of N test cases. Then, ⌈αN⌉ cases are randomly taken, constituting an α-sampled test suite, denoted as T_α. Because both SOBER and LIBLIT05 need at least one failing case, the above sampling is repeated until at least one failing case is included. Finally, both SOBER and LIBLIT05 are run on the same T_α for each faulty program.

Fig. 8 plots how the quality varies with different sampling rates for both SOBER and LIBLIT05. We set α equal to 100 percent, 10 percent, 1 percent, and 0.1 percent, respectively, so that T_100% represents the entire test suite and each of the following is roughly one-tenth as small as the previous one. As α gets smaller, the localization quality of both SOBER and LIBLIT05 gradually degrades. For example, in Fig. 8a, curves for smaller α are strictly below those for higher sampling rates. A similar pattern for LIBLIT05 is also observed in Fig. 8b. These observations are easily explainable. In statistical hypothesis testing, the confidence of either accepting or rejecting the null hypothesis is, in general, proportional to the number of observations. Because SOBER bears a similar rationale to hypothesis testing, its quality naturally improves as more and more test cases are observed. Because LIBLIT05 relies on the accurate estimation of the two conditional probabilities, its quality also improves with more labeled test cases due to the Law of Large Numbers.

In Fig. 8a, one can also notice that the curve for α = 10% is quite close to the highest. This suggests that SOBER obtains competitive results even when the test suite is only one-tenth of the original. Moreover, Fig. 8 also indicates that even when α is as low as 0.1 percent, both SOBER and LIBLIT05 are still consistently better than CT. Based on the typical suite size from Table 1, T_0.1% contains at most six test cases, at least one of which is failing. As one can see, even with such an insufficient test suite, both SOBER and LIBLIT05 still outperform CT. For example, without examining more than 20 percent of the code, SOBER and LIBLIT05 locate 53.08 percent and 51.54 percent of the 130 faults, respectively, while CT works well with 38 percent of the versions. This could be attributed to the underlying …
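The sampling protocol just described is easy to restate in code. A sketch of ours, with a hypothetical test-case representation:

```python
import math
import random
from typing import List, Tuple

def sample_suite(suite: List[Tuple[str, bool]], alpha: float,
                 rng: random.Random) -> List[Tuple[str, bool]]:
    """Draw an alpha-sampled test suite T_alpha without replacement.

    suite -- list of (test_case_id, is_failing) pairs
    alpha -- sampling rate, 0 < alpha <= 1
    Resamples until at least one failing case is included, as required by
    both SOBER and LIBLIT05.
    """
    size = math.ceil(alpha * len(suite))
    while True:
        sampled = rng.sample(suite, size)
        if any(failing for _, failing in sampled):
            return sampled

rng = random.Random(0)
suite = [(f"t{i}", i % 40 == 0) for i in range(400)]  # 10 failing out of 400
print(len(sample_suite(suite, 0.01, rng)))            # 4 test cases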
Fig. 10. Quality comparison with regard to the number (m) of labeled failing cases. (a) m = 1. (b) m = 3. (c) m = 5. (d) m = 10.
… results than the straightforward scheme, and sometimes it even obtains comparable results to SOBER_FULL.

We simulate partially labeled test suites using the Siemens programs. For each faulty version, we randomly select m failing cases as T_f (i.e., the set of failing cases identified by the tester). According to the above scheme, all the remaining cases are regarded as passing, i.e., T_p′. We then run both SOBER and LIBLIT05 with the same T_p′ and T_f′ (recall that T_f′ = T_f) for each of the 130 faulty versions. We experiment with m equal to 1, 3, 5, and 10, respectively, and this represents the increasing effort that the tester puts into test evaluation. If a faulty version does not have m failing cases, we take all the failing cases. In the Siemens suite, there are 0, 4, 14, and 19 versions that have fewer than 1, 3, 5, and 10 failing cases, respectively. These versions were not excluded because they do represent real situations.

Fig. 10 plots the localization quality for both SOBER and LIBLIT05 with m equal to 1, 3, 5, and 10, respectively. Curves for CT and SOBER_FULL are also plotted as the baseline and ceiling quality in each subfigure. Among the four subfigures, Fig. 10a represents the toughest situation, where only one failing case is identified in each faulty version. This simulates a typical scenario where a developer starts debugging once a faulty execution is encountered. As expected, the quality of SOBER degrades considerably from SOBER_FULL, but it is still better than CT.

We note that the m = 1 situation is at least as harsh as the situation with 0.1 percent-sampled test suites, as shown in Fig. 8a. Nevertheless, at least one failing run is in every 0.1 percent-sampled test suite. In order to demonstrate the effect of treating T_u as passing, we replot the curve of SOBER with α = 0.1% in Fig. 10a with a dashed line. The remarkable gap between "SOBER" and "SOBER, 0.1%" suggests the benefit of treating unlabeled cases as passing.

The four subfigures of Fig. 10, viewed in sequence, show that the quality of SOBER gradually improves as additional failing cases are explicitly labeled. Intuitively, the more failing cases that are identified, the more accurately the statistic Y (4) approaches the true faulty behavior of predicate P and, hence, the higher the quality of the final predicate ranking list. LIBLIT05 also improves for a similar reason.

5.3 Summary

In this section, we empirically examined how SOBER works in an imperfect world, where either the test suite is inadequate or only a limited number of failing executions are explicitly identified. The experiment demonstrates the robustness of SOBER under these harsh conditions. In addition, the scheme of tagging all unlabeled cases as passing is shown to be effective in improving SOBER's quality.

6 EXPERIMENTAL EVALUATION WITH LARGE PROGRAMS

Although the 130 faulty versions of the Siemens programs are appropriate for algorithm comparison, the effectiveness of SOBER nevertheless needs to be assessed on large programs. In this section, we report on the experimental evaluation of SOBER on two (reasonably) large programs, grep 2.2 and bc 1.06. Moreover, as two faults are located in each program, this evaluation also illustrates how SOBER helps developers handle multifault cases. The detailed experimental results with grep 2.2 and bc 1.06 are presented in Sections 6.1 and 6.2, respectively.

6.1 A Controlled Experiment with grep 2.2

We obtained a copy of the grep 2.2 subject program from the "Subject Infrastructure Repository" (SIR) [29]. The original code of grep 2.2 has 11,826 lines of C code, as counted by the tool SLOCCount [30], while the announced size of the modified version at SIR is 15,633 LOC. A test suite of 470 test cases is available at SIR for the program. We tried out all the seeded faults provided by SIR, but found that no fault incurred failures on the accompanying test suite. We therefore manually injected two faults in the source code, as shown in Fig. 11 and Fig. 12, respectively.

The first fault (shown in Fig. 11) is an "off-by-one" error: an expression "+1" is appended to line 553 in the grep.c file. This fault causes failures in 48 of the 470 test cases. The second fault (in Fig. 12) is a "subclause-missing" error. The subclause (lcp[i] == rcp[i]) is commented out at line 2270 in file dfa.c. The fault incurs another 88 failing cases.

Although these two faults are manually injected, they do mimic realistic logic errors. Logic errors like "off-by-one" or "subclause-missing" may sneak in when developers are handling obscure corner conditions. Because logic errors like these two do not generally incur program crashes, they are usually harder to debug than those causing program crashes. In the following, we illustrate how SOBER helps developers find these two faults.

We first instrument the source code. According to the instrumentation schema described in Section 4.1, grep 2.2 is instrumented with 1,732 branch and 1,404 return predicates. The first run of SOBER with the 136 failing cases (due to the two faults) and the remaining 334 passing cases produces a predicate ranking, whose top three predicates are listed in Table 2. For easy reference, the three predicates are also marked at their instrumented locations in Fig. 11 and Fig. 12.

As we can see, the predicates P1 and P2 point to the faulty function for the first fault. The predicate P1 is four …
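The ranking step that produced Table 2 is, mechanically, just a sort over per-predicate scores. A sketch of ours (it reuses the idea of the scoring function from Sections 3.4 and 3.5, and all names are hypothetical):

```python
from typing import Callable, Dict, List, Tuple

def rank_predicates(biases: Dict[str, Tuple[List[float], List[float]]],
                    score: Callable[[List[float], List[float]], float],
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every instrumented predicate and report the top_k most suspicious.

    biases maps a predicate name to (passing_biases, failing_biases);
    score is a scoring function such as s(P) from eq. (9)."""
    scored = [(name, score(p, f)) for name, (p, f) in biases.items()]
    scored.sort(key=lambda item: item[1], reverse=True)  # larger s(P) first
    return scored[:top_k]

# Toy usage with a trivial score: absolute difference of mean biases.
toy = {"P1": ([0.2, 0.3], [0.9, 0.8]), "P2": ([0.5, 0.5], [0.5, 0.6])}
diff = lambda p, f: abs(sum(f) / len(f) - sum(p) / len(p))
print(rank_predicates(toy, diff, top_k=2))  # P1 first
```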
TABLE 2
Top Three Predicates from the First Run of SOBER
… with the Siemens suite cannot be generalized to arbitrary programs. However, we expect that on larger programs with greater separation of concerns, most fault localization techniques will do better. This expectation is supported by existing studies with CT, LIBLIT05, and TARANTULA [6], [8], [12], as well as the experiments in Section 6 of this study.

Threats to construct validity concern the appropriateness of the quality metric for fault localization results. In this paper, we adopt the PDG-based T-score, which was proposed by Renieris and Reiss [4]. Although this evaluation framework involves no subjective judgments, it is by no means a comprehensively fair metric. For instance, this measure does not take into account how easily a developer can make sense of the fault localization report. Recent work [6] also identifies some other limitations of this measurement. In previous work, a ranking-based T-score is used to evaluate the effectiveness of TARANTULA. Although both forms of T-score estimate the human effort needed to locate the fault, it is yet unclear whether they agree. The comparison of TARANTULA with other algorithms in Section 4.4 assumes the equivalence between the two forms. More extensive studies are needed to clarify this issue.

Finally, threats to internal validity concern the experiments of SOBER with the programs grep 2.2 and bc 1.06, discussed in Section 6. Specifically, the two logic errors in grep 2.2 are injected by us. However, because these two logic errors do not incur segmentation faults, they are generally harder to debug, even for human developers. In contrast, case studies in previous work target crashing faults [5], [6], [7], [12]. Therefore, the experiment with grep 2.2 demonstrates the effectiveness of SOBER on large programs with logic errors. In order to minimize the threats to external validity of experiments with large programs, a case study with bc 1.06 is also presented, which illustrates the effectiveness of SOBER on real faults. However, two experiments are still insufficient to make claims about the general effectiveness of SOBER on large programs. Ultimately, all fault localization algorithms should be subjected to real practice and evaluated by end users.

8 CONCLUSIONS

In this paper, we propose a statistical approach to localize software faults without prior knowledge of program semantics. This approach tackles the limitations of previous methods in modeling the divergence of predicate evaluations between correct and incorrect executions. A systematic evaluation with the Siemens suite, together with two case studies with grep 2.2 and bc 1.06, clearly demonstrates the advantages of our method in fault localization. We also simulate an "imperfect world" to investigate SOBER's robustness to the harsh scenarios that may be encountered in practice. The experimental results favorably support SOBER's applicability.

ACKNOWLEDGMENTS

The authors would like to thank Gregg Rothermel for making the Siemens program suite available. Darko Marinov provided the authors with insightful suggestions. Andreas Zeller, Holger Cleve, and Manos Renieris generously shared their evaluation frameworks. GrammaTech Inc. offered the authors a free copy of CODESURFER. Last but not least, the authors deeply appreciate the insightful questions, comments, and suggestions from the anonymous referees, which proved invaluable during the preparation of this paper.

REFERENCES

[1] E. Clarke, O. Grumberg, and D. Peled, Model Checking. MIT Press, 1999.
[2] W. Visser, K. Havelund, G. Brat, and S. Park, "Model Checking Programs," Proc. 15th IEEE Int'l Conf. Automated Software Eng. (ASE '00), pp. 3-12, 2000.
[3] M. Musuvathi, D. Park, A. Chou, D. Engler, and D. Dill, "CMC: A Pragmatic Approach to Model Checking Real Code," Proc. Fifth Symp. Operating System Design and Implementation (OSDI '02), pp. 75-88, 2002.
[4] M. Renieris and S. Reiss, "Fault Localization with Nearest Neighbor Queries," Proc. 18th IEEE Int'l Conf. Automated Software Eng. (ASE '03), pp. 30-39, 2003.
[5] A. Zeller, "Isolating Cause-Effect Chains from Computer Programs," Proc. ACM Int'l Symp. Foundations of Software Eng. (FSE '02), pp. 1-10, 2002.
[6] H. Cleve and A. Zeller, "Locating Causes of Program Failures," Proc. 27th Int'l Conf. Software Eng. (ICSE '05), pp. 342-351, 2005.
[7] B. Liblit, A. Aiken, A. Zheng, and M. Jordan, "Bug Isolation via Remote Program Sampling," Proc. ACM SIGPLAN 2003 Int'l Conf. Programming Language Design and Implementation (PLDI '03), pp. 141-154, 2003.
[8] J. Jones and M. Harrold, "Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique," Proc. 20th IEEE/ACM Int'l Conf. Automated Software Eng. (ASE '05), pp. 273-282, 2005.
[9] N. Gupta, H. He, X. Zhang, and R. Gupta, "Locating Faulty Code Using Failure-Inducing Chops," Proc. 20th IEEE/ACM Int'l Conf. Automated Software Eng. (ASE '05), pp. 263-272, 2005.
[10] I. Vessey, "Expertise in Debugging Computer Programs: A Process Analysis," Int'l J. Man-Machine Studies, vol. 23, no. 5, pp. 459-494, 1985.
[11] M. Harrold, G. Rothermel, K. Sayre, R. Wu, and L. Yi, "An Empirical Investigation of the Relationship between Spectra Differences and Regression Faults," Software Testing, Verification, and Reliability, vol. 10, no. 3, pp. 171-194, 2000.
[12] B. Liblit, M. Naik, A. Zheng, A. Aiken, and M. Jordan, "Scalable Statistical Bug Isolation," Proc. ACM SIGPLAN 2005 Int'l Conf. Programming Language Design and Implementation (PLDI '05), pp. 15-26, 2005.
[13] Y. Brun and M. Ernst, "Finding Latent Code Errors via Machine Learning over Program Executions," Proc. 26th Int'l Conf. Software Eng. (ICSE '04), pp. 480-490, 2004.
[14] S. Hangal and M. Lam, "Tracking Down Software Bugs Using Automatic Anomaly Detection," Proc. 24th Int'l Conf. Software Eng. (ICSE '02), pp. 291-301, 2002.
[15] G. Casella and R. Berger, Statistical Inference, second ed., Duxbury, 2001.
[16] M. Ernst, J. Cockrell, W. Griswold, and D. Notkin, "Dynamically Discovering Likely Program Invariants to Support Program Evolution," IEEE Trans. Software Eng., vol. 27, no. 2, pp. 1-25, Feb. 2001.
[17] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand, "Experiments on the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria," Proc. 16th Int'l Conf. Software Eng. (ICSE '94), pp. 191-200, 1994.
[18] G. Rothermel and M. Harrold, "Empirical Studies of a Safe Regression Test Selection Technique," IEEE Trans. Software Eng., vol. 24, no. 6, pp. 401-419, June 1998.
[19] T. Cover and J. Thomas, Elements of Information Theory, first ed., Wiley-Interscience, 1991.
[20] B. Pytlik, M. Renieris, S. Krishnamurthi, and S. Reiss, "Automated Fault Localization Using Potential Invariants," Proc. Fifth Int'l Workshop Automated and Algorithmic Debugging (AADEBUG '03), pp. 273-276, 2003.
[21] C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff, "SOBER: Statistical Model-Based Bug Localization," Proc. 10th European Software Eng. Conf./13th ACM SIGSOFT Int'l Symp. Foundations of Software Eng. (ESEC/FSE '05), pp. 286-295, 2005.
[22] T. Zimmermann and A. Zeller, "Visualizing Memory Graphs," Revised Lectures on Software Visualization, Int'l Seminar, pp. 191-204, 2002.
[23] J. Jones, M. Harrold, and J. Stasko, "Visualization of Test Information to Assist Fault Localization," Proc. 24th Int'l Conf. Software Eng. (ICSE '02), pp. 467-477, 2002.
[24] A. Zeller and R. Hildebrandt, "Simplifying and Isolating Failure-Inducing Input," IEEE Trans. Software Eng., vol. 28, no. 2, pp. 183-200, Feb. 2002.
[25] J. Misurda, J. Clause, J. Reed, B. Childers, and M. Soffa, "Jazz: A Tool for Demand-Driven Structural Testing," Proc. 14th Int'l Conf. Compiler Construction (CC '05), pp. 242-245, 2005.
[26] C. Pacheco and M. Ernst, "Eclat: Automatic Generation and Classification of Test Inputs," Proc. 19th European Conf. Object-Oriented Programming (ECOOP '05), pp. 504-527, 2005.
[27] C. Boyapati, S. Khurshid, and D. Marinov, "Korat: Automated Testing Based on Java Predicates," Proc. ACM/SIGSOFT Int'l Symp. Software Testing and Analysis (ISSTA '02), pp. 123-133, 2002.
[28] C. Csallner and Y. Smaragdakis, "JCrasher: An Automatic Robustness Tester for Java," Software—Practice and Experience, vol. 34, no. 11, pp. 1025-1050, 2004.
[29] H. Do, S. Elbaum, and G. Rothermel, "Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and Its Potential Impact," Empirical Software Eng.: An Int'l J., vol. 10, no. 4, pp. 405-435, 2005.
[30] D. Wheeler, SLOCCount: A Set of Tools for Counting Physical Source Lines of Code, http://www.dwheeler.com/sloccount/, 2006.
[31] K. Apt and E. Olderog, Verification of Sequential and Concurrent Programs, second ed., Springer-Verlag, 1997.
[32] D. Engler, D. Chen, and A. Chou, "Bugs as Inconsistent Behavior: A General Approach to Inferring Errors in Systems Code," Proc. Symp. Operating Systems Principles, pp. 57-72, 2001.
[33] H. Agrawal, J. Horgan, S. London, and W. Wong, "Fault Localization Using Execution Slices and Dataflow Tests," Proc. Sixth Int'l Symp. Software Reliability Eng., pp. 143-151, 1995.
[34] F. Tip, "A Survey of Program Slicing Techniques," J. Programming Languages, vol. 3, pp. 121-189, 1995.
[35] J. Lyle and M. Weiser, "Automatic Program Bug Location by Program Slicing," Proc. Second Int'l Conf. Computers and Applications, pp. 877-882, 1987.
[36] Y. Ayalew and R. Mittermeir, "Spreadsheet Debugging," Proc. European Spreadsheet Risks Interest Group Ann. Conf., 2003.
[37] J. Ruthruff, M. Burnett, and G. Rothermel, "An Empirical Study of Fault Localization for End-User Programmers," Proc. 27th Int'l Conf. Software Eng. (ICSE '05), pp. 352-361, 2005.
[38] A. Ko and B. Myers, "Designing the Whyline: A Debugging Interface for Asking Questions about Program Behavior," Proc. SIGCHI Conf. Human Factors in Computing Systems (CHI '04), pp. 151-158, 2004.
[39] W. Dickinson, D. Leon, and A. Podgurski, "Finding Failures by Cluster Analysis of Execution Profiles," Proc. 23rd Int'l Conf. Software Eng. (ICSE '01), pp. 339-348, 2001.
[40] A. Podgurski, D. Leon, P. Francis, W. Masri, M. Minch, J. Sun, and B. Wang, "Automated Support for Classifying Software Failure Reports," Proc. 25th Int'l Conf. Software Eng. (ICSE '03), pp. 465-475, 2003.

Chao Liu received the BS degree in computer science from Peking University, China, in 2003, and the MS degree in computer science from the University of Illinois at Urbana-Champaign in 2005. He is currently a PhD student in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research focus is on developing statistical data mining algorithms to improve software reliability, with an emphasis on statistical debugging and automated program failure diagnosis. Since 2003, he has published more than 10 papers in refereed conferences and journals, such as the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, the International World Wide Web Conference, the European Software Engineering Conference, the ACM SIGSOFT Symposium on the Foundations of Software Engineering, and the IEEE Transactions on Software Engineering. He is a member of the IEEE.

Long Fei received the BS degree in computer science from Fudan University, China, and the MS degree in electrical and computer engineering from Purdue University. He is currently a PhD student in the School of Electrical and Computer Engineering at Purdue University. His research interests are compilers and using compiler techniques for software debugging. He is a member of the IEEE.

Xifeng Yan received the BE degree from the Computer Engineering Department of Zhejiang University, China, in 1997, the MSc degree in computer science from the State University of New York at Stony Brook in 2001, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 2006. He is a research staff member at the IBM T.J. Watson Research Center. His area of expertise is data mining, with an emphasis on mining and search of graph and network data. His current research is focused on data mining foundations, pattern post analysis, social, biological, and Web data mining, and data mining in software engineering and computer systems. He has published more than 30 papers in reputed journals and conferences, such as the ACM Transactions on Database Systems, the ACM SIGMOD Conference on Management of Data, the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, the Very Large Database Conference, the Conference on Intelligent Systems for Molecular Biology, the International Conference on Data Engineering, and the Foundations of Software Engineering Conference. He is a member of the IEEE.

Jiawei Han is a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He has been working on research into data mining, data warehousing, stream data mining, spatiotemporal and multimedia data mining, biological data mining, social network analysis, text and Web mining, and software bug mining, with over 300 conference and journal publications. He has chaired or served on many program committees of international conferences and workshops. He also served or is serving on the editorial boards for Data Mining and Knowledge Discovery, the IEEE Transactions on Knowledge and Data Engineering, the Journal of Computer Science and Technology, and the Journal of Intelligent Information Systems. He is currently serving as founding editor-in-chief of the ACM Transactions on Knowledge Discovery from Data and on the board of directors for the executive committee of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Jiawei is an ACM fellow and an IEEE senior member. He has received many awards and recognitions, including the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society Technical Achievement Award (2005).

Samuel P. Midkiff received the PhD degree in 1992 from the University of Illinois at Urbana-Champaign, where he was a member of the Cedar project. In 1991, he became a research staff member at the IBM T.J. Watson Research Center, where he was a key member of the xlHPF compiler team and the Ninja project. He has been an associate professor of electrical and computer engineering at Purdue University since 2002. His research has focused on parallelism, high performance computing, and, in particular, software support for the development of correct and efficient programs. To this end, his research has covered dependence analysis and automatic synchronization of explicitly parallel programs, compilation under different memory models, automatic parallelization, high performance computing in Java and other high-level languages, and tools to help in the detection and localization of program errors. Professor Midkiff has over 50 refereed publications. He is a member of the IEEE.