0% found this document useful (0 votes)
18 views12 pages

On The Real-World Effectiveness of Static Bug Detectors at Finding Null Pointer Exceptions

Uploaded by

17778221637user
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views12 pages

On The Real-World Effectiveness of Static Bug Detectors at Finding Null Pointer Exceptions

Uploaded by

17778221637user
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

On the Real-World Effectiveness of Static Bug

2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) | 978-1-6654-0337-5/21/$31.00 ©2021 IEEE | DOI: 10.1109/ASE51524.2021.9678535

Detectors at Finding Null Pointer Exceptions


David A. Tomassi Cindy Rubio-González
University of California, Davis University of California, Davis
United States of America United States of America
[email protected] [email protected]

Abstract—Static bug detectors aim at helping developers to drawn with respect to the overall bug-finding capabilities of
automatically find and prevent bugs. In this experience paper, the tools. In contrast, this paper evaluates static bug detectors
we study the effectiveness of static bug detectors at identifying with respect to their effectiveness at finding a common and
Null Pointer Dereferences or Null Pointer Exceptions (NPEs).
NPEs pervade all programming domains from systems to web serious kind of bug: Null Pointer Dereferences or Null Pointer
development. Specifically, our study measures the effectiveness of Exceptions (NPEs).
five Java static bug detectors: CheckerFramework, E RADICATE, NPEs pervade all programming domains from systems
I NFER, N ULL AWAY, and S POT B UGS. We conduct our study on software to web development. For instance, as of August
102 real-world and reproducible NPEs from 42 open-source 2021, there are over 1,900 CVEs (Common Vulnerabilities and
projects found in the B UG S WARM and D EFECTS 4J datasets. We
apply two known methods to determine whether a bug is found by Exposures) that involve NPEs [3]. One such CVE describes
a given tool, and introduce two new methods that leverage stack a denial of service attack in early versions of Java (1.3 and
trace and code coverage information. Additionally, we provide a 1.4) caused by crashing the Java Virtual Machine when calling
categorization of the tool’s capabilities and the bug characteristics a function with a null parameter [1]. In general, NPEs are
to better understand the strengths and weaknesses of the tools. problematic in memory-unsafe and object-oriented languages.
Overall, the tools under study only find 30 out of 102 bugs
(29.4%), with the majority found by E RADICATE. Based on our NPEs occur when either a pointer to a memory location or an
observations, we identify and discuss opportunities to make the object is dereferenced while being uninitialized or explicitly
tools more effective and useful. set to null. Depending on the programming language, NPEs
Index Terms—static bug detectors, null pointer exceptions, null will result in either undefined behavior or a runtime exception.
pointer dereferences, bug finding, BugSwarm, Defects4J, Java This experience paper evaluates recall of static bug detectors
with respect to a known set of real NPE bugs. The focus
I. I NTRODUCTION on NPEs allows to present an in-depth study of different
Defects in software are a common and troublesome fact approaches to find a same kind of bug, the characteristics of
of programming. Software defects can cause programs to real-world NPEs, and the reasons that affect tool effectiveness.
crash, lose or corrupt data, suffer from security vulnerabilities, To the best of our knowledge, this is the first study on the real-
among other problems. Depending on the application domain, world effectiveness of static bugs detectors at finding NPEs.
undesirable behavior can range from poor user experience to There are two orthogonal approaches to finding or prevent-
more severe consequences in mission critical applications [44]. ing NPEs, which make use of either a static bug detector or
Testing to uncover such software defects remains one of the a type-based null safety checker. The former uses dataflow
most expensive tasks in the software development cycle [31]. analysis [6, 10, 23, 29, 30, 32, 34] to find null dereferences.
There is a need for both precision and scalability when Such approaches mainly differ on the complexity of their
finding defects in real-world code. Furthermore, in an effort analyses. Some favor analysis scalability at the expense of
to increase their applicability, static bug detectors are often missing real bugs and/or producing numerous false positives,
designed to target a large variety of software bugs. Many e.g., intra/interprocedural and field sensitivity. The latter pre-
static bug detectors [2, 5, 7, 9, 10, 13–16] are currently being vents NPEs via a type system with null-related information
developed in industry and academia. Even with many tools to using dataflow analysis for type refinement. The type checker
choose from, developers have some hesitation in using static approach has been adopted in recent years [4, 14, 19, 33].
bug detectors for a variety of reasons such as large number of We study two popular Java static bug detectors: I NFER
bug warnings, high false positive rates, and inadequate warning [6, 15–17] and S POT B UGS [10], and three popular type-
messages [18, 26]. based null safety checkers for Java: Checker Framework’s
Previous studies have evaluated static bug detectors through Nullness Checker (CFN ULLNESS) [19, 33], E RADICATE [4],
various metrics: number of warnings [35], number of false and N ULL AWAY [8, 14]. I NFER uses separation logic and
negatives [38], tool performance [35], and recall [20, 41]. bi-abduction analysis [16] to infer pre/post conditions from
These studies have focused on popular tools that identify procedures affecting memory. S POT B UGS detects bugs based
a large number of bug patterns, and their conclusions are on a predefined set of bug patterns. CFN ULLNESS verifies

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
Fig. 1: Workflow for running tools, collecting reports, parsing results, and analyzing data.

the absence of NPEs via type checking nullable expression with E RADICATE finding the majority of these.
dereferences and assignments. E RADICATE is a type checker The second challenge is to understand the reasons why
that performs flow-sensitive analysis to find possible null tools fail to find NPEs to identify opportunities to improve
dereferences. Finally, N ULL AWAY uses dataflow analysis to their real-world effectiveness. This requires understanding the
type check nullability in procedures and class fields. capabilities of the tools under study as well as the charac-
In this study, we consider 102 real-world and reproducible teristics of the NPE bugs in our dataset. First, we conduct
NPEs found across 42 popular open-source Java projects. 76 a detailed analysis of the tools’ capabilities with respect to
of these NPEs belong to the B UG S WARM dataset [42] while well-known program-analysis properties (e.g., flow sensitivity,
the remaining 26 are from D EFECTS 4J [27]. For each NPE, context sensitivity, etc.), and we identify common sources of
both datasets provide buggy and fixed versions of the programs unsoundness. This process required us to manually inspect the
along with scripts for compilation and testing. Furthermore, source code of the tools and write tests. All of our findings
each program has a failing test due to an NPE. This makes both were later confirmed by tool developers. Second, we manually
the B UG S WARM and D EFECTS 4J datasets good candidates for inspect and categorize each NPE bug in the dataset with
this study; we want to run existing static bug detectors and type respect to the nature of the dereference and its context. Based
checkers on these programs to determine their effectiveness at on the tool results, and the tool and bug characterizations,
detecting and preventing real NPEs. we identify several open opportunities to improve static bug
The first challenge is to determine whether a tool finds or detectors that find NPEs.
prevents a specific NPE bug. Tools may report the program The contributions of this paper are:
location at which the null dereference occurs, or simply the • We present two new methods that leverage dynamic infor-
location where the null value originates, which can be far mation to map tool warnings to NPE bugs (Section II).
from the dereference. The latter is particularly difficult to • We provide a categorization of the tool’s capabilities and
associate with the bug fix, which is often applied closer to the the bug characteristics to better understand the strengths
dereference site. Another difficulty lies in the large number of and weaknesses of the tools under study (Section III).
warnings to inspect. On average a tool produces from 122 to • We evaluate CFN ULLNESS , E RADICATE , I NFER , N ULL -
1,307 bug warnings per program (in our dataset). AWAY, and S POT B UGS on a collection of 102 NPEs, from
Previous work has partially automated the process of map- which only 29.4% of NPE bugs are detected (Section IV).
ping bugs to warnings based on static information such as the • We discuss the capabilities and limitations of each tool,
code difference (diff) between buggy and fixed versions [20, and provide future directions for improving their real-
38], and by comparing the warnings produced for each version world effectiveness (Section V).
of the program [20]. In this paper, we observe that dynamic
information can also be leveraged when an input exposing the II. M ETHODOLOGY
NPE bug is available, which is the case for all the bugs in our Here we describe the benchmark and tool selection, and
dataset. We present two new mapping methods for NPEs that the methodology to determine the effectiveness of the tools at
use (1) stack trace information, and (2) code coverage of tests finding NPEs. Figure 1 shows the main steps of our approach.
that fail due to NPEs. Our experimental evaluation shows that
these methods complement previous approaches. A. Benchmark Selection
We run CFN ULLNESS, E RADICATE, I NFER, N ULL AWAY, Our study focuses on Null Pointer Exceptions (NPEs). We
and S POT B UGS on our dataset of 102 real NPEs. We find consider bugs from the B UG S WARM and D EFECTS 4J datasets,
that the tools produce a large number of warnings, including both of which provide a bug classification based on runtime
over 500,000 NPE warnings across all programs. We apply exceptions. Our selection criteria is: (1) the bug is due to an
existing approaches, and our new methods, to identify the NPE, (2) there is a failing test due to the NPE, and (3) code
warnings that describe the bugs under study. Ultimately, we coverage can be measured. Additionally, we control for unique
find that the tools detect only 30 out of 102 bugs (29.4%), builds when selecting B UG S WARM bugs. Our final dataset

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
consists of 76 NPE bugs from the B UG S WARM dataset and hold a null value. The user can add explicit @Nullable
26 from D EFECTS 4J. The B UG S WARM NPE bugs belong annotations to obtain more precise results.
to 32 Java projects hosted on GitHub that use the Maven e) S POT B UGS: S POT B UGS applies pattern matching and
build system, while the D EFECTS 4J bugs belong to 10 Java limited dataflow analysis to find a large variety of bugs
projects that use the Ant build system. Note that all NPEs are such as infinite recursion, integer overflows, and null pointer
reproducible, i.e., one can run the programs and observe a Null dereferences. The tool produces an XML report listing bug
Pointer Exception being thrown. Furthermore, we manually warnings that include class name, method name, severity, and
verified that each NPE bug in our study is an actual NPE, i.e., line numbers associated with the identified bug. S POT B UGS
a null object is eventually dereferenced. Each NPE instance is available as a plugin for a variety of build systems such
consists of the source code that contains the bug, the source as Ant, Gradle, and Maven. We run S POT B UGS with effort
code that fixes the bug, and scripts to compile and test. level “max”, which indicates that S POT B UGS performs its
interprocedural analysis. Also, we use two different error con-
B. Tool Selection and Configuration fidence threshold settings “low” and “high” (“low” confidence
threshold may report a higher number of false positives).
We conducted an extensive search for tools that find or
prevent NPE bugs in Java projects. We focused on publicly C. Analysis of NPE Warnings
available tools that are standalone and under active devel- A challenge in this study is to determine whether a tool finds
opment. Out of nine tools, four [29, 30, 32, 34] did not or prevents a specific NPE bug. In the case of NPEs, tools
satisfy at least one of these requirements. In this paper we may report the program location at which the null dereference
study the remaining five tools: CFN ULLNESS, E RADICATE, occurs, or simply report the location where the null value
I NFER, N ULL AWAY, and S POT B UGS. Note that I NFER and originates, which can be far from the dereference. The latter
S POT B UGS find a large variety of bugs in addition to NPE is particularly difficult to associate with the bug fix, which is
bugs. CFN ULLNESS, E RADICATE, and N ULL AWAY exclu- often applied closer to the dereference site.
sively specialize in NPEs. Below we describe each tool. We consider four approaches for mapping bug warnings to
a) CFN ULLNESS: A type checker written using the actual bugs in the source code, i.e., determine whether a tool
Checker Framework, which is available as a compiler plu- finds a given bug under study. Two of these approaches have
gin. CFN ULLNESS works with nullness type annotations, been used in previous work: the C ODE D IFF M ETHOD [20, 38]
@Nullable and @NotNull, and looks for violations in their and the R EPORT D IFF M ETHOD [20]. We explore two new
use. Namely looking for dereferences on @Nullable expres- approaches, which we refer to as the S TACK T RACE M ETHOD
sions and for @Nullable-value assignments to @NotNull and the C OVERAGE M ETHOD.
variables. CFN ULLNESS produces compile-time warnings. We Figure 2 shows an example of an NPE found in the
run CFN ULLNESS using its default configuration. OpenPnP1 GitHub project as part of the BugSwarm dataset.2
b) E RADICATE: A type checker part of the I NFER static- Method saveDebugImage is called on Line 8 of file
analysis suite. E RADICATE type checks for @Nullable OpenCvVisionProvider.java (see Figure 2b), where
annotations in Java programs by performing a flow-sensitive argument debugMat is null. Method saveDebugImage
analysis to propagate null-related information through assign- in file OpenCvUtils.java calls toBufferedImage on
ments and calls. E RADICATE produces warnings for accesses Line 6 (see Figure 2a), passing in null, which is then
that could lead to an NPE. E RADICATE produces a report dereferenced on Line 11. The code highlighted (in green)
in JSON format that provides the stack trace, severity, and represents the patch to fix the NPE. Figure 2e shows the stack
source location associated with each bug detected. We run trace, and Figures 2c and 2d show the warnings produced by
E RADICATE using its default configuration. S POT B UGS and I NFER, respectively.
c) I NFER: A static-analysis tool developed by Facebook 1) C ODE D IFF M ETHOD: This method takes as input the
that finds a variety of bugs in Java, C/C++, and Objective- set of warnings reported for the buggy program and the set of
C programs. I NFER uses bi-abduction analysis to find bugs patches from the GitHub code diff.3 The analysis focuses on
including deadlocks, memory leaks, and null pointer deref- NPE bug warnings, and checks whether the source location of
erences. Similar to E RADICATE, I NFER produces a report in these warnings overlaps with the lines changed in the patches.
JSON format that provides the stack trace, severity, and bug However, this is based on an over-approximation; the lowest
location. We use I NFER’s default setting, which runs the bi- and highest line numbers associated with the patch in each
abduction analysis. changed file are considered.4 If an overlapping line is found,
d) N ULL AWAY: A type checker for Java developed by then the warning is considered a bug candidate. We manually
Uber that applies various AST-based checkers to find NPE examine bug candidates to verify their validity.
bugs. N ULL AWAY is available as a plugin for Maven and
1 https://github.com/openpnp/openpnp
Gradle. We use N ULL AWAY’s default configuration, which 2 BugSwarm image tag: openpnp-openpnp-213669200.
assumes that unannotated method parameters, return values, 3A GitHub code diff may consist of several patch fragments.
and class fields are not null. In such cases, the tool produces 4 Previous work has also added a configurable number of lines before the
a warning when it is found that any of those locations could starting point and after the ending point of the line range [20].

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
@@ -2,6 +2,9 @@ public synchronized static Mat ...
<BugInstance rank="8" abbrev="NP" category="
2 2
3 3 } CORRECTNESS" priority="2" type="
4 4 NP_NULL_PARAM_DEREF">
5 5 public static void saveDebugImage(..., Mat mat) { <Method classname="org.openpnp.util.
6 + if (mat == null) { OpenCvUtils" name="saveDebugImage">
7 + return; <SourceLine classname="org.openpnp.util.
8 + } OpenCvUtils" start="6" end="6"
6 9 saveDebugImage(...,OpenCvUtils.toBufferedImage(mat)); sourcefile="OpenCvUtils.java"/>
7 10 ...
8 11 }
9 12 (c) SpotBugs XML report.
10 13 public static BufferedImage toBufferedImage(Mat m) {
11 14 if (m.type() == CvType.CV_8UCI) {...} // NPE!
12 15 } {"bug_class":"PROVER",
"kind":"ERROR",
(a) GitHub diff in file OpenCvUtils.java. "bug_type":"NULL_DEREFERENCE",
"qualifier":"object ‘debugMat‘ last assigned
1 1 public getTemplateMatches(BufferedImage template) {
on line 3 could be null and is
2 2 ...
3 3 Mat debugMat = null; dereferenced by call to ‘saveDebugImage
4 4 if (LogUtils.isDebugEnabled()) { (...)‘ at line 8.",
5 5 debugMat = imageMat.clone(); "file":"OpenCvVisionProvider.java",
6 6 } "severity":"HIGH",
7 7 ... ...
8 8 OpenCvUtils.saveDebugImage(..., debugMat); }
9 9 }

(b) Null origin in file OpenCvVisionProvider.java. (d) Infer JSON report.


java.lang.NullPointerException
at org.openpnp.util.OpenCvUtils.toBufferedImage(OpenCvUtils.java:11)
at org.openpnp.util.OpenCvUtils.saveDebugImage(OpenCvUtils.java:6)
at org.openpnp.machine.reference.vision.OpenCvVisionProvider.getTemplateMatches(
OpenCvVisionProvider.java:8)
...
(e) Stack trace for buggy program.

Fig. 2: GitHub diff, stack trace, SpotBugs XML report, and Infer JSON report for an NPE found by S POT B UGS LT and I NFER.

Consider the patch in Figure 2a. The line at the top mean that the bug of interest was found. The code change
(starting with @@) indicates that the patch includes orig- could have introduced “noise” that leads the tool to conclude
inal lines 2 through 6, and new lines 2 through 9 from that an unrelated bug warning is no longer a problem. We
file OpenCvUtils.java. Therefore, the approximated line observe that this occurs often in practice (see Section IV-B).
range is 2 through 9 for the buggy program, i.e., the
3) S TACK T RACE M ETHOD: This approach requires the set
program before the fix. The S POT B UGS report (see Fig-
of bug warnings of the buggy program, and the stack trace(s)
ure 2c) includes the XML tag SourceLine: Line 6 of file
produced when running the buggy program. As with previous
OpenCvUtils.java. This location lies within the line
methods, this approach only considers warnings related to NPE
range 2–9, thus the method correctly collects this warning
bugs. For each NPE warning, the algorithm retrieves the file
as a bug candidate. On the other hand, even though I NFER
and line number(s) associated with the warning, and checks
(see Figure 2d) successfully finds the bug, the C ODE D IFF
whether those are included in the stack trace. If so, the warning
M ETHOD approach fails to map the warning because the report
is classified as a bug candidate.
does not include lines close to the fix. In this case, using code
diff information is not effective. Consider again the example from Figure 2. The
S POT B UGS report (Figure 2c) mentions Line 6 in file
2) R EPORT D IFF M ETHOD: This method uses the set of OpenCVUtils.java. The report pinpoints that there is a
bug warnings of the buggy program, and the set of warnings null parameter in a recursive call to saveDebugImage,
of its fixed version. The algorithm searches for NPE bug warn- which could result in an NPE. On the other hand, the I N -
ings that are only reported for the buggy program. The intuition FER report (Figure 2d) lists a warning associated with file
is that the warning that describes the bug of interest should OpenCvVisionProvider.java on Line 8. The call to
not be present in the bug report of the fixed program. Using saveDebugImage in method getTemplateMatches is
this method, both S POT B UGS and I NFER are determined to passed debugMat as argument, which could be null and
have found the bug from Figure 2. This method is convenient result in an NPE. Note that I NFER refers to a lower stack
because it only requires two bug reports. However, the absence frame than S POT B UGS, but the S TACK T RACE M ETHOD
of a bug warning in the fixed program does not necessarily successfully maps both reports to the same bug because both

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
locations can be found in the stack trace (Figure 2e).
Concurrency 2 Bug Classification
The S TACK T RACE M ETHOD takes advantage of the nature
Generics 3
of NPE bugs and their presence in the stack trace. Because
Third Party Library 8
NPE bugs correspond to Null Pointer Exceptions, the call stack
Collection-Like Object 10
is given at the time the exception occurs. This information is
a valuable resource that leads to a more natural bug mapping Method Parameter 13
than previous methods. However, this method requires an Map-Like Object 13
executable buggy program and a reproducible NPE. Also, this Reflection 13
method provides a line in the stack trace that can be mapped Field 17
to a bug warning, however, this does not necessarily mean Method Return 32
that the tool found the correct dereference; there are long 0 5 10 15 20 25 30 35
dereference chains that may be associated with the same line. Number of Bugs
Thus, as with previous methods, it is necessary to manually
verify that the trace indeed matches the context of the NPE Fig. 3: Bug Classification Results
warning. We consider all available sources of information such
as source code and code diff during manual inspection.
4) C OVERAGE M ETHOD: This method is a general version Using the source code, the GitHub code diff, and the build
of the S TACK T RACE M ETHOD, but it includes all lines log, we identified the origin of the null value, and its
executed by the test that triggers the NPE. The input is the dereference location. Based on this inspection, we identified
set of NPE warnings of the buggy program, and the lines nine general categories of NPEs with respect to what is deref-
covered (executed) by a test case that fails due to an NPE. erenced, and the context of the dereference. These categories
The approach determines if the source location given in a along with their counts can be found in Figure 3. Note that
warning is covered, in which case the warning is added to an NPE can belong to multiple categories. The most common
the set of bug candidates. This captures the scenario where categories are when a method return value is dereferenced (32
the location of an NPE warning is far away from the actual bugs) and when a field is dereferenced (17 bugs).
dereference, which is particularly useful when analyzing the As for the tool capabilities, we consider seven well-known
warnings produced by type checkers such as N ULL AWAY and program analysis properties: (1) intraprocedural, (2) interpro-
E RADICATE. The assumption is that even if the NPE warning cedural, (3) flow sensitive, (4) context sensitive, (5) field
and the actual dereference are located far away from each sensitive, (6) object sensitive, and (7) path sensitive [11].
other, both source locations will be part of the execution trace. We identified seven common sources of unsoundness: (1)
For example, consider a case in which a field is set to null handling of third-party libraries whose source code may not
in a constructor and the field is dereferenced in some method. be available, (2) impure methods that have side effects and
Type checkers may produce a warning related to setting the are non-deterministic, (3) concurrency, (4) reasoning about
field to null, but not a warning describing the dereference dynamic dispatch, (5) dealing with code that uses reflection,
itself. However, in this case, both source locations will be (6) field initialization after a constructor is called, and (7)
part of the execution trace. Note that this approach requires generic parameters. Unsoundness can lead to false positives
the existence of a failing test that triggers the NPE, and the (incorrect bug warnings) and false negatives (missed bugs).
ability to execute the test. Both requirements are met for our We studied CFN ULLNESS, E RADICATE, I NFER, N ULL -
dataset. As with other methods, we manually inspect all bug AWAY, and S POT B UGS with respect to the above analysis
candidates to determine their validity. characteristics and sources of unsoundness. In this process,
we manually inspected the source code and documentation of
III. B UG AND T OOL C HARACTERIZATION the tools, and we wrote kernel test programs that exhibited
different categories of behaviors to confirm tools’ capabilities
A fundamental step in evaluating the effectiveness of static and limitations. Table I shows the tool capabilities, and Table II
bug detectors is to understand their capabilities, and whether shows the sources of unsoundness for each tool. Below we
real-world bugs possess the desired characteristics to be de- describe our findings for each tool, which were confirmed by
tected. In this section, we characterize the dataset of real the corresponding developers.
NPEs as well as the tools under study with respect to their a) CFN ULLNESS: An ensemble of three checkers: (1)
approaches to find NPEs. We describe our methodology and an intraprocedural flow-sensitive qualifier inference for the
results, which will be critical in Section IV to determine nullness of a particular object, (2) initialization checking, and
whether a given NPE can be found by the tools. (3) map key checking. It assumes @NonNull for unannotated
First, we performed a manual categorization of all 102 code except for locals, and provides an analysis for iterating
NPEs to determine the root cause of the null pointer deref- over null collections and arrays. Additionally, CFN ULLNESS
erences. The categorization was performed separately by an supports annotations to denote: (1) if a method has no side-
author of this paper and two people external to the project. effects or is deterministic, (2) the target of a reflection invo-
When in disagreement, the inspectors met to reach consensus. cation, (3) and upper bounds of types for generic objects.

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
TABLE I: Tool Capabilities Confirmed by Developers. != has capabilities, "= no capabilities, Partial = limited capabilities.

Tool Intraproc. Interproc. Field Sensitive Context Sensitive Object Sensitive Flow Sensitive Path Sensitive
CFN ULLNESS ! " ! " " ! "
E RADICATE ! " Partial " " ! !
I NFER ! ! ! " ! ! !
N ULL AWAY ! Partial Partial " N/A Partial Partial
S POT B UGS ! Partial Partial " " Partial Partial

TABLE II: Sources of Unsoundness for the Tools. != is sound, "= is unsound, Partial = unsound in some aspects.

Tool Third Party Libs. Impure Methods Concurrency Dynamic Dispatch Reflection Field Init. Generic Types
CFN ULLNESS ! ! ! ! ! ! !
E RADICATE Partial " " " " Partial "
I NFER Partial Partial " Partial " " "
N ULL AWAY " " " ! " Partial "
S POT B UGS " " " " " " "

b) E RADICATE: An intraprocedural flow-sensitive anal- check fields that are read by methods called from constructors.
ysis for the propagation of nullability through variable assign- e) S POT B UGS: A null-pointer analysis inherited from
ments and function calls. E RADICATE also raises an alarm F IND B UGS [24] that combines forward and backward dataflow
for accesses to fields that have annotated nullability, however analyses for tracking null values. The analysis provides
its field initialization checker is subject to ongoing work. limited tracking of object fields; it does not support aliasing
E RADICATE’s nullability annotations allow for the annotation and volatile fields, and it assumes that any method can modify
of methods, fields, and method parameters with @Nullable a field of an object passed as argument. Additionally, S POT-
annotations. As detailed in Table II, E RADICATE provides B UGS provides a null-related annotation @CheckForNull to
built-in models of the JDK and Android SDK and supports denote values that must be null-checked prior to a dereference.
user-specified nullability signatures for other third-party li- Our tests confirmed the intraprocedural nature of S POT B UGS,
braries, which helps mitigate false negatives. however we were unable to expose S POT B UGS’ field sensi-
c) I NFER: An interprocedural analysis that supports tivity. Lastly, S POT B UGS infers parameter and return value
tracking object aliasing, side effects in methods, and dynamic information intraprocedurally if these are null checked, and it
types of objects. All our tests were successful when running suffers from all sources of unsoundness as shown in Table II.
I NFER, showing that the tool is interprocedural and field IV. E XPERIMENTAL E VALUATION
sensitive. A caveat is that I NFER does not find uninitialized
fields, but it can find null dereferences to fields that have been This experimental evaluation is designed to answer the
initialized. As shown in Table II, I NFER partially supports following research questions:
third-party libraries via an internal model of the JDK. For RQ1 How prevalent are NPEs among all warnings?
impure methods, I NFER tracks some effects in methods, e.g., RQ2 How effective are bug mapping methods for NPEs?
if a method sets this.field = null, the effect will be RQ3 How effective are static bug detectors for NPEs?
tracked at the call site. Tracking dynamic types of objects RQ4 What are the reasons bug detectors miss NPEs?
is useful to refine the control-flow graph. However, this only We ran CFN ULLNESS, E RADICATE, I NFER, N ULL AWAY,
occurs in the context of the entry point of the analysis. and S POT B UGS on our dataset of 102 programs with real NPE
bugs to generate bug reports for the buggy and fixed versions
d) N ULL AWAY: A flow-sensitive type refinement anal-
of the programs. We ran the tools on the full programs,
ysis to infer nullness of local variables that includes a field
and verified that the files relevant to the bug and fix were
initialization checker. N ULL AWAY assumes that unannotated
indeed analyzed by the tools. We considered two settings for
code cannot be null. For methods, fields, and method parame-
S POT B UGS: low and high thresholds. The results are presented
ters annotated with the @Nullable annotation, N ULL AWAY
as S POT B UGS LT and S POT B UGS HT, respectively. We au-
ensures no dereferences, and that their value will not be
tomatically parsed the bug reports to extract and normalize
assigned to a non-null field or argument. Our tests showed
relevant information, which we stored in a MySQL database.
that N ULL AWAY finds local and object field dereferences
Our study is fully reproducible. The dataset of real re-
without annotations. With annotations, N ULL AWAY can find
producible NPE bugs from B UG S WARM and D EFECTS 4J is
null dereferences of method parameters and return values.
publicly available as well as the tools we study. The scripts
N ULL AWAY is able to avoid dynamic dispatch as a source of
for performing the experiments and all data described in this
unsoundness by ensuring that methods that are overridden have
section is publicly accessible.5
the same nullability as its parent’s class. N ULL AWAY’s field
initialization is unsound. For example, the analysis does not 5 https://github.com/ucd-plse/Static-Bug-Detectors-ASE-Artifact

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
Static Inner Class 6,983 SpotBugsLT Wrong Map Iterator 784 SpotBugsHT Unsafe Thread Interface 128 Infer
Unused Field 8,922 Return Value 1,780 Immutable Cast 1,966

Return Value 9,751 Dead Local Store 2,512 Thread Safety Violation 6,742
Dangerous Method 14,828 Null Dereference 8,807 Null Dereference 12,307

Null Dereference 22,656 Dangerous Method 9,096 Resource Leak 15,708


0 10,000 20,000 30,000 40,000 0 5,000 10,000 15,000 20,000 0 10,000 20,000 30,000 40,000
Number of Warnings Number of Warnings Number of Warnings

Fig. 4: S POT B UGS LT, S POT B UGS HT, and I NFER distribution of top 5 warnings.

TABLE III: Number of all warnings and NPE warnings TABLE IV: Bugs mapped by each method. We show the
produced by each tool.“Avg All” and “Avg NPEs” refer to number of correct mappings / the total number of mappings.
the average number of warnings produced per program. Column “Bugs Found” gives the total number of bugs found
per tool. Inside parenthesis are the number of bugs that a tool
Tool All NPEs Avg All Avg NPEs found but not others. 30 unique bugs are found across tools.
CFN ULLNESS 231,860 231,860 (100%) 1,137 1,137
E RADICATE 266,682 266,682 (100%) 1,307 1,307 Method
I NFER 37,035 12,307 (33.2%) 181 60 Tool Code Report Stack Covered Bugs Found
N ULL AWAY 25,065 25,065 (100%) 122 122
S POT B UGS HT 49,555 8,807 (17.8%) 243 43 CFN ULLNESS 6/56 5/18 7/26 5/56 11 (2)
S POT B UGS LT 129,183 22,656 (17.5%) 633 111 E RADICATE 10/51 7/24 7/22 8/52 20 (5)
I NFER 3/23 2/13 9/12 7/27 10 (1)
N ULL AWAY 1/17 0/21 1/4 5/26 5 (2)
S POT B UGS HT 4/18 4/6 1/5 2/13 4 (0)
S POT B UGS LT 6/46 6/13 6/12 3/26 9 (1)
A. RQ1: Prevalence of NPE Warnings Total 30/211 24/95 31/81 30/200 30 Unique

Table III shows the total and average number of warnings


produced by each tool when analyzing the programs. There are B. RQ2: Effectiveness of Bug Mapping Methods
a total of 739,380 warnings across the 102 × 2 programs in
We applied the four methods discussed in Section II-C to
our dataset. E RADICATE yields the largest number of warnings
find whether the tool warnings describe the bugs of interest.
with 266,682, all of which are NPE warnings. Similarly,
In total, all methods together correctly find 30 distinct bugs
CFN ULLNESS has the second highest number of NPE warn-
out of 102 bugs (29.4%). All bug candidates were manually
ings with a total of 231,860. S POT B UGS LT produces the
inspected. Table IV summarizes the results.
third highest number of warnings with 129,183 warnings,
An effective mapping method is defined as having high
and S POT B UGS HT follows with 49,555 warnings. Unlike
recall and precision. The S TACK T RACE M ETHOD is the most
E RADICATE, CFN ULLNESS and N ULL AWAY, S POT B UGS can
effective among the four, mapping 31 bugs with a precision
generate over a hundred different types of non-NPE warnings
of 38.2%. All the NPEs mapped to a warning were contained
while I NFER generates seven.
within the S TACK T RACE M ETHOD, except for four. On the
Figure 4 shows the top five types of warnings for S POT- other hand, while the C ODE D IFF M ETHOD and C OVERAGE
B UGS LT, S POT B UGS HT, and I NFER. It is observed that NPEs M ETHOD produce the largest number of bug candidates across
are one of the most prevalent warnings for these tools: the most all tools, they also suffer from the lowest precision, 14.2% and
common for S POT B UGS LT, and the second most common 15.0%, respectively. The R EPORT D IFF M ETHOD mapped the
for both S POT B UGS HT and I NFER. Indeed, NPEs constitute lowest number of true bugs in comparison to other methods
from 17.5% to 33.2% of the total warnings produced by (24 bugs), but its precision of 25.3% was still higher than that
these tools. For S POT B UGS HT, we observe a reduction in of the C ODE D IFF M ETHOD and C OVERAGE M ETHOD. The
total number of warnings and NPE warnings with respect to results show that the four methods are primarily complemen-
S POT B UGS LT of 61.6% and 61.1%, respectively. tary of each other as they map different types of information.
Finally, N ULL AWAY produces the fewest warnings (all of
them are NPE warnings), with a total of 25,065. RQ2: The S TACK T RACE M ETHOD is the most effective
with 81 bug candidates, of which 31 were true bugs
RQ1: NPE warnings are prevalent in all the tools stud- (38.2%). The C ODE D IFF M ETHOD and C OVERAGE
ied. A total of 567,377 NPE warnings (76.7% of all M ETHOD had similar recall than the S TACK T RACE
warnings) are produced for our dataset. The percentage M ETHOD, but a lower precision of 14.2% and 15.0%,
of NPE warnings for S POT B UGS LT is 17.5%, S POT- respectively. The R EPORT D IFF M ETHOD had the lowest
B UGS HT 17.8%, and I NFER 33.2%. recall but a higher precision than C ODE D IFF M ETHOD
and C OVERAGE M ETHOD.

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
1 protected Object decode(Channel channel, ...){ 1 public class GrblCntrllr extends AbstractCntrllr {
2 - if (channel == null) { 2 - capabilities = null;
3 + if (channel != null) { 3 + capabilities = new GrblUtils.Capabilities()
4 channel.write(response, remoteAddress); 4 protected void pauseStreamingEvent(){
5 } 5 if (this.capabilities.REAL_TIME) { ... }
6} 6}

(a) NPE bug found by both S POT B UGS and E RADICATE. (b) NPE bug dereferencing field of an object not found by any tool.
1 protected void ldCmdVerSheet(String sheetName) { 1 private void verifyDecodedPosition() {
2 Sheet sheet = switchToSheet(sheetName, false); 2 - if(p.gNtk()!=null){
3 + if(sheet==null) return; 3 + if(p.gNtk()!=null && p.gNtk().gTwrs()!=null){
4 while(i<sheet.getRows()) { ... } 4 for (Twr Twr : p.gNtk().gTwrs()){
5} 5}

(c) NPE bug due to null dereference of a return value. (d) NPE bug with dereferencing object returned from a method.
Fig. 5: Examples of NPE diffs from the dataset.

C. RQ3: Effectiveness of Tools at Finding NPEs some nullness annotations. We ran all tools on the anno-
tated programs, except for I NFER which does not use an-
Overall, the tools find 30 distinct bugs out of 102 bugs
notations. IntelliJ added 34,229 @Nullable and 167,236
(29.4%). The breakdown per tool is shown in Table IV.
@NotNull annotations. We applied the C OVERAGE M ETHOD
E RADICATE finds the most bugs with 20 out of 30. CFN ULL -
to map warnings. This resulted in 2, 3, 0, and 2 additional
NESS finds 11 bugs, I NFER 10 bugs, and S POT B UGS LT 9
bugs found by CFN ULLNESS, E RADICATE, N ULL AWAY, and
bugs. N ULL AWAY and S POT B UGS HT find the fewest bugs
S POT B UGS LT, respectively. These accounted for three unique
with 5 and 4, respectively. We examined the overlap among
bugs across all tools. Despite the small increase in bugs found,
bugs found by each tool. The two tools with the most overlap
the results are promising as annotating less than half of the
are CFN ULLNESS and E RADICATE with 7 bugs. Interestingly,
programs resulted in finding 10% more bugs in total.
each tool finds bugs not found by other tools (also shown in
Table IV). This shows that the tools are complementary, and
RQ3: Overall, the tools have low effectiveness at finding
that practitioners could benefit from running multiple tools. A
NPE bugs. Out of the 102 bugs in our dataset, E RAD -
challenge to this is the large number of warnings to inspect.
ICATE found 20 bugs (19.6%), CFN ULLNESS found
An example of a bug found by CFN ULLNESS, I NFER, and 11 (10.8%), I NFER found 10 (9.8%), S POT B UGS LT
S POT B UGS LT was given in Figure 2a. We show the diff found 9 bugs (8.8%), N ULL AWAY found 5 (4.9%), and
between a buggy version (with an NPE bug) and the fixed S POT B UGS HT found 4 (3.9%). Additional annotations
version of the GitHub project openpnp/openpnp (a robotic resulted in finding 3 more bugs.
pick and place machine). The call to the buggy method that
causes the NPE is located on Line 6 of the buggy program.
D. RQ4: Reasons Bug Detectors Miss NPEs
The fix for this NPE bug consists of adding a null check for
parameter mat in saveDebugImage. We are interested in understanding the reasons why bug
Figure 5a shows an example of a bug found by CFN ULL - detectors fail to find real NPEs. We start by discussing the
NESS, E RADICATE, S POT B UGS HT, and S POT B UGS LT. Here characteristics of the bugs that the tools find based on the
we show the diff between a buggy version and the fixed version characterization of 102 bugs from our dataset and the tools
of project traccar/traccar (a GPS tracking system). The themselves (see Section III). We then discuss the characteris-
bug was that the null check was flipped, incorrectly deref- tics of those bugs that the tools fail to find.
erencing channel when null. The fix simply consists of a) CFN ULLNESS: CFN ULLNESS found 11 bugs includ-
changing the comparison operator from == to !=. A possible ing every category shown in Figure 3. These included 3
reason why I NFER did not find this bug is that I NFER does dereferences to a method return value and 2 dereferences of a
not gather information from checks. Since Figure 5a includes map object. The sound properties of CFN ULLNESS allow it to
a null check, S POT B UGS is able to reason that channel is find classes of bugs that the other tools cannot. For example,
dereferenced when null, leading to an NPE. CFN ULLNESS also found bugs due to concurrency, field
We conducted an additional experiment on a random sample initialization, generics, and reflection. The lack of necessary
of 40 programs6 from our initial set for which annotations annotations in the projects under study inhibits CFN ULL -
were inferred using IntelliJ IDEA’s Infer Nullity [7]. IntelliJ NESS’s ability to find all of the bugs in those categories.
Idea infers both @Nullable and @NotNull annotations. b) E RADICATE: E RADICATE found 20 bugs where 9
Note that 49 out of the 102 programs originally include dereferenced a method return value, 3 dereferenced an object
field, 1 retrieved a value from a map object, and the rest
6 The process could not be automated due to the IDE, thus the sample. dereferenced a method parameter. Despite using a partial

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
model of the JDK, E RADICATE missed bugs in other third- characterization; S POT B UGS does not provide complete field
party libraries. E RADICATE does not handle concurrency and sensitivity.
reflection. These limitations explain some of the false nega-
tives, while others can be explained by the lack of full field RQ4: S POT B UGS misses NPEs that require interproce-
initialization checks and dynamic dispatch. dural analysis. I NFER performs interprocedural analysis
c) I NFER: I NFER found 10 bugs that included 4 deref- but does not have a field initialization check nor does it
erences of a method parameter, 4 dereferences of a method handle some path-sensitive information from null checks.
return value (one of which is from a JDK library), a derefer- N ULL AWAY relies on nullness annotations but does not
ence of a list, and a dereference of an object field. These handle maps nor third-party libraries. E RADICATE deals
NPEs are interprocedural in nature, which aligns with our with third-party libraries better than other tools, but it
characterization of I NFER in Section III. However, I NFER did still misses bugs due to partial field initialization check-
not find the remaining NPEs that involve method parameters, ing. CFN ULLNESS provides sound analyses to handle
method return values, or object fields, which we would expect reflection and initialization which allows finding bugs that
to be captured by interprocedural analysis. One reason is that other tools cannot. However, the lack of annotations can
I NFER does not take into account existing null checks. still lead to missed bugs.
I NFER has an internal partial model of the JDK, which
enables reasoning about certain library methods. Surprisingly, E. Threats to Validity
despite the fact that I NFER supports field sensitivity, and was
successful at finding such bugs in our tests, it missed many Although we conducted this study on a substantial number
other field sensitive bugs. Note that I NFER does not have a of real-world NPEs, our results cannot be generalized. We
check for field initialization so it does not find uninitialized attempted to reduce this threat by including a large number
fields, but it does support fields set to null. Such an example of NPE bugs from a diverse set of 42 projects from two Java
is shown in Figure 5b. Additionally, I NFER does not find NPEs bug datasets. It is possible that we may have missed other
that involve reflection, concurrency, maps, or use of third-party tools that are eligible for our study. We still believe that the
libraries outside of the JDK. five tools considered are good representatives of popular and
d) N ULL AWAY: N ULL AWAY found 5 bugs, all of which widely-used state-of-the-art static bug detectors for NPEs. The
dereference a return value. This shows the challenge in placing four different mapping methods used in this paper are not
annotations in the right place to be beneficial. N ULL AWAY’s perfect, and may lead to false positives. To alleviate this threat,
main sources of false negatives are its assumptions that we manually inspected all warnings that were deemed to be
unannotated code is not null and that third-party libraries bug candidates. Anything requiring human intervention can be
do not return null. While manual tests written during our error prone and subjective. To mitigate this threat and reduce
categorization revealed correct warnings about dereferenced bias, we involved two people external to our project in the
fields, real bugs that share these characteristics were not categorization of bugs. Finally, we consulted tool developers to
detected. Such an example is shown in Figure 5b, where confirm our findings regarding tool capabilities and limitations,
an unannotated field (considered non-null) is being assigned as discussed in Section III.
null. This represents a strict violation of the assumption that
V. L ESSONS L EARNED
the field cannot hold a null value, and should result in a
warning. Finally, in the process of running N ULL AWAY, one This section describes some opportunities for improvement.
of the programs crashed the tool. The problem was due to a) Need for reducing or ranking warnings: Over 500,000
a buggy treatment of certain methods in the standard Java NPE warnings were generated across all tools and programs,
library. We reported the bug to N ULL AWAY developers, and with NPEs being in the top-2 warnings for every tool. The
it is now fixed in the latest release. average number of warnings per program was in the hundreds,
e) S POT B UGS: S POT B UGS found 9 NPEs, of which 5 which is a cumbersome amount. Because of this, we had
occur when dereferencing a method parameter, 3 when derefer- to employ a combination of mapping methods and manual
encing a method return value, and another when dereferencing verification to determine if a bug was found. In our case it is
a field. In all cases, there is at least one null check within the known that an NPE exists, and the goal is to determine whether
method for the object being dereferenced, but the programmer the tools find it. However, this is not the usual setting for tool
dereferences the object in a path that is not checked. The usage; developers do not know beforehand of the existence
null checks enable S POT B UGS to reason about the NPEs of bugs, or else the tools would not be needed. Thus, the
intraprocedurally (Section III). The remaining NPEs in our large number of warnings is especially problematic in a real
sample that dereference a method’s return value or parameter setting where true bugs are unknown and all warnings must
are not found because they require interprocedural reasoning. be inspected.
Additionally, the 17 NPEs that involve the dereference of an A bug ranking system could help in navigating the large
object field are not found by S POT B UGS. S POT B UGS fails number of warnings. All tools studied, except for N ULL AWAY,
to find any bugs dealing with reflection, concurrency, third- provide severity warning information, but this information did
party libraries, maps, and lists. This conforms to our tool not correlate to finding the NPEs under study. For example,

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
S POT B UGS provides a severity ranking: “concerning”, “trou- reflection calls. This approach has been implemented in other
bling”, “scary”, and “scariest”. However, the true bugs found analyses [28, 37, 39, 40] for Java, where analysis precision was
were not associated with the most severe category, but with the observed to improve. Indeed, incorporating the above strategy
“troubling” and “scary” categories. This shows the need for would enable the NPE bug detectors to find 13 additional
more conservative strategies to process warnings, or to label bugs, from which CFN ULLNESS successfully finds one given
warnings that are more likely to be true bugs. the existing annotations.
Two main approaches for ranking warnings are found in
the literature, and could be applied in the context of static VI. R ELATED W ORK
bug detectors for NPEs. The first solely focuses on ranking a) Static Analyzer Studies: Rutar et al. [35] compare
warnings of a specific program version without considering the static analyzers PMD, FindBugs, JLint, Bandera, and
information such as warnings produced for other versions ESC/Java 2 on a small suite of programs. The authors present
of the program. Examples in this category learn a classifier a taxonomy of bugs found by each tool showing that no tool
via methods ranging from bayesian networks, decision trees, subsumes the other. The study focuses on runtime and number
and neural networks [21, 43]. The second approach uses the of warnings produced. Johnson et al. [26] conduct a study in
difference of warnings between a previous and the current which 20 developers are interviewed on their experiences using
version of the program, or self-adapts through user feedback static analysis tools. The study finds that the main reason why
[22, 36]. A promising approach to aid static bug detectors for developers do not use tools is false positives.
NPEs would be to learn a project-specific classifier that has Habib and Pradel [20] study the static analyzers I NFER,
user-feedback on predictions. This would benefit users as the E RROR P RONE, and S POT B UGS to determine how many of all
tool learns, over time, domain-specific project characteristics, bugs in D EFECTS 4J can be found by these tools. The authors
which would eventually lead to higher precision. use the code diff and the bug report mapping methods. The
b) Need for automatically inferring nullability annota- study finds that only 27 bugs out of 594 bugs (4.5%) were
tions: There is an inherent burden in writing annotations. detected, of which only 2 were NPEs. Tomassi [41] conducts
Analyzers that depend on annotations could benefit from au- a study that compares E RROR P RONE and S POT B UGS to find
tomated inference of nullability annotations. Running IntelliJ how many of all bugs in a sample of 320 B UG S WARM artifacts
IDEA’s Infer Nullity on 40 programs enabled the tools to are found. The author found that only one bug was found
find an additional 3 bugs. This shows that there is promise in by S POT B UGS. Instead, we focus on a specific kind of bug,
annotation-based approaches for bug finding. However, there is NPEs, and present a detailed analysis of the capabilities and
room for improvement in annotation inference as the analysis the limitations of five popular tools that find NPEs.
still missed annotations that could have lead to finding more Ayewah and Pugh [12] run Coverity, Eclipse, FindBugs,
bugs. Furthermore, it was difficult to automate the process of Fortify, and XYLEM on different versions of the build system
annotating code using IntelliJ, which may prevent its use in Ant. The authors classify the null dereferences reported by
many scenarios. There exists work that applies static analysis each tool (plausible, implausible, or impossible), and explore
to infer non-null annotations for object fields in a subset of the usefulness of using null-related annotations. Most recently,
Java [25], which could be potentially used to aid annotation- Banerjee et al. [14] presented the tool N ULL AWAY and per-
based NPE bug detectors but it is not publicly available. formed a comparison to the Checker Framework’s Nullness
c) Need for reasoning about collection-like data struc- analysis [33], and I NFER’s Eradicate looking at build-time
tures: A pain point for all tools studied is reasoning about overhead. While Ayewah and Pugh [12] study false positives
the nullability of objects inside a collection-like data structure in one version of Ant, Banerjee et al. [14] focus on measuring
such as an array. Users can add annotations to indicate that false negatives in Uber’s Android apps. We study the recall
a data structure can be null, but there is no mechanism to of five popular bug detectors, including N ULL AWAY, on 102
annotate the nullability of individual elements in the data struc- real and reproducible NPEs from 42 open-source projects.
ture. CFN ULLNESS, E RADICATE, and N ULL AWAY overcome b) Tools to Find Null Pointer Dereferences: Ayewah
this challenge for map-like objects by assuming that the get et al. [13] present a static analysis tool called F IND B UGS, the
interface may return a “nullable” value. A similar approach predecessor of S POT B UGS. F IND B UGS finds a wide variety
could be adopted each time an element from other collection- of bugs including null pointer dereferences. Hovemeyer and
like data structure is retrieved. Incorporating such strategy Pugh [24] extend F IND B UGS’s NPE finding capabilities by
would enable the tools to successfully find 10 additional bugs. improving the precision of the analysis. These improvements
d) Need for reasoning about reflection: Reasoning about were a result of a better model of the core API of JDK,
reflection imposes a challenge for any static analysis. All changing how errors on exception paths are handled, improv-
of the tools in our study are unsound when it comes to ing field tracking, and finding guaranteed dereferences. We
reflection, except CFN ULLNESS. Since most of the tools include S POT B UGS in our study.
can leverage annotations, a potential approach for handling Papi et al. [33] introduce the Checker Framework, which
reflection is user-provided annotations. This is exactly what allows for pluggable type systems for Java. They evaluate
CFN ULLNESS does. This is done via a list of targets, a priori, five checkers, including the Nullness checker, running them
of what class or method is being operated on for certain over significant sized code bases. The checkers find real bugs

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
and confirmed the absence of others. We include the Checker [10] SpotBugs. https://spotbugs.github.io/, 2021.
[11] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Prin-
Framework in our study. ciples, Techniques, and Tools (2nd Edition). Addison-Wesley Longman
Nanda and Sinha [32] develop a demand-driven dataflow Publishing Co., Inc., USA, 2006.
analysis for null-dereference bugs in Java. By being path- [12] N. Ayewah and W. Pugh. Null dereference analysis in practice. In
Proceedings of the 9th ACM SIGPLAN-SIGSOFT Workshop on Program
sensitive and context-sensitive, the analysis allows for a low Analysis for Software Tools and Engineering, PASTE ’10, pages 65–72,
false positive rate, and an improved precision over F IND B UGS New York, NY, USA, 2010. ACM. doi: 10.1145/1806672.1806686. URL
and JLint. Romano et al. [34] use the analysis from Nanda and http://doi.acm.org/10.1145/1806672.1806686.
[13] N. Ayewah, D. Hovemeyer, J. D. Morgenthaler, J. Penix, and W. Pugh.
Sinha [32] to find variables and paths that lead to possible null Using static analysis to find bugs. IEEE Softw., 25(5):22–29, 2008. doi:
pointer dereferences. The authors use a genetic algorithm to 10.1109/MS.2008.130. URL https://doi.org/10.1109/MS.2008.130.
generate tests that trigger the null pointer dereferences. Logi- [14] S. Banerjee, L. Clapp, and M. Sridharan. Nullaway: Practical type-based
null safety for java. In Proceedings of the 2019 27th ACM Joint Meeting
nov et al. [29] develop a sound interprocedural analysis based on European Software Engineering Conference and Symposium on the
on abstract interpretation called expanding-scope algorithm. Foundations of Software Engineering, ESEC/FSE 2019, pages 740–750,
Madhavan and Komondoor [30] demonstrate a sound, demand- New York, NY, USA, 2019. ACM. doi: 10.1145/3338906.3338919. URL
http://doi.acm.org/10.1145/3338906.3338919.
driven, interprocedural, context-sensitive dataflow analysis to [15] C. Calcagno and D. Distefano. Infer: An automatic program verifier
verify whether a dereference will be safe or not. None of the for memory safety of C programs. In M. G. Bobaru, K. Havelund,
above tools [29, 30, 32, 34] are publicly available. G. J. Holzmann, and R. Joshi, editors, NASA Formal Methods - Third
International Symposium, NFM 2011, Pasadena, CA, USA, April 18-20,
VII. C ONCLUSION 2011. Proceedings, volume 6617 of Lecture Notes in Computer Science,
pages 459–465. Springer, 2011. doi: 10.1007/978-3-642-20398-5\ 33.
In this experience paper, we studied the effectiveness of URL https://doi.org/10.1007/978-3-642-20398-5 33.
popular Java static bug detectors CFN ULLNESS, E RADICATE, [16] C. Calcagno, D. Distefano, P. W. O’Hearn, and H. Yang. Compositional
shape analysis by means of bi-abduction. In Z. Shao and B. C. Pierce,
I NFER, N ULL AWAY, and S POT B UGS on 102 real NPEs from editors, Proceedings of the 36th ACM SIGPLAN-SIGACT Symposium
42 open-source projects. We identified the capabilities of the on Principles of Programming Languages, POPL 2009, Savannah, GA,
tools and the characteristics of the NPE bugs in our dataset. We USA, January 21-23, 2009, pages 289–300. ACM, 2009. doi: 10.1145/
1480881.1480917. URL https://doi.org/10.1145/1480881.1480917.
discussed the problem of mapping tool warnings to actual NPE [17] C. Calcagno, D. Distefano, J. Dubreil, D. Gabi, P. Hooimeijer, M. Luca,
bugs, and investigated four mapping methods, including two P. W. O’Hearn, I. Papakonstantinou, J. Purbrick, and D. Rodriguez.
new approaches that leverage stack trace and code coverage Moving fast with software verification. In K. Havelund, G. J. Holz-
mann, and R. Joshi, editors, NASA Formal Methods - 7th International
information, from which the stack-trace based was the most Symposium, NFM 2015, Pasadena, CA, USA, April 27-29, 2015, Pro-
effective. Overall, the tools detected a total of 30 out of ceedings, volume 9058 of Lecture Notes in Computer Science, pages
102 bugs. We conducted an additional experiment annotating 3–11. Springer, 2015. doi: 10.1007/978-3-319-17524-9\ 1. URL
https://doi.org/10.1007/978-3-319-17524-9 1.
40 programs using IntelliJ, which resulted in 3 new bugs [18] M. Christakis and C. Bird. What developers want and need from
found. Finally, we leveraged the characteristics of the tools program analysis: an empirical study. In D. Lo, S. Apel, and S. Khurshid,
and the bugs in our dataset to gain insights into why the tools editors, Proceedings of the 31st IEEE/ACM International Conference on
Automated Software Engineering, ASE 2016, Singapore, September 3-
missed certain types of bugs. We concluded by discussing 7, 2016, pages 332–343. ACM, 2016. doi: 10.1145/2970276.2970347.
opportunities for improving NPE bug detection. We provide URL https://doi.org/10.1145/2970276.2970347.
the link to a public repository that contains both our scripts [19] W. Dietl, S. Dietzel, M. D. Ernst, K. Muslu, and T. W. Schiller.
Building and using pluggable type-checkers. In R. N. Taylor, H. C.
and the data produced in our experimental evaluation. Gall, and N. Medvidovic, editors, Proceedings of the 33rd International
Conference on Software Engineering, ICSE 2011, Waikiki, Honolulu ,
ACKNOWLEDGMENT HI, USA, May 21-28, 2011, pages 681–690. ACM, 2011. doi: 10.1145/
This work was supported in part by National Science 1985793.1985889. URL https://doi.org/10.1145/1985793.1985889.
[20] A. Habib and M. Pradel. How many of all bugs do we find? a study of
Foundation award CNS-2016735, a Facebook Testing and Ver- static bug detectors. In M. Huchard, C. Kästner, and G. Fraser, editors,
ification research award, and a UC Davis Graduate Fellowship. Proceedings of the 33rd ACM/IEEE International Conference on Auto-
We would like to thank Aditya V. Thakur and Premkumar T. mated Software Engineering, ASE 2018, Montpellier, France, September
3-7, 2018, pages 317–328. ACM, 2018. doi: 10.1145/3238147.3238213.
Devanbu for their feedback and suggestions. We also thank URL https://doi.org/10.1145/3238147.3238213.
Amy Cu, Raisa Putri, Robert Furth, and Ryan Jae for their help [21] Q. Hanam, L. Tan, R. Holmes, and P. Lam. Finding patterns in static
manually classifying NPEs, and replicating our experimental analysis alerts: improving actionable alert ranking. In P. T. Devanbu,
S. Kim, and M. Pinzger, editors, 11th Working Conference on Mining
results. Lastly, we would like to thank the developers of each Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014,
tool for their prompt answers to our questions. Hyderabad, India, pages 152–161. ACM, 2014. doi: 10.1145/2597073.
2597100. URL https://doi.org/10.1145/2597073.2597100.
R EFERENCES [22] K. Heo, M. Raghothaman, X. Si, and M. Naik. Continuously reasoning
[1] CVE-2013-1134. https://cve.mitre.org/cgi-bin/cvename.cgi?name= about programs using differential bayesian inference. In K. S. McKinley
CVE-2003-1134, 2021. and K. Fisher, editors, Proceedings of the 40th ACM SIGPLAN Con-
[2] Checkstyle. https://github.com/checkstyle/checkstyle, 2021. ference on Programming Language Design and Implementation, PLDI
[3] CVE Null Pointer. https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword= 2019, Phoenix, AZ, USA, June 22-26, 2019, pages 561–575. ACM, 2019.
null+pointer, 2021. doi: 10.1145/3314221.3314616. URL https://doi.org/10.1145/3314221.
[4] Eradicate. https://fbinfer.com/docs/eradicate.html, 2021. 3314616.
[5] Error Prone. https://github.com/google/error-prone, 2021. [23] D. Hovemeyer and W. Pugh. Finding bugs is easy. In J. M. Vlissides and
[6] Infer. http://fbinfer.com/, 2021. D. C. Schmidt, editors, Companion to the 19th Annual ACM SIGPLAN
[7] IntelliJ. https://www.jetbrains.com/idea/, 2021. Conference on Object-Oriented Programming, Systems, Languages, and
[8] NullAway. https://github.com/uber/NullAway, 2021. Applications, OOPSLA 2004, October 24-28, 2004, Vancouver, BC,
[9] PMD. https://pmd.github.io/, 2021.

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.
Canada, pages 132–136. ACM, 2004. doi: 10.1145/1028664.1028717. based testing of null pointer exceptions. In Fourth IEEE International
URL https://doi.org/10.1145/1028664.1028717. Conference on Software Testing, Verification and Validation, ICST 2011,
[24] D. Hovemeyer and W. Pugh. Finding more null pointer bugs, but not Berlin, Germany, March 21-25, 2011, pages 160–169. IEEE Computer
too many. In M. Das and D. Grossman, editors, Proceedings of the 7th Society, 2011. doi: 10.1109/ICST.2011.49. URL https://doi.org/10.1109/
ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software ICST.2011.49.
Tools and Engineering, PASTE’07, San Diego, California, USA, June [35] N. Rutar, C. B. Almazan, and J. S. Foster. A comparison of bug
13-14, 2007, pages 9–14. ACM, 2007. doi: 10.1145/1251535.1251537. finding tools for java. In Proceedings of the 15th International Sym-
URL https://doi.org/10.1145/1251535.1251537. posium on Software Reliability Engineering, ISSRE ’04, pages 245–
[25] L. Hubert, T. P. Jensen, and D. Pichardie. Semantic foundations and 256, Washington, DC, USA, 2004. IEEE Computer Society. doi:
inference of non-null annotations. In G. Barthe and F. S. de Boer, editors, 10.1109/ISSRE.2004.1. URL http://dx.doi.org/10.1109/ISSRE.2004.1.
Formal Methods for Open Object-Based Distributed Systems, 10th IFIP [36] H. Shen, J. Fang, and J. Zhao. Efindbugs: Effective error ranking for
WG 6.1 International Conference, FMOODS 2008, Oslo, Norway, June findbugs. In Fourth IEEE International Conference on Software Testing,
4-6, 2008, Proceedings, volume 5051 of Lecture Notes in Computer Sci- Verification and Validation, ICST 2011, Berlin, Germany, March 21-25,
ence, pages 132–149. Springer, 2008. doi: 10.1007/978-3-540-68863-1\ 2011, pages 299–308. IEEE Computer Society, 2011. doi: 10.1109/
9. URL https://doi.org/10.1007/978-3-540-68863-1 9. ICST.2011.51. URL https://doi.org/10.1109/ICST.2011.51.
[26] B. Johnson, Y. Song, E. R. Murphy-Hill, and R. W. Bowdidge. Why [37] M. Sridharan, S. Artzi, M. Pistoia, S. Guarnieri, O. Tripp, and R. Berg.
don’t software developers use static analysis tools to find bugs? In F4F: taint analysis of framework-based web applications. In C. V. Lopes
D. Notkin, B. H. C. Cheng, and K. Pohl, editors, 35th International and K. Fisher, editors, Proceedings of the 26th Annual ACM SIGPLAN
Conference on Software Engineering, ICSE ’13, San Francisco, CA, Conference on Object-Oriented Programming, Systems, Languages, and
USA, May 18-26, 2013, pages 672–681. IEEE Computer Society, 2013. Applications, OOPSLA 2011, part of SPLASH 2011, Portland, OR, USA,
doi: 10.1109/ICSE.2013.6606613. URL https://doi.org/10.1109/ICSE. October 22 - 27, 2011, pages 1053–1068. ACM, 2011. doi: 10.1145/
2013.6606613. 2048066.2048145. URL https://doi.org/10.1145/2048066.2048145.
[27] R. Just, D. Jalali, and M. D. Ernst. Defects4j: a database of existing faults [38] F. Thung, Lucia, D. Lo, L. Jiang, F. Rahman, and P. T. Devanbu. To
to enable controlled testing studies for java programs. In C. S. Pasareanu what extent could we detect field defects? an empirical study of false
and D. Marinov, editors, International Symposium on Software Testing negatives in static bug finding tools. In M. Goedicke, T. Menzies, and
and Analysis, ISSTA ’14, San Jose, CA, USA - July 21 - 26, 2014, M. Saeki, editors, IEEE/ACM International Conference on Automated
pages 437–440. ACM, 2014. doi: 10.1145/2610384.2628055. URL Software Engineering, ASE’12, Essen, Germany, September 3-7, 2012,
https://doi.org/10.1145/2610384.2628055. pages 50–59. ACM, 2012. doi: 10.1145/2351676.2351685. URL https:
[28] O. Lhoták and L. J. Hendren. Scaling java points-to analysis using //doi.org/10.1145/2351676.2351685.
SPARK. In G. Hedin, editor, Compiler Construction, 12th International [39] F. Tip, C. Laffra, P. F. Sweeney, and D. Streeter. Practical experience
Conference, CC 2003, Held as Part of the Joint European Conferences with an application extractor for java. In B. Hailpern, L. M. Northrop,
on Theory and Practice of Software, ETAPS 2003, Warsaw, Poland, April and A. M. Berman, editors, Proceedings of the 1999 ACM SIGPLAN
7-11, 2003, Proceedings, volume 2622 of Lecture Notes in Computer Conference on Object-Oriented Programming Systems, Languages &
Science, pages 153–169. Springer, 2003. doi: 10.1007/3-540-36579-6\ Applications (OOPSLA ’99), Denver, Colorado, USA, November 1-5,
12. URL https://doi.org/10.1007/3-540-36579-6 12. 1999, pages 292–305. ACM, 1999. doi: 10.1145/320384.320414. URL
[29] A. Loginov, E. Yahav, S. Chandra, S. Fink, N. Rinetzky, and M. G. https://doi.org/10.1145/320384.320414.
Nanda. Verifying dereference safety via expanding-scope analysis. In [40] F. Tip, P. F. Sweeney, C. Laffra, A. Eisma, and D. Streeter. Practical
B. G. Ryder and A. Zeller, editors, Proceedings of the ACM/SIGSOFT extraction techniques for java. ACM Trans. Program. Lang. Syst., 24
International Symposium on Software Testing and Analysis, ISSTA 2008, (6):625–666, 2002. doi: 10.1145/586088.586090. URL https://doi.org/
Seattle, WA, USA, July 20-24, 2008, pages 213–224. ACM, 2008. 10.1145/586088.586090.
doi: 10.1145/1390630.1390657. URL https://doi.org/10.1145/1390630. [41] D. A. Tomassi. Bugs in the wild: examining the effectiveness of
1390657. static analyzers at finding real-world bugs. In Proceedings of the
[30] R. Madhavan and R. Komondoor. Null dereference verification via 2018 ACM Joint Meeting on European Software Engineering Con-
over-approximated weakest pre-conditions analysis. In C. V. Lopes ference and Symposium on the Foundations of Software Engineering,
and K. Fisher, editors, Proceedings of the 26th Annual ACM SIGPLAN ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-
Conference on Object-Oriented Programming, Systems, Languages, and 09, 2018, pages 980–982, 2018. doi: 10.1145/3236024.3275439. URL
Applications, OOPSLA 2011, part of SPLASH 2011, Portland, OR, USA, https://doi.org/10.1145/3236024.3275439.
October 22 - 27, 2011, pages 1033–1052. ACM, 2011. doi: 10.1145/ [42] D. A. Tomassi, N. Dmeiri, Y. Wang, A. Bhowmick, Y.-C. Liu, P. T.
2048066.2048144. URL https://doi.org/10.1145/2048066.2048144. Devanbu, B. Vasilescu, and C. Rubio-González. Bugswarm: Mining
[31] G. J. Myers, T. Badgett, T. M. Thomas, and C. Sandler. The art of and continuously growing a dataset of reproducible failures and fixes.
software testing, volume 2. Wiley Online Library, 2004. In Proceedings of the 41st International Conference on Software Engi-
[32] M. G. Nanda and S. Sinha. Accurate interprocedural null-dereference neering, ICSE ’19, pages 339–349, Piscataway, NJ, USA, 2019. IEEE
analysis for java. In Proceedings of the 31st International Conference Press. doi: 10.1109/ICSE.2019.00048. URL https://doi.org/10.1109/
on Software Engineering, ICSE ’09, pages 133–143, Washington, DC, ICSE.2019.00048.
USA, 2009. IEEE Computer Society. doi: 10.1109/ICSE.2009.5070515. [43] L. Yu, W. Tsai, W. Zhao, and F. Wu. Predicting defect priority based on
URL http://dx.doi.org/10.1109/ICSE.2009.5070515. neural networks. In L. Cao, J. Zhong, and Y. Feng, editors, Advanced
[33] M. M. Papi, M. Ali, T. L. C. Jr., J. H. Perkins, and M. D. Ernst. Data Mining and Applications - 6th International Conference, ADMA
Practical pluggable types for java. In B. G. Ryder and A. Zeller, 2010, Chongqing, China, November 19-21, 2010, Proceedings, Part II,
editors, Proceedings of the ACM/SIGSOFT International Symposium on volume 6441 of Lecture Notes in Computer Science, pages 356–367.
Software Testing and Analysis, ISSTA 2008, Seattle, WA, USA, July 20- Springer, 2010. doi: 10.1007/978-3-642-17313-4\ 35. URL https://
24, 2008, pages 201–212. ACM, 2008. doi: 10.1145/1390630.1390656. doi.org/10.1007/978-3-642-17313-4 35.
URL https://doi.org/10.1145/1390630.1390656. [44] M. Zhivich and R. K. Cunningham. The real cost of software errors.
[34] D. Romano, M. D. Penta, and G. Antoniol. An approach for search IEEE Secur. Priv., 7(2):87–90, 2009. doi: 10.1109/MSP.2009.56. URL
https://doi.org/10.1109/MSP.2009.56.

Authorized licensed use limited to: BEIJING UNIVERSITY OF POST AND TELECOM. Downloaded on March 21,2022 at 13:17:42 UTC from IEEE Xplore. Restrictions apply.

You might also like