[Figure 5 graphic not recoverable: for each of the two examples, panels labeled G, S, and G|S.]
Figure 5: Two examples of an input graph G, a graph grammar S inferred from G, and the result of compressing G on S, denoted G|S. (i) A non-recursive grammar which does not contain any embedded non-terminal nodes or edges. (ii) Recursive grammar on node A containing an embedded non-terminal node that can match either nodes of type B or C. The grey box containing the “(S)” label represents a recursive connection instruction.
4.1 Preparing Heap Dumps
We obtained heap dumps from Sun’s Java VM (version 1.6) using either the jmap tool or the -XX:+HeapDumpOnOutOfMemoryError option, which triggers an automatic heap dump when the JVM runs out of memory. We use the com.sun.tools.hat.* API to process the dump and extract the reachability graph. Each node in the graph corresponds to a Java object, which is labeled with its class. Each edge in the graph corresponds to a reference, labeled with the name of the field containing the reference. We label edges from arrays with $array$, ignoring the specific index at which a reference is stored. We remove all edges that correspond to weak and soft references, because weak and soft references to an object do not prevent that object from being reclaimed if the garbage collector runs low on memory.

4.2 Synthetic Examples

[Figure 6 graphic: a java.util.HashMap$Entry node with a key edge to java.lang.String (whose value edge leads to char[]) and a value edge to HiddenLeak$Legitimate$Leak.]
Figure 6: Most frequent grammar mined from the HiddenLeak example in Figure 2. The $ character denotes nested classes. For example, the node HiddenLeak$Legitimate$Leak corresponds to Java class HiddenLeak.Legitimate.Leak in the source code.

We first present the results for the motivating example presented in Section 2. Figure 6 shows the most frequently occurring mined grammar, which represents a (key, value) pair anchored by an instance of type HashMap.Entry. This information directs an expert’s attention immediately to a HashMap mapping keys of type String to values of type HiddenLeak.Legitimate.Leak. We found that 70% of the paths from the instances produced by this grammar to GC roots in the original graph exhibit the following structure, where java.lang.Class (top) is a root node of the dominator tree, java.util.HashMap$Entry (bottom) corresponds to the top node in the best grammar displayed in Figure 6, and edge labels (in parentheses) stem from the explicit and implicit variable names used in the source code from Figure 1:

    java.lang.Class
    | (legitimateMap)
    +-+-> java.util.HashMap
    | (table)
    +-+-> [Ljava.util.HashMap$Entry;
    | ($array$)
    +-+-> java.util.HashMap$Entry
    | (value)
    +-+-> HiddenLeak$Legitimate
    | (leakyMap)
    +-+-> java.util.HashMap
    | (table)
    +-+-> [Ljava.util.HashMap$Entry;
    | ($array$)
    +-+-> java.util.HashMap$Entry

The remaining paths contain an additional edge Entry.next, which represents the case in which a hash collision led to chaining. This information immediately describes the location of all instances of Leak objects in the graph, alerting the developer that a large number of these structures has accumulated underneath each Legitimate object. As discussed in Section 2, a size-based analysis of the dominator tree as done in Eclipse’s Memory Analyzer would lead only to the bucket array of the HashMap object referred to by ‘legitimateMap’ and require manual inspection of the subtree emanating from it.

Since programmers often choose container types depending on specific space/time trade-offs related to an expected access pattern, we then investigated whether the leaking structure would be found if a different container type had been used. We replaced both uses of HashMap with class TreeMap, which uses a red-black tree implementation. Our algorithm correctly identified a grammar consisting of TreeMap.Entry objects that refer to a (key, value) pair, near-identical to the grammar shown in Figure 6. In addition, the aggregated path was expressed by the recursive grammar shown in Figure 7, which covers over 99% of observed paths from a root to the grammar’s instances.

    +-+-> java.lang.Class
    | (legitimateMap)
    +-+-> java.util.TreeMap
    | (root)
    {
    +-+-> java.util.TreeMap$Entry
    | ( right | left )
    +-+-> java.util.TreeMap$Entry
    }*
    | (value)
    +-+-> leaks.TreeMapLeaks$Leak
    | (leakyMap)
    +-+-> java.util.TreeMap
    | (root)
    {
    +-+-> java.util.TreeMap$Entry
    | ( right | left )
    +-+-> java.util.TreeMap$Entry
    }*

Figure 7: Resulting root path grammar if a TreeMap container is used for the example shown in Figure 2.

This path grammar identifies the leak as hidden in a tree of trees and provides a global picture that would be nearly impossible to obtain by visual inspection. The use of recursive productions enabled the algorithm to identify a classic container data structure (a binary tree) without any a priori knowledge. The use of recursive grammars is also essential for other recursive data structures, such as linked lists. The following example demonstrates that our mining approach easily captures such data structures, even when they occur embedded in application classes (rather than in dedicated collection classes). Class OOML in Figure 8 embeds a link element next and an application-specific payload field. The main method contains an infinite loop that adds elements to a list held in a local variable ‘root’ until heap memory is exhausted.

    public class OOML {
        OOML next;       // next element
        String payload;

        OOML(String payload, OOML next) {
            this.payload = payload;
            this.next = next;
        }

        // add nodes to list until out of memory
        public static void main(String[] av) {
            OOML root = new OOML("root", null);
            for (int i = 0; ; i++)
                root = new OOML("content", root);
        }
    }

Figure 8: A singly linked list embedded in an application class.

[Figure 9 graphic: grammar (S) in which an OOML node has a payload edge to java.lang.String and a next edge to another OOML node.]
Figure 9: Most frequent grammar in synthetic linked list example.

Figure 9 shows the most frequent subgraph, which contains a recursive production OOML →next OOML, representing a singly linked list. The root path aggregation showed the location of its instances in the graph:

    +-+-> Java_Local
    | (root)
    {
    +-+-> OOML
    | (next)
    +-+-> OOML
    }*

4.3 Web Application Heapdumps

4.3.1 Apache Tomcat/J2EE
We obtained a series of heap dumps that resulted from recurring out-of-memory situations during the development of the LibX Edition Builder, a complex J2EE web application that makes heavy use of multiple frameworks [2] including the Apache Tomcat 5.5 servlet container. These heap dumps were generated over a period of several months. When the server ran out of memory during intense testing, developers would simply save a heap dump and restart the server, without further immediate investigation of the cause.

We obtained a total of 20 heap dumps, varying in size from 33 to 47 MB. In all of these dumps, the grammar shown in Figure 10 percolated to the top. This grammar represents instances of type BeanInfoManager that reference a HashMap through their mPropertyByName field. 80% of the root paths are expressed via the following grammar:

    +-+-> org.apache.catalina.loader.StandardClassLoader
    | (classes)
    +-+-> java.util.Vector
    | (elementData)
    +-+-> [Ljava.lang.Object;
    | ($array$)
    +-+-> java.lang.Class
    | (mBeanInfoManagerByClass)
    +-+-> java.util.HashMap
    | (table)
    +-+-> [Ljava.util.HashMap$Entry;
    | ($array$)
    +-+-> java.util.HashMap$Entry
    | (value)
    +-+-> org.apache.commons.el.BeanInfoManager

This grammar shows that the majority of these objects are kept alive via a field named mBeanInfoManagerByClass. Since the field is associated with a node of type Class, it represents a static field. Examination of an actual path instance reveals that this static field belongs to class org.apache.commons.el.BeanInfoManager.

Similar to the HiddenLeak example, the Eclipse analyzer reported the (legitimate!) HashMap.Entry array stored in the mBeanInfoManagerByClass field as the accumulation point, but could not provide insights into the structure of the objects kept in this table without tedious manual inspection.

We eventually found that this leak had already been reported by another developer against the Apache Tomcat server (Bug 38048: Classloader leak caused by EL evaluation). Interestingly, the original bug report had received little attention, likely because the bug reporter included only a single trace to a leaked object reachable from the BeanInfoManager class.
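
The root-path percentages quoted in Sections 4.2 and 4.3.1 (for example, the share of root paths expressed by one path grammar) can be illustrated with a minimal Java sketch. It is not the aggregation procedure used in this work, whose details are not given here. It assumes that each root path is already available as a list of step labels of the hypothetical form Class.field, collapses runs of identical consecutive steps into one starred step (a crude stand-in for the { ... }* notation), and reports the fraction of paths that share each resulting signature. The names RootPathSummary, signature, and shares are illustrative only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RootPathSummary {

    // Collapse runs of identical consecutive steps, e.g.
    // [Java_Local.root, OOML.next, OOML.next] becomes "Java_Local.root -> (OOML.next)*".
    static String signature(List<String> path) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < path.size()) {
            String step = path.get(i);
            int j = i + 1;
            while (j < path.size() && path.get(j).equals(step)) {
                j++;
            }
            if (sb.length() > 0) {
                sb.append(" -> ");
            }
            // A run of two or more identical steps is rendered as a starred group.
            sb.append(j - i > 1 ? "(" + step + ")*" : step);
            i = j;
        }
        return sb.toString();
    }

    // Fraction of all paths that share each signature, in first-seen order.
    static Map<String, Double> shares(List<List<String>> paths) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> path : paths) {
            counts.merge(signature(path), 1, Integer::sum);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) paths.size());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical paths from a GC root to list elements at different depths.
        List<List<String>> paths = Arrays.asList(
            Arrays.asList("Java_Local.root", "OOML.next", "OOML.next"),
            Arrays.asList("Java_Local.root", "OOML.next", "OOML.next", "OOML.next"),
            Arrays.asList("Java_Local.root"));
        shares(paths).forEach((sig, share) -> System.out.println(sig + " : " + share));
    }
}
```

In this simplification, paths of different list depth fall under the same signature only when the repeated step occurs at least twice; a fuller treatment would need the recursive productions of the mined grammar rather than a purely textual run-length collapse.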
[Figure graphics (Figure 10 and the MVEL grammar figure) not recoverable. Surviving node labels: org.apache.commons.el.BeanInfoManager, java.util.HashMap, java.util.HashMap$Entry, org.apache.commons.el.BeanInfoProperty, org.mvel.ast.RegExMatch, org.mvel.ast.ThisValDeepPropertyNode, org.mvel.ASTLinkedList, [C. Surviving edge labels: mPropertyByName, $array$, name, next, value.]

    | (firstASTNode)
    {
    +-+-> org.mvel.ast.RegExMatch
    | (nextASTNode)
    +-+-> org.mvel.ast.RegExMatch
    }*

This path indicates that the cause of the memory exhaus[...]

[...] as memory exhaustion due to a large object kept alive by a long list of RegExMatch objects.

[Runtime plot graphic not recoverable; surviving axis ticks: 0, 200, 400 on one axis and 0 to 4000 in steps of 500 on the other.]
about 2–3 minutes. Conversely, the MVEL dumps involve significant recursion and several hundreds of thousands of instances. Furthermore, the path summarization for the MVEL constructs requires greater work since we must traverse up the linked list to observe the root path for even a single instance.

Finally, we compared the runtime of our algorithm when used on the dominator tree versus the original heap dump graph. We chose one particular dump, tomcat.sep20 from Table 1, to make the comparison. We used this dump because the Tomcat leak graph grammar does not display significant recursion and this dump is one of many good representatives of the leak class. In order to compare the runtimes, we excluded the path summarization step that was included in the runtime plot in Fig. 12 because this step would not be comparable between graphs. We note that the preprocessing step of removing weak and soft references will differ between graphs as well (in fact it is less complex in the dominator tree), but the resulting graphs are comparable. Further, we found that the most frequent graph grammar in the dominator tree is a subgraph of the most frequent graph grammar in the full heap dump graph, thus requiring more iterations of candidate generation and therefore more runtime for discovery. To enable a fair runtime comparison, we considered the time required to generate the most frequent single-edge graph grammar from the dominator tree versus from the full graph. We found that the dominator tree accomplished this task in 21 seconds versus 38 seconds in the full graph. We also compared the complete runtime for the graph grammar in the full heap dump graph for tomcat.sep20 versus the runtime on a dominator tree from a similar heap dump, tomcat.nov05, because the discovered graph grammar from the dominator tree in tomcat.nov05 was closer in size to that found in the full heap dump graph for tomcat.sep20 and composed of the same leak, but was actually larger and more frequent. We found that in the dominator tree the runtime was 130 seconds versus 426 seconds in the full heap dump graph. Our results show that we obtain ∼46% runtime reduction for identifying small graph grammars and ∼69% runtime reduction for large graph grammars when the percentage of size reduction is only 36%. This suggests that mining the dominator tree is not only quicker due to size reduction, but also because its tree structure contains less noise and redundancy, thereby simplifying the mining process.

5. RELATED WORK
Our research combines ideas from software engineering and graph mining. We discuss related work in each of these areas.

5.1 Memory Leak Detection Tools
One of the first systems to debug leaks exploited visualization, allowing a user to interactively focus on suspected problem areas in the heap [26]. Most recent existing leak detection tools use temporal information, including object age and staleness, that is obtained by monitoring a program as it runs. For instance, IBM’s LeakBot [21–23] acquires snapshots at multiple times during the execution of a program,
applies heuristics to identify leak candidates, and monitors how they evolve over time.

Minimizing both the space and runtime overhead of dynamic analyses has been the subject of intense study. Space overhead is incurred because object allocation sites and last access times must be recorded; runtime overhead because this information must be continuously updated. Bell and Sleigh [3] use a novel encoding to minimize space overhead for allocation sites. To minimize the runtime cost, statistical profiling approaches have been developed [13]. Cork [16] combines low-overhead statistical profiling with type-slicing. Some profilers, notably the NetBeans profiler, use information already kept by generational collectors to determine object age. Lastly, hardware support for monitoring memory access events has been proposed in [30].

By contrast, our approach explores mining information from a single heap dump, which is often the only source of information available when out-of-memory errors occur unexpectedly, which is the common case in production environments in which dynamic tools are rarely deployed. Our work is complementary to dynamic approaches. Mined structural information is likely to enhance information these tools can provide, especially in the common scenario in which software engineers diagnose suspected leaks in code with which they are not familiar. Moreover, the ability to identify data structures could be exploited to automatically infer which operations are add/delete operations on containers, which could benefit approaches that rely on monitoring the membership of object containers to identify leaks [33].

In the context of languages with explicit memory management, several static analyses have been developed that identify where a programmer failed to deallocate memory [8, 14, 25, 32]. Similarly, trace-based tools such as Purify [12] or Valgrind [24] can identify unreachable objects in such environments. By comparison, the garbage-collected languages at which our analysis aims do not employ explicit deallocation; we aim to identify reachable objects that are unlikely to be accessed in the future. Lastly, rather than eliminating the source of leaks, some systems implement mitigation strategies such as swapping objects to disk [4].

5.2 Graph Mining
Graph mining is a well-studied field that expands far in both breadth and depth. Initial works such as [18, 35] focused on the problem of discovering frequent subgraphs and these algorithms have been expanded in several directions over the past decade [6, 7, 34, 36, 37]. Building upon FP-tree data structures [11], fast data structures [1] have also been developed. Similar to our work, graph mining algorithms have been tailored toward specific application domains where the structure of the desired subgraphs can be exploited in the discovery process.

Cook and Holder’s work [9] takes a different approach to graph mining by finding a single, “best” subgraph as opposed to all subgraphs frequent above some threshold. This approach uses a scoring function based on the minimum description length (MDL) principle and is therefore computationally more complex. This work has also been expanded upon for the use of finding highly descriptive graph grammars [15, 17] which consider the recursive nature of the subgraphs and allow for variability in the edge and node labels.

Graph grammars are synonymous with other formal language grammars, with the difference being that the “sentences” being generated are connected graphs. Graph grammars have an array of applications, but have generally been researched from a theoretical perspective for graph generation [10, 28] as opposed to inference problems as studied here. The benefit of graph grammars is that they can capture richer information about the connectivity of subgraphs than traditional frequent subgraphs. Although these graph grammars are primarily context-free and therefore lossy, they provide a more descriptive representation of a subgraph than just a frequency count. Our work follows the graph grammar philosophy but applies it toward the characterization of dominator trees, which has not been studied before.

5.3 Data Mining for Software Engineering
Data from programming projects (code, bug reports, documentation, runtime snapshots, heap dumps) is now so plentiful that data mining approaches have been investigated toward software engineering goals (see [31] for a survey). Graph data, in particular, resurfaces in many guises such as call graphs, dependencies across subprojects, and heap dumps. Graph mining techniques have been used minimally for program diagnosis. For instance, program behavior graphs have been mined for frequent closed subgraphs that become features in a classifier to predict the existence of “noncrashing” bugs [20]. Behavior graphs were also mined with the LEAP algorithm [34] in [5] to identify discriminative subgraphs signifying bug signatures. However, to the best of our knowledge, nobody has investigated the role of mining heap dumps for detecting memory leaks or used a graph grammar mining tool.

6. CONCLUSION
We have presented a general and expressive framework for diagnosing memory leaks using frequent grammar mining. Our work extends the arsenal of memory leak diagnosis tools available to software developers. For the KDD community, we have introduced the notion of dominators and how they possess sufficient statistics for mining certain types of frequent subgraphs. The experimental results are promising in their potential to debug leaks when other state-of-the-art tools cannot. We expect our algorithm to be used as a complement to tools like Eclipse MAT.

Our future work revolves around three themes. First, we seek to embed our algorithm in a runtime infrastructure so that it can track leaking subgraphs as they build up over time. Second, we seek to investigate the theoretical properties of dominators and whether they can support a frequent pattern growth [11] style of subgraph mining. This approach will allow us to process larger heap dumps than our current approach. Third, we plan to perform a quantitative evaluation comparing the quality of our reports to existing tools. Such quantitative comparisons require the definition of a metric, which could be derived by approximating the number of lines of code a user would have to investigate to verify the presence or absence of a bug, as proposed in [27].

Acknowledgements
This work is supported in part by US NSF grant CCF-0937133 and the Institute for Critical Technology and Applied Science (ICTAS) at Virginia Tech. Also, we would like to thank Jongsoo Park, the developer of the dominator tree
algorithm used in the Boost C++ libraries, for his ready responses to our questions and comments.

7. REFERENCES
[1] M. Akbar and R. Angryk. Frequent pattern-growth approach for document organization. In CIKM ’08, pages 77–82, 2008.
[2] A. Bailey and G. Back. LibX – a Firefox extension for enhanced library access. Library Hi Tech, 24(2):290–304, 2006.
[3] M. Bond and K. McKinley. Bell: bit-encoding online memory leak detection. In ASPLOS-XII ’06, pages 61–72, 2006.
[4] M. Bond and K. McKinley. Tolerating memory leaks. In OOPSLA ’08, pages 109–126, 2008.
[5] H. Cheng, D. Lo, Y. Zhou, X. Wang, and X. Yan. Identifying bug signatures using discriminative graph mining. In ISSTA ’09, pages 141–152, New York, NY, USA, 2009.
[6] H. Cheng, X. Yan, J. Han, and P. Yu. Direct discriminative pattern mining for effective classification. In ICDE ’07, pages 169–178, 2008.
[7] C. Chen, X. Yan, F. Zhu, and J. Han. gApprox: Mining frequent approximate patterns from a massive network. In ICDM ’07, pages 445–450, 2007.
[8] S. Cherem, L. Princehouse, and R. Rugina. Practical memory leak detection using guarded value-flow analysis. In PLDI ’07, pages 480–491, 2007.
[9] D. Cook and L. Holder. Substructure discovery using minimum description length and background knowledge. JAIR, 1:231–255, 1994.
[10] J. Engelfriet and G. Rozenberg. Graph grammars based on node rewriting: an introduction to NLC graph grammars. In Graph grammars and their application to computer science: 4th Intl. Workshop, pages 12–23, 1991.
[11] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD ’00, pages 1–12, 2000.
[12] R. Hastings and B. Joyce. Purify: A tool for detecting memory leaks and access errors in C and C++ programs. In Winter USENIX Conference, pages 125–138, 1992.
[13] M. Hauswirth and T. Chilimbi. Low-overhead memory leak detection using adaptive statistical profiling. In ASPLOS-XI ’04, pages 156–164, 2004.
[14] D. Heine and M. Lam. A practical flow-sensitive and context-sensitive C and C++ memory leak detector. In PLDI ’03, pages 168–181, 2003.
[15] I. Jonyer, L. Holder, and D. Cook. MDL-based context-free graph grammar induction and applications. IJAIT, 13(1):65–79, 2004.
[16] M. Jump and K. McKinley. Cork: dynamic memory leak detection for garbage-collected languages. In POPL ’07, pages 31–38, 2007.
[17] J. Kukluk, L. Holder, and D. Cook. Inference of node and edge replacement graph grammars. In ICML Grammar Induction Workshop ’07, 2007.
[18] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM ’01, pages 313–320, 2001.
[19] T. Lengauer and R. Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Trans. Program. Lang. Syst., 1(1):121–141, 1979.
[20] C. Liu, X. Yan, H. Yu, J. Han, and P. Yu. Mining behavior graphs for “backtrace” of noncrashing bugs. In SDM ’05, pages 286–297.
[21] N. Mitchell. The runtime structure of object ownership. In D. Thomas, editor, ECOOP ’06, 2006.
[22] N. Mitchell and G. Sevitsky. LeakBot: An automated and lightweight tool for diagnosing memory leaks in large Java applications. In ECOOP ’03, 2003.
[23] N. Mitchell and G. Sevitsky. The causes of bloat, the limits of health. In OOPSLA ’07, pages 245–260, 2007.
[24] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI ’07, pages 89–100, 2007.
[25] M. Orlovich and R. Rugina. Memory leak analysis by contradiction. In Lecture Notes in Computer Science, volume 4134, pages 405–424. Springer, 2006.
[26] W. De Pauw and G. Sevitsky. Visualizing reference patterns for solving memory leaks in Java. Concurrency - Practice and Experience, 12(14):1431–1454, 2000.
[27] M. Renieris and S. Reiss. Fault localization with nearest neighbor queries. In ASE ’03, pages 30–39, 2003.
[28] G. Rozenberg. Handbook of Graph Grammars and Computing by Graph Transformation, volume 1. 1997.
[29] E. Tilevich and G. Back. Program, enhance thyself! demand-driven pattern-oriented program enhancement. In AOSD ’08, pages 13–24, April 2008.
[30] G. Venkataramani, B. Roemer, Y. Solihin, and M. Prvulovic. Memtracker: Efficient and programmable support for memory access monitoring and debugging. In HPCA ’07, pages 273–284, 2007.
[31] T. Xie, S. Thummalapenta, D. Lo, and C. Liu. Data Mining for Software Engineering. IEEE Computer, 42(8):35–42, Aug 2009.
[32] Y. Xie and A. Aiken. Context- and path-sensitive memory leak detection. In ESEC/FSE-13, pages 115–125, 2005.
[33] G. Xu and A. Rountev. Precise memory leak detection for Java software using container profiling. In ICSE ’08, pages 151–160, 2008.
[34] X. Yan, H. Cheng, J. Han, and P. Yu. Mining significant graph patterns by leap search. In SIGMOD ’08, pages 433–444, 2008.
[35] X. Yan and J. Han. gSpan: graph-based substructure pattern mining. In ICDM ’02, pages 721–724, 2002.
[36] X. Yan and J. Han. CloseGraph: mining closed frequent graph patterns. In SIGKDD ’03, pages 286–295, 2003.
[37] Z. Zeng, J. Wang, L. Zhou, and G. Karypis. Out-of-core coherent closed quasi-clique mining from large dense graph databases. ACM TODS, 32(2), 2007.