TraceSim: A Method for Calculating Stack Trace Similarity
Abstract—Many contemporary software products have subsystems for automatic crash reporting. However, it is well-known that the same bug can produce slightly different reports. To manage this problem, reports are usually grouped, often manually by developers. Manual triaging, however, becomes infeasible for products that have large userbases, which is the reason for many different approaches to automating this task. Moreover, it is important to improve the quality of triaging due to the big volume of reports that needs to be processed properly. Therefore, even a relatively small improvement could play a significant role in the overall accuracy of report bucketing. The majority of existing studies use some kind of a stack trace similarity metric, based either on information retrieval techniques or on string matching methods. However, it should be stressed that the quality of triaging is still insufficient.

In this paper, we describe TraceSim — a novel approach to address this problem which combines TF-IDF, Levenshtein distance, and machine learning to construct a similarity metric. Our metric has been implemented inside an industrial-grade report triaging system. The evaluation on a manually labeled dataset shows significantly better results compared to baseline approaches.

Index Terms—Crash Reports, Duplicate Bug Report, Duplicate Crash Report, Crash Report Deduplication, Information Retrieval, Software Engineering, Automatic Crash Reporting, Deduplication, Crash Stack, Stack Trace, Automatic Problem Reporting Tools, Software Repositories.

I. INTRODUCTION

Systems for collecting and processing bug feedback are nearly ubiquitous in software development companies. However, writing bug reports may require substantial effort from users. Therefore, in order to reduce this effort, a way to create such reports automatically is implemented in most widely used products. In most cases, the information available at the time of the crash, i.e. the stack trace, is used to form a report.

The drawback of this approach is the huge number of generated reports, the majority of which are duplicates. For example, the study [15] describes WER — the system used by Microsoft to manage crash reports. This system had collected billions of reports from 1999 to 2009. Another example is the Mozilla Firefox browser: according to the study [8], in 2016 Firefox was receiving 2.2 million crash reports a day.

It was demonstrated [12] that correct automatic assignment has a positive impact on the bug fixing process. Bugs whose reports were correctly assigned to a single bucket are fixed quicker, while, on the other hand, bugs with reports that were "spread" over several buckets take a longer time to fix.

Thus, the problem of automatic handling of duplicate crash reports is relevant for both academia and industry. There is already a large body of work in this research area, and summarizing it is not easy, since different studies employ different problem formulations. However, the two most popular tasks concerning automatically created bug reports are:

1) for a given report, find similar reports in a database and rank them by the likelihood of belonging to the same bug (ranked report retrieval) [7], [23];
2) distribute a given set of reports into buckets (report clusterization) [14].

For both of these tasks, defining a good similarity measure is a must, since the quality of the output largely depends on it. Moreover, it is important to improve similarity algorithms carefully due to the big volume of reports that needs to be processed properly. Even a relatively small improvement could play a significant role in increasing the quality of report bucketing.

In this paper, we address the problem of computing the similarity of two stack traces. The majority of deduplication studies can be classified into two groups: based either on TF-IDF or on stack trace structure. The former use an information retrieval approach, while the latter employ string matching algorithms (such as edit distance) to compute stack trace similarity. However, to the best of our knowledge, there are no studies that offer a proper, non-naive combination of these two approaches. Such a combination may result in a superior quality of bucketing and may significantly outperform any method that belongs to either of these individual groups. To substantiate the importance of this idea, we would like to quote Campbell et al. [8]: "a technique based on TF-IDF that also incorporates information about the order of frames on the stack would likely outperform many of the presented methods...".

At the same time, machine learning (ML) has rarely been applied to this domain: the majority of existing similarity calculation methods do not rely on ML techniques. The reason behind this is the fact that classic (non-ML) methods are more robust and stable than ML ones, which is very important for the considered task. Therefore, our idea is to use classic approaches as the basis.
However, ML methods are more flexible, and their application has allowed to achieve substantial results in many areas of software engineering. In this particular problem, moderately employing ML allows us to efficiently integrate both classic approaches. Therefore, we believe that combining all three approaches allows us to design a superior similarity function.

The contribution of this paper is TraceSim — the first algorithm for computing stack trace similarity that structurally combines TF-IDF [29] and string distance while using machine learning to improve quality.

We validate our algorithm using a real-life database of crash reports collected for JetBrains products.

II. BACKGROUND

A. Stack traces

When a contemporary application crashes, it generates a crash report with the following information: application details, environment details, and crash location. In this paper, we are going to examine Java exceptions only. Crash location information is represented by a stack trace — a snapshot of the application call stack that was active at the time of the crash. For each entry of the call stack, its qualifier and the line number where a function was called or an error was raised are recorded and stored in the stack trace. The first frame of the stack trace corresponds to the top of the call stack, i.e. to the exact method where the error was raised. Next, there is a sequence of frames which correspond to the other methods from the call stack. These go up to the "main" function or thread entry function. We will denote a stack trace that contains N frames as ST = f_0, ..., f_{N-1}. An example of a crash report is presented in Fig. 1.

  Date: 2016-01-20T22:11:48.834Z
  Product: XXXXXXXXXXXX
  Version: 144.3143
  Action: null
  OS: Mac OS X
  Java: Oracle Corporation 1.8.0_40-release
  Message: new child is an ancestor

  java.lang.IllegalArgumentException: new child is an ancestor
      at javax.swing.tree.DefaultMutableTreeNode.insert(DefaultMutableTreeNode.java:179)
      at javax.swing.tree.DefaultMutableTreeNode.add(DefaultMutableTreeNode.java:411)
      at com.openapi.application.impl.ApplicationImpl$8.run(ApplicationImpl.java:374)
      .....
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      at org.ide.PooledThreadExecutor$2$1.run ....

Fig. 1: An example of a crash report.
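For concreteness, the frame sequence ST = f_0, ..., f_{N-1} can be extracted from a raw Java stack trace such as the one in Fig. 1 with a few lines of code. The following Python sketch is illustrative only; the regular expression and the function name are assumptions made here and are not part of TraceSim:

import re
from typing import List

# Matches frames of the form "at package.Class.method(File.java:123)", as in Fig. 1.
_FRAME_RE = re.compile(r"^\s*at\s+(?P<qualifier>[\w$.<>]+)\((?P<location>[^)]*)\)")

def parse_frames(stack_trace_text: str) -> List[str]:
    """Return the frame qualifiers f_0, ..., f_{N-1}, top of the call stack first."""
    frames = []
    for line in stack_trace_text.splitlines():
        match = _FRAME_RE.match(line)
        if match:
            frames.append(match.group("qualifier"))
    return frames

example = (
    "java.lang.IllegalArgumentException: new child is an ancestor\n"
    "    at javax.swing.tree.DefaultMutableTreeNode.insert(DefaultMutableTreeNode.java:179)\n"
    "    at javax.swing.tree.DefaultMutableTreeNode.add(DefaultMutableTreeNode.java:411)\n"
)
print(parse_frames(example))
# ['javax.swing.tree.DefaultMutableTreeNode.insert', 'javax.swing.tree.DefaultMutableTreeNode.add']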
B. Crash Reports, Stack Traces and Software Quality

Currently, systems that automatically collect crash stack traces from remote instances are very popular for mass-deployed applications. Prominent examples of such systems are Mozilla Socorro, LibreOffice Crash Reports, the Google Chromium crash reporting system, Windows Error Reporting [10], [15], and many others. These systems are not a substitute for traditional bug trackers, but an addition to them. They are tightly integrated with bug trackers in order to link stack traces to existing bugs, to form new bugs out of a collection of stack traces, and so on.

Having this kind of system allows to obtain bug feedback without requiring users to form and submit "classic" bug reports. This, in turn, reduces the strain put on users and allows to greatly increase the amount of collected feedback, which is used to improve the software development process. Overall, the benefits are the following:

• It allows to survey the bug landscape at large at any given moment. For example, LibreOffice Crash Reports shows an aggregated view of all received reports over the last N days.
• It helps to locate a bug in the source code. Both Mozilla Socorro and LibreOffice Crash Reports are integrated with their projects' repositories. A user can click on stack frames that are attached to a bug and be transferred to the corresponding lines in the source code.
• It allows to automate bug-to-developer assignment. For example, the ClusterFuzz system allows to automatically assign a bug to a developer based on the crash location in the source code.

Crash report management is, therefore, deeply incorporated in the contemporary product development workflow.

All the above mentioned use-cases require managing harvested stack traces, which includes collecting, storing, and retrieving them. In turn, for all these operations to be efficient, it is necessary to be able to compare stack traces with respect to the bugs that spawn them.

The challenge is not only the large number of reports, but also the ubiquitous presence of exact and, more importantly, inexact duplicates. For example, our internal study found that 72% of crash reports of the IntelliJ Platform (a JetBrains product) are duplicates. Due to the large volume of data, it is necessary to have a high-quality stack trace similarity measure in order to eliminate duplicates and to group similar crash reports together. Therefore, such a measure has a great impact on ensuring the quality of the software product.
gw_{β,γ}(f_i) = sigm(β · (IDF(f_i) − γ))    (3)

6 https://github.com/hyperopt/hyperopt
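A minimal Python sketch of this global weight follows. It assumes that sigm is the standard logistic sigmoid, that IDF is computed over a corpus of stack traces treated as documents whose terms are frames (one common IDF variant; the paper's exact formula may differ), and that β and γ are hyperparameters tuned on held-out data, e.g. with hyperopt:

import math
from collections import Counter
from typing import Dict, List

def inverse_document_frequency(traces: List[List[str]]) -> Dict[str, float]:
    """IDF of every frame: stack traces play the role of documents, frames of terms."""
    doc_freq = Counter(frame for trace in traces for frame in set(trace))
    n_traces = len(traces)
    return {frame: math.log(n_traces / df) for frame, df in doc_freq.items()}

def sigm(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gw(frame: str, idf: Dict[str, float], beta: float, gamma: float) -> float:
    """Global weight of a frame, cf. Eq. (3): ubiquitous frames get weights close to 0."""
    return sigm(beta * (idf.get(frame, 0.0) - gamma))

traces = [["a.b.c", "x.y.z"], ["a.b.c", "p.q.r"], ["a.b.c", "x.y.z", "m.n.o"]]
idf = inverse_document_frequency(traces)
print(round(gw("a.b.c", idf, beta=2.0, gamma=1.0), 3))  # frame in every trace -> ~0.12
print(round(gw("m.n.o", idf, beta=2.0, gamma=1.0), 3))  # rare frame -> ~0.55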
V. EVALUATION

A. Experimental Setup

To perform the evaluation, we have used the JetBrains crash report processing system Exception Analyzer, which handles reports from various IntelliJ Platform products. Exception Analyzer receives generated reports and automatically distributes them into existing issues (buckets) or creates new issues out of them. However, it is a well-known problem that the output of automatic bug triaging tools can be of insufficient quality [10]. Therefore, Exception Analyzer allows employing user input to triage "problematic" reports. If this happens, user actions are logged and can be used later on for various purposes.

In order to evaluate our approach, we had to construct a test corpus. We adhere to the following idea: if a developer assigns a report to an issue manually, then there is a reason to think that this report is a duplicate of the ones already contained in the issue. And vice versa: if a developer extracts some reports from a given issue, it means that these reports are distinct from the remaining ones.

To construct our test corpus, we have extracted and analyzed reports from recent user action logs of Exception Analyzer spanning a one-year time frame. To create positive pairs, we analyzed user sessions and searched for the following pattern: for a particular unbucketed report, a user looks into some issue, compares it to a particular report of this issue, and then assigns it into the issue. To obtain negative pairs, we exploit a similar idea: we designate a pair as negative if a user compared reports and did not group them. Eventually, we obtained 6431 pairs, out of which 3087 were positive and 3344 were negative. We did not get too many pairs due to the fact that most reports in Exception Analyzer are grouped automatically and users rarely have to intervene. The experiments were run with an 80/20 test-train split.
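Purely as an illustration (the Exception Analyzer log format is internal and not described here), assembling such a labeled pair corpus and splitting it could look roughly as follows; all names are hypothetical:

import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LabeledPair:
    report_a: str    # stack trace of the unbucketed report
    report_b: str    # stack trace of the issue report it was compared against
    same_bug: bool   # True if the user grouped them, False otherwise

def split_pairs(pairs: List[LabeledPair], train_fraction: float = 0.8,
                seed: int = 0) -> Tuple[List[LabeledPair], List[LabeledPair]]:
    """Shuffle the labeled pairs and split them into train and test parts."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]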
B. Research Questions

RQ1: How do individual steps contribute to the overall quality of the algorithm output?

RQ2: How well does our approach perform in comparison with state-of-the-art approaches?

RQ1 evaluates the effectiveness of the individual components of our method. Since our function consists of a number of independent steps, it is necessary to check whether each of them is beneficial or not. By performing these evaluations, we demonstrate that each component is essential for our resulting similarity function. We perform several experiments for this purpose. For every experiment, we switch off the corresponding component in the full TraceSim and run it on the test corpus. We consider the following configurations: TraceSim without gw (gw(f_i) = 1 in (1)), TraceSim without lw (lw_α(f_i) = 1 in (1)), TraceSim without SOEs (without separate processing of stack overflow exceptions using the algorithm from [19]), and the full version of TraceSim.
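A minimal sketch of how this ablation can be organized is given below; the TraceSim call signature and flag names are illustrative assumptions, not the actual implementation:

from typing import Callable, Dict, List, Tuple

# Hypothetical signature: trace_sim(trace_a, trace_b, use_gw=..., use_lw=..., handle_soe=...)
ABLATIONS: Dict[str, Dict[str, bool]] = {
    "TraceSim":              {"use_gw": True,  "use_lw": True,  "handle_soe": True},
    "TraceSim without SOEs": {"use_gw": True,  "use_lw": True,  "handle_soe": False},
    "TraceSim without lw":   {"use_gw": True,  "use_lw": False, "handle_soe": True},
    "TraceSim without gw":   {"use_gw": False, "use_lw": True,  "handle_soe": True},
}

def run_ablation(trace_sim: Callable[..., float],
                 test_pairs: List[Tuple[List[str], List[str]]]) -> Dict[str, List[float]]:
    """Score every test pair with every TraceSim variant."""
    return {
        name: [trace_sim(a, b, **flags) for a, b in test_pairs]
        for name, flags in ABLATIONS.items()
    }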
RQ2 compares the resulting similarity function with the state-of-the-art approaches. First, we considered approaches that use the TF-IDF technique [19], [24]. Next, we also employed the Rebucket [10] method and its available implementation [2]; it should be noted that it belongs to both the edit distance and the supervised learning methods. We also included other edit distance methods in our baseline — Levenshtein distance [23] and Brodie et al. [7]. Another supervised method that we included in our evaluation is Moroo et al. [24], which combines the Rebucket and Lerch et al. [19] approaches.

We did not compare with the recently developed DURFEX [27] approach since it relies on tight integration with a bug tracker and requires the component and severity fields, while our approach concerns only stack traces.

Finally, we have decided to compare our approach with several classic and widely known approaches: Prefix Match and Cosine Similarity [23]. We employed two variations of the latter: Cosine Similarity with an IDF component (denoted as Cosine (IDF)) and without it (denoted as Cosine (1)).
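To make the two cosine baselines concrete, here is a rough sketch of one plausible reading: a stack trace is treated as a bag of frames, weighted either by IDF (Cosine (IDF)) or uniformly (Cosine (1)). The exact formulation used in [23] and in our implementation may differ in details:

import math
from collections import Counter
from typing import Dict, List, Optional

def cosine_similarity(trace_a: List[str], trace_b: List[str],
                      idf: Optional[Dict[str, float]] = None) -> float:
    """Cosine similarity of two stack traces viewed as bags of frames.

    With idf=None every frame has weight 1 ("Cosine (1)"); otherwise frame
    counts are multiplied by their IDF ("Cosine (IDF)").
    """
    def vector(trace: List[str]) -> Dict[str, float]:
        return {f: c * (idf.get(f, 0.0) if idf is not None else 1.0)
                for f, c in Counter(trace).items()}

    va, vb = vector(trace_a), vector(trace_b)
    dot = sum(w * vb.get(f, 0.0) for f, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity(["a.b", "c.d", "e.f"], ["a.b", "c.d", "x.y"]))  # ~0.667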
C. Evaluation Metrics

To answer RQ1 and RQ2, we evaluated how good our similarity function is. Due to the nature of our dataset, we have to use a metric applicable to binary classification. To assess the quality of our algorithm, we use the well-accepted comparison measure ROC AUC [21]. It is statistically consistent, and it is also a more discriminating measure than Precision/Recall, F-measure, and Accuracy. Several studies concerning bug report triage also employ metrics like MAP [17], Recall Rate [11], [31], and other metrics used for the ranking problem. However, in this paper we consider the binary classification task and therefore we need to use other metrics.

Turning to ROC AUC, an important observation is that 0.5 is considered the minimum result, since a simple random classifier already achieves 0.5. In our experiments, we did not use cross-validation since we have sufficient data to run a simple test/train split.

Another observation is the following: if an algorithm increases ROC AUC from 0.5 to 0.55, this increase is less significant than an increase from 0.75 to 0.8, despite the equal gain. This is the reason why we have also computed the error reduction of each algorithm (RQ2). After ranking the algorithm outputs by ROC AUC, we calculate by how many percent the error rate has been reduced in comparison to the previous algorithm. For example, the method of Brodie et al. improves on Prefix Match by 0.06 (0.64 against 0.58), and its error reduction is 0.06 · 100 / (1 − 0.58) = 14%. These numbers are presented in Table III.
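Both quantities are straightforward to compute; a small sketch using scikit-learn for ROC AUC together with the error-reduction formula above (variable names are placeholders):

from sklearn.metrics import roc_auc_score

def error_reduction(auc_current: float, auc_previous: float) -> float:
    """Percentage of the previous algorithm's error (1 - AUC) removed by the current one."""
    return (auc_current - auc_previous) * 100.0 / (1.0 - auc_previous)

# labels: 1 for duplicate pairs, 0 for non-duplicates; scores: similarity values.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.5]
print(roc_auc_score(labels, scores))    # ROC AUC on the toy data
print(error_reduction(0.64, 0.58))      # Brodie et al. vs. Prefix Match -> ~14.3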
D. Results

1) RQ1: How do individual steps contribute to the overall quality of the algorithm output?: The ROC AUC results are presented in Table II. We found out that the gw weight function, which is based on computing the global frequency of frames, makes the largest contribution (+0.1). The lw weight function, which considers the order of frames in a stack trace, contributes less (+0.03). Finally, SOEs contribute the least (+0.01), which is directly connected to the number of stack traces containing recursion (4% in our test corpus).
TABLE II: Contribution of individual steps

    Method                  ROC AUC
    TraceSim                0.79
    TraceSim without SOEs   0.78
    TraceSim without lw     0.76
    TraceSim without gw     0.69

TABLE III: Comparison with other approaches

    Similarity           ROC AUC   Error red.
    TraceSim             0.79      13%
    Moroo et al. [24]    0.76      0%
    Lerch [19]           0.76      11%
    Cosine (IDF)         0.73      10%
    Rebucket [10]        0.70      6%
    Cosine (1)           0.68      0%
    Levenshtein [23]     0.68      11%
    Brodie et al. [7]    0.64      14%
    Prefix Match [23]    0.58      -
2) RQ2: How well does our approach perform in comparison to state-of-the-art approaches?: The ROC AUC results are presented in Table III. Our method turned out to be superior to all the others. Our contribution is significant: we have improved by +0.03 compared to the existing algorithm with the best result on our dataset, while it should be noted that the improvements of almost all other algorithms lie between +0.003 and +0.06. Furthermore, our algorithm provides an error reduction of 13%, and only Brodie et al. provides more.

VI. CONCLUSION

In this paper, we have proposed a novel approach to calculating stack trace similarity that combines TF-IDF and Levenshtein distance. The former is used to "demote" frequently encountered frames via an IDF analogue for stack frames, while the latter allows to account for differences not only in individual frames, but also in their depth. At the same time, the employed machine learning allowed us to efficiently combine the two classic approaches.

To evaluate our approach, we have implemented it inside an industrial-grade report triaging system used by JetBrains. The approach has been employed for over 6 months, receiving positive feedback from developers and managers, who reported that the quality of bucketing had improved. Our experiments have shown that our method outperforms the existing approaches. It should be noted that even a relatively small improvement plays a significant role in the quality of report bucketing due to the large overall report volume.

REFERENCES

[1] Elasticsearch. https://www.elastic.co/products/elasticsearch. Accessed: 2019-11-08.
[2] Implementation of Rebucket. https://github.com/ZhangShurong/rebucket. Accessed: 2019-11-08.
[3] TraceSim implementation. https://github.com/traceSimSubmission/trace-sim. Accessed: 2019-11-08.
[4] K. Bartz, J. W. Stokes, J. C. Platt, R. Kivett, D. Grant, S. Calinoiu, and G. Loihle. Finding similar failures using callstack similarity. SysML'08, pages 1–6, Berkeley, CA, USA, 2008. USENIX Association.
[5] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. NIPS'11, pages 2546–2554, 2011.
[6] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML'13, pages I-115–I-123. JMLR.org, 2013.
[7] M. Brodie, Sheng Ma, G. Lohman, L. Mignet, N. Modani, M. Wilding, J. Champlin, and P. Sohn. Quickly finding known software problems via automated symptom matching. ICAC'05, pages 101–110, June 2005.
[8] J. C. Campbell, E. A. Santos, and A. Hindle. The unreasonable effectiveness of traditional information retrieval in crash report deduplication. MSR '16, pages 269–280, New York, NY, USA, 2016. ACM.
[9] Marc Claesen and Bart De Moor. Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127, 2015.
[10] Y. Dang, R. Wu, H. Zhang, D. Zhang, and P. Nobel. Rebucket: A method for clustering duplicate crash reports based on call stack similarity. ICSE '12, pages 1084–1093, Piscataway, NJ, USA, 2012. IEEE Press.
[11] J. Deshmukh, K. M. Annervaz, S. Podder, S. Sengupta, and N. Dubash. Towards accurate duplicate bug retrieval using deep learning techniques. ICSME '17, pages 115–124, 2017.
[12] T. Dhaliwal, F. Khomh, and Y. Zou. Classifying field crash reports for fixing bugs: A case study of Mozilla Firefox. ICSM '11, pages 333–342, Washington, DC, USA, 2011. IEEE Computer Society.
[13] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
[14] M. A. Ghafoor and J. H. Siddiqui. Cross platform bug correlation using stack traces. FIT '16, pages 199–204, Dec 2016.
[15] K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols, D. Grant, G. Loihle, and G. Hunt. Debugging in the (very) large: Ten years of implementation and experience. SOSP '09, pages 103–116, New York, NY, USA, 2009. ACM.
[16] Abram Hindle and Curtis Onuczko. Preventing duplicate bug reports by continuously querying bug reports. Empirical Software Engineering, Aug 2018.
[17] Abram Hindle and Curtis Onuczko. Preventing duplicate bug reports by continuously querying bug reports. Empirical Softw. Engg., 24(2):902–936, April 2019.
[18] S. Kim, T. Zimmermann, and N. Nagappan. Crash graphs: An aggregated view of multiple crashes to improve crash triage. DSN '11, pages 486–493, June 2011.
[19] Johannes Lerch and Mira Mezini. Finding duplicates of your yet unwritten bug report. CSMR '13, pages 69–78. IEEE Comp. Soc., 2013.
[20] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[21] C. X. Ling, J. Huang, and H. Zhang. AUC: A statistically consistent and more discriminating measure than accuracy. IJCAI'03, pages 519–524.
[22] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[23] N. Modani, R. Gupta, G. Lohman, T. Syeda-Mahmood, and L. Mignet. Automatically identifying known software problems. ICDEW '07, pages 433–441, Washington, DC, USA, 2007. IEEE Computer Society.
[24] A. Moroo, A. Aizawa, and T. Hamamoto. Reranking-based crash report deduplication. In X. He, editor, SEKE '17, pages 507–510, 2017.
[25] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[26] M. S. Rakha, C. Bezemer, and A. E. Hassan. Revisiting the performance evaluation of automated approaches for the retrieval of duplicate issue reports. IEEE Trans. on Soft. Eng., 44(12):1245–1268, Dec 2018.
[27] K. K. Sabor, A. Hamou-Lhadj, and A. Larsson. DURFEX: A feature extraction technique for efficient detection of duplicate bug reports. ICSQRS '17, pages 240–250, 2017.
[28] A. Schroter, A. Schröter, N. Bettenburg, and R. Premraj. Do stack traces help developers fix bugs? MSR '10, pages 118–121, May 2010.
[29] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[30] C. Sun, D. Lo, S. Khoo, and J. Jiang. Towards more accurate retrieval of duplicate bug reports. ASE '11, pages 253–262, Nov 2011.
[31] C. Sun, D. Lo, X. Wang, J. Jiang, and S. Khoo. A discriminative model approach for accurate duplicate bug report retrieval. ICSE '10, pages 45–54, 2010.
[32] R. Wu, H. Zhang, S. Cheung, and S. Kim. CrashLocator: Locating crashing faults based on crash stacks. ISSTA '14, pages 204–214, 2014.