0% found this document useful (0 votes)
19 views13 pages

Exploring Software Reusability Metrics With Q&A Forum Data

Uploaded by

Trust Mary
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views13 pages

Exploring Software Reusability Metrics With Q&A Forum Data

Uploaded by

Trust Mary
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

The Journal of Systems & Software 168 (2020) 110652

Contents lists available at ScienceDirect

The Journal of Systems & Software


journal homepage: www.elsevier.com/locate/jss

Exploring software reusability metrics with Q&A forum data


Matthew T. Patrick
Department of Dermatology, University of Michigan, Ann Arbor MI, United States of America

article info a b s t r a c t

Article history: Question and answer (Q&A) forums contain valuable information regarding software reuse, but
Received 22 September 2019 they can be challenging to analyze due to their unstructured free text. Here we introduce a new
Received in revised form 10 May 2020 approach (LANLAN), using word embeddings and machine learning, to harness information available
Accepted 18 May 2020
in StackOverflow. Specifically, we consider two different kinds of user communication describing
Available online 20 May 2020
difficulties encountered in software reuse: ‘problem reports’ point to potential defects, while ‘support
Keywords: requests’ ask for clarification on software usage. Word embeddings were trained on 1.6 billion tokens
Software reuse from StackOverflow and applied to identify which Q&A forum messages (from two large open source
Reusability projects: Eclipse and Bioconductor) correspond to problem reports or support requests. LANLAN
Text mining achieved an area under the receiver operator curve (AUROC) of over 0.9; it can be used to explore
StackOverflow the relationship between software reusability metrics and difficulties encountered by users, as well
Machine learning as predict the number of difficulties users will face in the future. Q&A forum data can help improve
understanding of software reuse, and may be harnessed as an additional resource to evaluate software
reusability metrics.
© 2020 Elsevier Inc. All rights reserved.

1. Introduction costs (the potential for errors and bugs in reused versus newly
developed software). We hypothesize the direct costs of soft-
Software reuse is an important strategy for decreasing devel- ware reuse are likely to depend on its understandability (i.e. the
opment costs and increasing productivity, as well as avoiding software interface), while the indirect costs may be associated
defects and improving software quality (Mohagheghi and Con- with its complexity (under the assumption that more complex
radi, 2007). It was originally envisaged as a way to make software
software is more likely to go wrong).
development more efficient through modular components that
To investigate our hypothesis, we introduce a new approach
can be used over and over again in mass production (McIlroy,
1969), rather than rewriting functionality that already exists, as (LANLAN: Lexical ANalysis for LAbelling iNquiries) that extracts
was (and is) common practice. Nevertheless, there is a cost to information from question and answer (Q&A) forums. LANLAN
software reuse, as it is necessary to develop and maintain ‘glue classifies questions into ‘problem reports’ (indicating possible
code’ that connects the reusable component with the software defects) and ‘support requests’ (asking for help in understanding
under development (Svahnberg and Gorschek, 2017). There is also how to use the software). Software that has a lot of support
a concern that software written by other people may contain requests demonstrates direct costs, since users/reusers have dif-
unknown bugs, such that it is difficult to ensure the quality of ficulty applying it, while software that has many problem reports
applications constructed from reused components. may be more likely to harbor bugs (i.e. indirect costs). By ap-
The potential for bringing existing software components and plying statistical techniques to test the association between Q&A
knowledge to a new project depends on their ‘reusability’ (Frakes
messages and features derived from static analysis relating to
and Kang, 2005). Various reusability metrics have been sug-
complexity and understandabilty, we hope to be able to explore
gested (Ampatzoglou et al., 2018), based on factors ranging from
the software’s complexity (structural code quality, dependencies, the relationship between problem reports/support requests and
size etc.) through to its understandability (interface complexity software reuse.
and documentation). Previous researchers (Lemley and O’Brien, In early research, data about problems experienced during
1997) have considered software reuse in terms of direct costs software development and reuse was expensive or difficult to
(integrating/adapting existing software components in the new obtain, being primarily extracted from corporate testing activi-
application, versus rewriting them from scratch) and indirect ties (Endres, 1975) or classified military records (Goel and Oku-
moto, 1979). By contrast, the rise of open source software has
E-mail address: [email protected]. made data publicly available for mining (Antoniol et al., 2004):

https://doi.org/10.1016/j.jss.2020.110652
0164-1212/© 2020 Elsevier Inc. All rights reserved.
2 M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652

version control repositories (such as GitHub1 ) contain informa- 2. Background and related work
tion about changes made and the reasons for making them,
whilst bug tracking databases (e.g. Bugzilla2 ) record observed Q&A forum mining has frequently been applied to analyze
failures along with attempts to identify and address their cause. user behavior, from early research into Usenet (Whittaker et al.,
Researchers have applied various metrics (lines of code, coupling, 1998), through to more recent investigations of contributor mo-
churn etc. Radjenović et al., 2013) to analyze this data, and tivations (Treude et al., 2011), collective knowledge (Anderson
machine learning algorithms (e.g. SVM or Random Forest Bowes et al., 2012) and the effectiveness of code examples (Nasehi et al.,
et al., 2017) have been used in an attempt to improve software 2012) in StackOverflow. Machine learning techniques have also
been applied to make predictions from this data. For example,
quality (and hence reusability).
Yang et al. (2011) applied various classifiers to predict which
Techniques which aim to improve software quality include
questions will remain unanswered, whereas Zhang et al. (2015a)
those which direct developers towards specific parts of their code
used Latent Dirichlet Allocation (a topic modeling approach) to
more likely to contain defects (Bowes et al., 2017; Hall et al.,
predict duplicate questions. Q&A forum mining has been used
2012) or model the overall quality and health of a software to assist software developers in an IDE prompter for Java (Pon-
project (Jansen, 2014; Franco-Bedoya et al., 2014), but evalu- zanelli et al., 2014) and an interactive programming tool for
ation of these techniques depends on the quality and size of Python (Rong et al., 2016). In common with these studies, we
available data. Bug report and version control repositories are apply machine learning techniques to Q&A forum data. However,
often affected by various biases (Nguyen et al., 2010). For ex- as far as we are aware, our paper represents the first attempt
ample, experienced developers are more likely to submit bug to use data mined from Q&A forums to predict difficulties faced
reports, whilst novice users often feel discouraged to contribute during software reuse.
for fear of condescension (Lotufo et al., 2012). Bug reports can LANLAN extracts useful information by combining Q&A forum
sometimes contain contradictory claims or be impossible to re- data with other features, e.g. the GitHub repository. GitHub is
produce (Schugerl et al., 2008; Sun, 2011). For example, in one often used in repository mining, due to its large size and acces-
study, 40% of files marked as defective in five open source projects sibility through an open API (Kagdi et al., 2007). For example,
never actually contained a bug (Herzig et al., 2013). Q&A forums Ray et al. (2014) used GitHub to explore the relationship between
have their own biases and accuracy issues, since they depend programming language and code quality, and Zanetti et al. (2013)
on how users express their questions. However, by combining applied network analysis and machine learning to predict the
multiple sources of data together, we should be able to improve quality of bug reports. Zhang et al. (2015b) used topic models
to predict the interest and experience of developers as related
the robustness of our analyses when evaluating effective metrics
to specific bug reports, assigning the most appropriate developer
for software reusability.
to fix a particular bug. In software ecosystem research (Manikas,
Community-driven resources, such as mailing lists and Q&A
2016; Mens et al., 2014), software projects are compared to nat-
forums, allow users to describe problems and work together
ural ecosystems, modeling their development using techniques
to fix them (Abdalkareem et al., 2017). Issues are frequently normally applied in ecology or evolutionary theory. We also
described within these resources without being reported in any adapt techniques typically used in the natural sciences (growth
other database. For example, Bachmann et al. (2010) observed modeling and association analysis) to interpret the data we have
16% of defects in the Apache web server were addressed in the collected.
software’s mailing list instead of its bug tracking system. Q&A Zeller (2013) discussed the challenges involved in mining soft-
forums also contain information about software developer/user ware repositories further. For example, it can often be difficult
communities and their interaction (Vasilescu et al., 2013), which to distinguish fixes from other changes, such as those that add
might be helpful for understanding the social dynamics of soft- new features or refactor the code. Linking repositories to a bug
ware reuse. However, it can be difficult to derive meaningful database can help identify which changes relate to bugs, but
categorizations from the unstructured text in social media, due even when bug databases are used, a large proportion of fixes
to subtle nuances of communication and natural language (Wang are not recorded in them. For the Eclipse project, less than half
et al., 2019). In this paper, we propose a new approach (LAN- of fixes could be linked to an entry in the bug database (Bird
LAN) to mine information directly from existing Q&A forums and et al., 2009). Zeller (2013) argues software repository mining is
classify posts automatically using statistical and machine learning useful despite these issues, but it should be augmented by seeking
techniques from the field of natural language processing. input from project insiders or using approaches such as keyword
matching (to predict bug fixes from other changes). We augment
We evaluate LANLAN on two large open source projects
repository mining with information from Q&A forums and show
(Eclipse and Bioconductor) through cross-validation and testing
our machine learning approach to be more effective than simple
on different software from which the model was trained. We
keyword matching.
apply novel approaches (association analysis and growth curve
Central to our approach are numerical representations of
modeling) to interpret the results and find key differences be- words, known as embeddings (Goth, 2016), that take inspiration
tween the occurrence of problem reports and support requests from ordinary language philosophy (Wittgenstein, 1953) and
that may be useful in improving the reusabilty of software. structuralist linguistics (Firth, 1957). Word embeddings capture
The remainder of this paper is organized as follows: Section 2 the semantics of words from a corpus according to their context
introduces the background and related work, Section 3 explains (i.e. the words that surround them) (Goth, 2016). Information
our approach, Section 4 describes our evaluation procedures, is distributed among a small (fixed) number of weights, with
Section 5 provides the results and discussion, Section 6 explores the assignment of values to these weights providing a distinct
the threats to validity, Section 7 presents our conclusions, and vector (and therefore semantics) for each word. A key advantage
Section 8 lists the code availability. of word embeddings (compared to other natural language pro-
cessing techniques, such as named entity recognition, or sentence
parsing) is that they provide a uniform representation, which can
1 GitHub: https://github.com/. easily be used to train advanced machine learning models (e.g. for
2 Bugzilla: https://www.bugzilla.org. sentiment analysis Giatsoglou et al., 2017).
M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652 3

The earliest word embedding approaches used global factor- 3.1. Training word embeddings
ization. For example, Latent Semantic Analysis (LSA) (Deerwester
et al., 1990) constructs a matrix of counts for the number of We trained word embeddings on questions submitted to Stack-
times words occur in each document, then applies Singular Value Overflow, consisting of 1.6 billion tokens, with a vocabulary of 0.6
Decomposition (SVD) to factorize it into vectors for each word. million unique words. StackOverflow questions were downloaded
Global factorization is a coarse-grained approach for modeling se- from the Stack Exchange Data Dump3 and then parsed using
mantics, and is especially limited if the documents being analyzed HTMLParser in Python to remove the XML tags. Prior to training,
are large. More detailed information can be obtained through the each word was tokenized and transformed into lower case. We
analysis of local context (i.e. words that occur near each other), then removed all characters not in the roman alphabet or specific
punctuation marks (full stops, question marks or exclamation
for example the skip-gram approach (Mikolov et al., 2013), which
marks). All numbers (regardless of length) were replaced by the
was previously applied to documentation from the Java Develop-
token ‘0’ (so as to avoid creating a separate embedding for each
ment Kit for code retrieval (Nguyen et al., 2017). However, there
individual number, and to treat the presence of any number as
is a danger predicting the context of one word at a time will
the feature we wish to encode) and code blocks were replaced by
miss information available through global statistics. We aim to the token ‘<code>’. We also transformed all types of exception
find a middle ground between these two strategies using Global and error (e.g. NullPointerException) into the words ‘exception’
Vectors for Word Representation (GloVe) (Pennington et al., 2014) and ‘error’, to ensure LANLAN can easily be transferred to other
to incorporate data at both the global and local scale. To the best datasets (which may use different exception and error types).
of our knowledge, this paper represents the first time GloVe has GloVe generates two sets of word embeddings (w and w̃ ) as a
been applied to the field of software engineering. result of the matrix factorization (see Fig. 2). The embeddings are
Global Vectors for Word Representation (GloVe) (Pennington optimized by learning bias terms (bi and b̃j ) for each set, such that
et al., 2014) has been used on tasks as diverse as annotating the difference between the log of the original word–word matrix
videos from free text descriptions (Hendricks et al., 2018), to (X ) and the matrix reconstructed from the embeddings and bias
identifying implicit human bias/stereotyping (Greenwald, 2017). terms is as small as possible, i.e. the error term (ϵ ) is reduced. This
It takes advantage of local information (by counting word co- approach can be represented as an optimization function (Eq. (1)),
occurrences in their local context) as well as global (i.e. ag- and once the word embedding sets (wi and w̃j ) are optimized,
gregated) statistics. By contrast with the skipgram technique, they are added together to improve their accuracy. Furthermore,
which predicts words from their context one at a time, GloVe a weighting function [f (x) = (x/xmax )α if x < xmax , 1 otherwise] is
uses a highly parallelizable matrix factorization approach. How- applied to avoid learning only from common word pairs (where
ever, instead of factorizing a global document-word count ma- xmax = 100 and α = 34 ); for more information see Pennington
et al. (2014). In our experiments, we trained word embeddings
trix (as with LSA), GloVe factorizes a matrix of word–word co-
as 200-dimensional vectors and used the default window size of
occurrences (Xij ), produced using a sliding window.
15 words, because these settings were found to be effective in
LANLAN identifies features that may be indicative of difficul-
previous research (Pennington et al., 2014).
ties in software reuse, because they are associated with support
V
requests or problem reports. Opinion differs as to the effec- ∑
tiveness of using features of the software to improve quality f (Xij )(wiT w̃j + bi + b̃j − log(Xij ))2 = ϵ (1)
(i.e. static analysis): Rahman et al. (2014) suggested static analysis i,j=1

can complement statistical defect prediction, whereas Johnson


3.2. Classifying Q&A forum posts using word embeddings
et al. (2013) suggested it is underused in practice, due to prob-
lems with false positives. One recent work in this area (Ray
Our aim is to use the information contained within word
et al., 2016) likened the characteristics of program code to nat-
embeddings to classify forum posts into support requests and
ural language, and suggested entropy measures may be used to
problem reports. Questions are pre-processed in the same way
predict software defects. We also utilize techniques from natural as the training dataset, except code, numeric and punctuation
language processing in our research, but apply them to Q&A tokens are removed (these tokens are used only for context in
forum messages rather than the code itself. Although we apply training and not to evaluate question semantics). Each question
LANLAN to analyzing software reusability, it also has potential for consists of a title and a body: we produced a set of features for
the development process as a whole. these components by calculating the mean embedding from the
words they contain. This approach has previously been found to
be effective at comparing the similarities between short texts
3. Our approach
(similar to our Q&A forum questions) (Kenter and de Rijke, 2015).
Unlike taking the sum of values in word embeddings, the mean
Fig. 1 shows the data flow for our approach introduced in this is not influenced by the length of the text. Since each word
paper for classifying Q&A forum posts (LANLAN). In particular, we embedding consists of a vector of 200 numerical values, this gives
aim to distinguish questions that indicate a potential defect in the us 400 features for each question.
software (e.g. ‘‘I think there may be a bug in...’’) from those asking LANLAN uses these features to distinguish questions asking for
for help in achieving their goals for reuse (e.g. ‘‘please could you clarification on software usage (support requests) from those re-
tell me how to...’’). We call the first category problem reports ferring to potential defects in the software (problem reports), by
and the second category support requests. Word embeddings are applying a variety of classification algorithms through a machine
trained using a large corpus of text from Q&A forum messages. learning framework in R (MLR Bischl et al., 2016). Each algorithm
We then pre-process the questions asked about each program, is evaluated through stratified 10-fold cross-validation: dividing
mapping their words to the corresponding embedding, and cre- the questions at random into ten equal-sized partitions, then val-
ating features for each question. Machine learning is performed to idating a model on each partition (one at a time) after training it
produce a prediction model, then the results are analyzed using
growth curves and association analysis. 3 https://archive.org/details/stackexchange.
4 M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652

Fig. 1. Data flow diagram for LANLAN.

Fig. 2. GloVe matrix factorization.

on the remaining data. Stratification ensures the same proportion using a plot of residuals against fitted values). It is important to
of class labels are included in each (randomly selected) partition, ensure these assumptions are met for us to have confidence in
which is particularly important when class labels are imbalanced our evaluation of the significance of each property.
(i.e. most questions posted to Q&A forums do not indicate de- We use Bonferroni correction (Dunn, 1961) to address the
fects). We train LANLAN on manually annotated questions and, multiple comparisons problem (i.e. the more properties we test,
to evaluate whether our approach may be transferred to other the more likely p-values will be significant by chance). This
programs and datasets, we also train classification models on one involves dividing the standard significance threshold (0.05) by
program and then test them on others. the number of comparisons (i.e. properties) to identify those
which have a high likelihood of being significant. This is a con-
3.3. Association analysis servative measure, since some program properties are likely to
be correlated with each other. As well as applying association
Association analysis is a technique for identifying properties analysis to each property, we also identify subsets of properties
significantly correlated with a particular trait. For example, in
that are almost as descriptive of the underlying factors as the
bioinformatics it helps discover which genetic markers affect the
entire set. We do this by evaluating the multiple r 2 value of all
observable characteristics of an organism (Balding, 2006). In our
sets of properties of size five. This procedure is applied separately
work, we are interested in finding program properties (potential
for problem reports and support requests to identify the most
software reusability metrics) which could lead to an increase or
important properties for understanding the factors behind the
decrease in the number of support requests and problem reports.
number of questions in each category.
To achieve this, we fit a linear model to the data and test whether
the regression coefficients for each property are equal to zero
(using a t-statistic). The results can then be used to infer the prob- 3.4. Growth curve modeling
ability each property is significantly correlated with the number
of questions that report potential defects (problem reports) or ask To illustrate how the classification models produced by LAN-
for help using the software (support requests). LAN may be used to predict the rate at which support requests
Linear models assume each data point is independent (we and problem reports occur, we analyze the resulting data us-
ensured this by treating each thread in the Q&A forum as an ing growth curve models. Growth curve modeling offers a way
individual sample); the residuals (i.e. difference between the fit- to understand and compare the dynamics of problem reports
ted model and the data) should follow a normal distribution (we and support requests over the software’s lifetime. Although this
tested this using a Q-Q plot); the variance for the residuals should technique has rarely been applied in software engineering, it is
be homogeneous and there should be a linear relationship be- popular in a variety of fields, such as economics, public health,
tween the dependent and independent variables (this was tested ecology and social demography (Panik, 2014).
M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652 5

interface to the JAGS MCMC library4 ). Rather than just fitting a


curve to the data points available for each package, MCMC takes
into account prior knowledge about the distribution of each pa-
rameter. We set the prior distributions according to distributions
of fitted parameter values from nls (lognormal for κ , normal for
δ and gamma for β ). The mean of the resulting parameter values
from 100 chains (i.e. executions) of MCMC were calculated and
used for each program.

4. Evaluation

4.1. Worked examples

To evaluate the robustness of LANLAN, we selected two large


open source projects (Eclipse and Bioconductor) as worked ex-
amples. Eclipse has previously been used as the subject of studies
into software reuse (Hummel et al., 2008; Martinez et al., 2002,
2017) and Ye et al. (2014) created a database of bug reports
(BugDB) for Eclipse, which now forms part of the NASA Promise
repository.5 The programs in this project include AspectJ (an
aspect-oriented programming extension), Birt (a business intelli-
gence and reporting tool), JDT (a suite of Java development tools),
Fig. 3. Example of growth curve modeling. (For interpretation of the references SWT (a widget toolkit) and Tomcat (a web application server).
to color in this figure legend, the reader is referred to the web version of this Our second worked example, Bioconductor (Gentleman et al.,
article.)
2005), consists of a large collection of (over 1400) bioinformatics
and molecular biology software packages, written for different
purposes and by different people. However, Bioconductor pro-
Growth curve modeling attempts to identify the curve that vides a standard interface from which a wide range of statistics
most closely fits the data, by optimizing over a number of param- may be derived. This offers us the opportunity to evaluate the
eters. One way to achieve this is through least squares estimation, factors that affect the number of problem reports and support
i.e. minimizing the sum of the square distances between each requests.
data point and the curve. Fig. 3 plots two different curves against One way to divide open source software reuse (Brown and
a series of data points (in black). The red curve is an instance of an Booch, 2002) is by whether it is a pre-planned strategy of a
exponential model, while the blue curve is from a linear model. popular commercial product (e.g. the Eclipse suite) or the ad-hoc
Although some of the points lie closer to the linear (blue) model, process of finding software developed to perform a specific task
if we were to take the sum of squared distances across x and y (such as the individual software packages that make up Biocon-
(dashed lines), we would find the objective value to be smaller for ductor). By including two projects (Eclipse and Bioconductor) that
the exponential (red) model. Hence, this particular exponential are very different from each other, we should have a better idea of
model is a better fit for the data. Using this approach, we can whether LANLAN will work on a wide range of software projects.
identify not only the most appropriate family of curves, but also
their optimal parameters. We model the cumulative number of 4.2. Research questions
problem reports and support requests for each program by fitting
a generalized logistic growth model (Eq. (2)). The generalized RQ1: How accurate is LANLAN at classifying forum posts? Be-
logistic growth model is highly flexible and by changing its pa- fore we can have confidence in our technique (LANLAN),
rameter values (κ , δ and β ), can represent many different forms we need to make sure it accurately identifies questions
of growth (Panik, 2014). which refer to defects (problem reports) as opposed to
κ those asking for clarification on software usage (support
Y (t) = (2) requests). We achieve this by evaluating LANLAN on four
[1 − exp (−β (t − δ ))]−5 different programs (two from the Eclipse suite and two
Where possible we fit the growth curves through non-linear from Bioconductor). To determine whether LANLAN can
least squares estimation (using the stats package in R). The suc- reliably identify problem reports from support requests, we
cess of this technique depends, at least in part, on the start- compare the predicted categorizations against manually
ing values chosen for each parameter. We used the maximum annotated labels. This research question evaluates the steps
number of problem reports/support requests (for each program) described in Section 3.2.
as a suitable starting value for κ , because it represents the y- RQ2: Can support requests and problem reports be used to
asymptote (i.e. the number the curve will tend towards as time evaluate software reusability metrics? The number of
increases); we chose two different growth rate values for β (0.05 support requests and problem reports for each software
and 0.01) and set the starting value for δ to 0 (δ is considered package may help to provide insight into the impact of
the delay parameter). The delay parameter was set at zero as our different features on reusability. We apply association anal-
initial assumption is that the first date recorded for each software ysis to a variety of program properties, to see which cor-
is the date its usage started to grow, and the two values for the relate with each kind of post. We also look for groups of
growth rate were chosen to explore the range of possible rates properties that together are almost as representative as
at which software usage grows (some software will be adopted the entire set. By analyzing problem reports and support
quickly, whereas others take longer to become popular).
Whenever non-linear least squares estimation failed to con- 4 rjags: https://CRAN.R-project.org/package=rjags.
verge, we applied Bayesian parameter estimation (using the R 5 NASA Promise Repository: http://openscience.us/repo.
6 M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652

requests in this way, we should be able to identify program Table 1


properties that are important to consider when attempting Numbers of problem reports and support requests annotated manually.

to improve software reusability. This research question Problem Support Total


reports requests questions
evaluates the steps described in Section 3.3.
AspectJ 602 1879 2481
RQ3: Can LANLAN predict how support requests and prob-
EclipseJDT 97 637 734
lem reports will grow in the future? Given the previous edgeR 97 683 780
numbers of Q&A forum questions, it would be useful to PROcess 105 530 635
predict how many will occur in the future, since this could Total 901 3729 4630
guide developers where to focus their efforts to improve
reusability. For example, if problem reports are predicted to
grow faster than support requests, it may be more efficient
to spend time now identifying and fixing bugs, whereas
if the situation is reversed, it would be better to focus
effort on improving documentation or simplifying the soft-
ware interfaces. We evaluate the accuracy of these models
by using them to make predictions based on partial data
(i.e. up to a certain point in time) and then comparing
the results with the actual number of problem reports
and support requests subsequently observed. This research
question evaluates the steps described in Section 3.4.
RQ4: How useful is the distinction of support requests and
problem reports to potential reusers? Not only is it nec-
essary to evaluate the accuracy with which problem report-
s/support requests can be identified and predicted, but it
is important to consider whether doing so provides useful
information for software reuse. Adapting software to new
purposes can be challenging and time-consuming, so any
added expense of applying LANLAN has to be worthwhile
for the benefits it provides. To investigate this, we sampled
10 support requests and 10 problem reports at random
from AspectJ to explore in more detail. We consider the
relationship these Q&A messages may have to reuse ac-
tivities and ask what this means for the success rate or
difficulty of reusing code with many problem reports or
Fig. 4. Benchmarking 24 classifiers on AspectJ.
support requests.

4.3. Experimental setup


Section 4.3. First, we trained our prediction model on AspectJ
Word embeddings were trained with a vector size of 200, (the annotated program which has the largest number of user-
using 593,767 StackOverflow questions from the 31 August 2017 submitted questions). In our benchmark results (see Fig. 4), 19
data dump. We then extracted all questions related to each out of the 24 classification algorithms (80%) achieved an Area
program in our study individually, with the StackOverflow and Under the Receiver Operator Curve (AUROC) above 0.8 in 10-
Bioconductor API. In total, we categorized 45,093 questions for fold cross validation, highlighting the robustness of LANLAN. In
Eclipse and 23,556 questions for Bioconductor (as problem re- particular, the Support Vector Machine (SVM) had the highest
ports or support requests). Of this data, we manually annotated AUROC (0.930).
all the questions related to 4 programs: AspectJ, EclipseJDT, An alternative strategy (Zeller, 2013), previously suggested
edgeR and PROcess; this represents 4630 questions (or 7% of for distinguishing repository commits which correct faults from
the total). Each program had its questions manually annotated those that make other changes, looks for indicator keywords
three times (on separate occasions by the author), and then the in the text (specifically ‘bug’, ‘problem’, or ‘fix’). We compared
annotations were combined by consensus (see Table 1). To predict LANLAN with two different versions of this alternative: in the
whether the remaining questions were related to defects, we first version (keyword matching), we counted posted questions
benchmarked 24 different classification algorithms using cross- as problem reports if any of these keywords were present; whilst
validation in MLR. Subsequently, 29 program properties were in the second version (keyword features), we created a new
evaluated for their association with problem reports and support prediction model using the number of times keywords occur as
requests, using three covariates to control confounding factors features. As with our approach, we extracted features from the
(See Section 5.2). title and body separately, and then combined them together in
the prediction model. It is not possible to compare the AUROC of
5. Results and discussion the keyword matching approach (since it does not provide class
probabilities for each feature), but the maximum AUROC achieved
5.1. Accuracy of our approach (answer to RQ1) in a benchmark on the keyword features approach was only 0.562
(using Random Forest in 10-fold cross-validation). This suggests
LANLAN automatically predicts whether questions submitted our approach (LANLAN) to be substantially more effective than
to Q&A forums are related to defects in the software (problem the alternative.
reports) or if they request more general help/advice (support We also evaluated precision (the proportion of questions cor-
requests). To evaluate its accuracy, and answer RQ1, we compared rectly identified as problem reports) and recall (the proportion of
these predictions against the consensus annotations described in problem reports correctly identified as such). Together, these two
M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652 7

Table 2
Evaluating how well our machine learning approach (LANLAN) generalizes to
new programs.
AspectJa JDT edgeR PROcessb
AUROC 0.930 0.889 0.921 0.970
LANLAN Precision 0.810 0.659 0.842 0.919
Recall 0.720 0.577 0.330 0.752
AUROC 0.562 0.684 0.680 0.692
Keyword features Precision 0.714 0.630 0.750 0.857
Recall 0.096 0.351 0.247 0.309
AUROC NA NA NA NA
Keyword matching Precision 0.423 0.338 0.329 0.324
Recall 0.312 0.505 0.560 0.657
a
Trained and tested on the same program.
b
Trained on the previous 3 programs.

metrics represent the ability of an approach to classify questions


accurately and find the majority of problem reports. LANLAN
provides considerably higher precision and recall than both the
alternative approaches. In particular, 81% of questions LANLAN in-
dicates to be problem reports are correct (precision), as compared
to 71% for the keyword features model and only 42% for keyword
matching. Furthermore, LANLAN identifies 72% of problem re-
Fig. 5. ROC curve for first three programs.
ports correctly (recall), as opposed to 31% for keyword matching
and only 9.6% for the keyword features model. It is interesting
the keyword feature model performs better on precision, whereas
keyword matching achieves higher recall. This could be because
keyword matching includes all questions that contain the speci-
fied keywords, whereas the keyword feature model is trained in a
more sophisticated way. Crucially, LANLAN outperforms both the
alternatives on precision and recall.
To evaluate how well LANLAN generalizes to other programs,
we applied the model trained on AspectJ to the Eclipse Java De-
velopment Tools (JDT) user interface and a mathematical program
from Bioconductor (edgeR) for differential expression analysis
(see Table 2). This reduced the AUROC slightly (from 0.930 for
AspectJ to 0.889 for JDT and 0.921 for edgeR), but larger differ-
ences were seen in precision and recall. On JDT, precision was
0.659 (compared to 0.810 for AspectJ); precision was unaffected
for edgeR, but recall fell to 0.330 (compared to 0.720 for As-
pectJ). These differences are likely to be caused by variations in
the language used to communicate problems for each program.
For example, in JDT a ‘Quick Fix’ is a pop-up that helps users
with their code, so features that indicate the request for a fix
(problem report) in AspectJ may point to support requests in JDT.
Bioinformatics software is used primarily by scientists rather than
software engineers, so different vocabulary can often be used to
describe problem reports, thus reducing the recall.
We strengthened our approach by training the prediction Fig. 6. Precision/recall for PROcess.

model on a range of different programs. When trained on AspectJ,


JDT and edgeR, then tested on a different Bioconductor program
(PROcess), the AUROC, precision and recall were higher than for metrics, while allowing for the finer details of each metric to be
any individual program before. Fig. 5 shows the combined ROC explored with respect to R. For example, Chidamber and Kemerer
curve when training on AspectJ, JDT and edgeR; the precision/re- (1994) proposed evaluating the complexity of software according
call curve when testing this model on PROcess can be seen in to the number of methods per class, as well as communication
Fig. 6. Following this approach, our prediction model is accurate and inheritance between classes. Since R is not an object-oriented
on the programs for which it has been trained, as well as being language, we instead explore the number of lines of code per
robust when applied to new programs. file, as well as the dependencies (both mandatory and suggested)
between software packages. Following the suggestion of Buse
5.2. Identifying features with the most impact on reusability (answer and Weimer (2010), we include measurements of code churn as
to RQ2) a surrogate for readability, but we also consider other features,
such as comments and whitespace, as well as vignettes (separate
We applied the results of our models to evaluate the impact of documentation, illustrating examples of use).
32 features (from each software package) on the number of prob- Various metrics (number of files, functions, blank lines, com-
lem reports and support requests in Bioconductor. The features ments and lines of code) are recorded separately for the R and
were chosen to reflect previously proposed software reusability compiled code, then static analysis reports are generated (using
8 M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652

Table 3 the number of independent paths through the program, whereas


Association analysis results comparing the number of problem reports Codetools is a modern software package for identifying potential
and support requests with various program properties (potential reusability
metrics).
problems in R code. Vignettes are documentation files in R, giving
Package Problem reports Support requests
specific examples of usage, while Imports count the number of
other packages the software requires to work and Suggests counts
Features P-value Effect size P-value Effect size
the number of packages that are suggested (but not required)
Churn.Adds 0.031 2.15 0.016 1.80
by the software. Overall, these results imply that problem re-
Churn.Dels 0.024 2.27 0.012 1.87
Churn.Adds.per.Week 0.010 2.59 0.002 2.30 ports and support requests are both affected by complexity, but
Churn.Dels.per.Week 0.017 2.40 0.005 2.10 documentation has a greater impact on support requests.
R.Files <.001 4.77 <.001 4.85 Interestingly, test coverage and churn had no significant asso-
R.Blanks <.001 4.41 <.001 3.82 ciation with problem reports and support requests, despite the
R.Comments 0.028 2.21 <.001 2.74
R.LOC <.001 5.59 <.001 5.28
intuitive link between testing and quality, and the prevalence
R.LOC.per.File 0.369 0.90 0.499 0.50 of churn in fault prediction (Hall et al., 2012). We repeated our
R.Blanks.per.LOC 0.298 −1.04 0.335 −0.72 analysis, using only packages that have >0% test coverage, but
R.Comments.per.LOC 0.682 −0.41 0.494 0.53 the p-values did not improve. One potential reason for this is that
Compiled.Files 0.247 −1.16 0.113 −1.22
CodeCov uses a simplistic coverage criteria (statement coverage),
Compiled.Blanks 0.459 −0.74 0.199 −0.99
Compiled.Comments 0.361 −0.91 0.163 −1.07 that may not be sufficient to test the software thoroughly. Also,
Compiled.LOC 0.649 −0.46 0.400 −0.65 scientific software is particularly difficult to test, because of chal-
Compiled.LOC.per.File 0.348 −0.95 0.500 −0.5 lenges in constructing a suitable test oracle (Ammann and Offutt,
Compiled.Blanks.per.LOC 0.874 0.16 0.565 −0.43 2016).
Compiled.Comments.per.LOC 0.670 0.43 0.938 0.06
Max.Cyclomatic.Complexity 0.392 0.86 0.063 1.39
In addition to applying association analysis to each feature
Mean.Cyclomatic.Complexity 0.307 −1.02 0.532 −0.47 independently, we also evaluated subsets of features together.
Count.Cyclomatic.Complexity 0.066 1.84 0.003 2.28 Fig. 7 shows the number of times each feature was included in
Total.Cyclomatic.Complexity <.001 3.87 <.001 3.93 the top 10 subsets (by multiple R2 value) out of all subsets with
Codetools.Problems <.001 3.91 <.001 2.46
5 features (118,755 in total). Interestingly, Codetools.Problems
Test.Coverage 0.642 −0.47 0.234 0.93
Goodpractice.Problems 0.235 1.19 0.171 1.02 appears in half of the top 10 subsets for problem reports, but none
Vignettes <.001 4.84 <.001 4.75 for support requests. By contrast, Total.Cyclomatic.Complexity
Depends 0.026 2.23 0.022 1.74 appears in half of the subsets for support requests, but none for
Imports <.001 6.78 <.001 5.52 problem reports. When software is more complicated (has higher
Suggests <.001 3.54 <.001 3.62
cyclomatic complexity) it is generally more difficult to under-
stand and requires more explanation (through support requests),
but that does not necessarily mean it contains more defects (if it
Codetools6 and Goodpractice7 ) from the R code, as well as the is developed well). By contrast, the Codetools library highlights
(mean, maximum and total) cyclomatic complexity. We collected poor coding style, which is more likely to indicate (and possibly
test coverage from CodeCov8 and used the GitHub API9 to count cause) defects. This finding goes against our initial hypothesis
the number of months the package has been active, its downloads that problem reports would be more closely associated with
and unique downloads (by IP address), as well as additions and complexity, while support requests would be associated with
deletions (churn). understandability, and indicates the importance of avoiding over-
Whilst some features are extracted directly from the data we complicated program structure when developing software for
collected (e.g. the total number of comments in the R code of a understandability and reusability.
particular package), other features are combined from multiple Selecting the top subset of features for problem reports
data (e.g. the number of comments per line of code). These (R.Comments, R.LOC, Codetools.Problems, Vignettes and Imports)
features were extracted to give us information about the rates and support requests (R.Files, R.Comments, Total.Cyclomatic.
and proportions of certain properties of a package, rather than Complexity, Vignettes and Imports) reduces the training data to
just their absolute value. Features were extracted using a variety 17% (from 29 features down to 5). However, 96% and 97% of
of Linux tools, such as grep, sed and awk. the R2 value was maintained, for problem reports and support
Association analysis was applied to each of the 32 features (see requests respectively. This indicates these features are a good
Table 3), to identify those with the most affect on problem reports representation of the underlying factors behind the number of
and support requests (scaled using log10 transform, to ensure a problem reports and support requests, and hence are likely to be
linear relationship). The number of months active, downloads and important for making predictions about reusability.
IP addresses were used as covariates, to find significant features
independent of the amount of usage of each package. Following 5.3. Modeling support requests and problem reports (answer to RQ3)
Bonferroni correction, 8 features were identified as significant for
problem reports and support requests (R.Files, R.Blanks, R.LOC, We applied LANLAN to model the rate of support requests
Total.Cyclomatic.Complexity, Codetools.Problems, Vignettes, Im- and problem reports for multiple programs (in Eclipse and Bio-
ports and Suggests). In addition, R.Comments was significant for conductor). Growth curves were trained on Q&A forum data
support requests. R.Files counts the number of files of R code in using non-linear least squares and Bayesian parameter estima-
the software, R.LOC does the same for the number of lines of tion. Fig. 8 shows an example of fitting a curve to the growth of
code, R.Blanks for the number of blank lines and R.Comments support requests for the ArrayExpress package in Bioconductor.
for the number of comments. Cyclomatic complexity (McCabe, In this case, the fitted parameters are 83.7 for κ , 0.036 for β and
1976) is a long established measure of code complexity, based on −23.2 for δ . The number of support requests grows exponentially
at first, but then slows down, forming a classic S-curve shape. A
6 Codetools: https://CRAN.R-project.org/package=codetools. possible explanation is that the way new software works may not
7 GoodPractice: https://github.com/mangothecat/goodpractice. be immediately clear so people submit lots of support requests,
8 CodeCov: https://codecov.io/. but as it develops to maturity, the documentation and interface
9 GitHub API: https://developer.github.com/v3/. improves, and users can look back at previous forum posts, so
M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652 9

Fig. 7. Distribution of Top 10 subsets (by multiple R2 ) of Length Five.

Table 4
Mann–Whitney tests for curve fitting.
Mean SD W P-value
Problem reports 32.4 34.9
κ 33 960 1.17 × 10−20
Support requests 87.3 124
Problem reports 0.0414 0.0283
β 16 785 9.89 × 10−5
Support requests 0.0376 0.0365
Problem reports −22.0 15.9
δ 19 297 0.0925
Support requests −30.2 34.4

BugDB (see Fig. 9), but the overall shape of the time series is
similar (being suggestive of exponential growth) and pairwise
tests of the area under the curve for each program showed no
statistically significant differences (Student’s t-test: p = 0.152;
Wilcoxon signed rank test: p = 0.188). Neither the Q&A forum nor
bug tracking database provide complete information, and both
are prone to random noise, but by combining them together we
believe a more accurate estimation of problems encountered can
be achieved.
We compared growth curves fitted for problem reports against
those for support requests. Comparing fitted parameters across
all the Bioconductor packages, we found the upper asymptote
(κ ) to be significantly lower for problem reports than support
Fig. 8. Example growth curve fitting (on ArrayExpress).
requests, but the growth rate (β ) was significantly higher (see Ta-
ble 4). This suggests that, although the number of problem reports
will ultimately be smaller, they grow more quickly; many pro-
gramming issues are identified early, whereas support requests
new support requests are not needed. Alternatively, the rate of grow as the number of users increases. Although the average
new support requests may decrease as other packages become difference in the delay parameter (δ ) is small, the distributions
more popular, but ArrayExpress is still actively used (particularly differ considerably (symmetrical for problem reports, but skewed
with the rise of single cell analysis), so this seems less likely. for support requests). Since the peak is further left for problem
We tested our predictions by training growth models on the reports, the point of inflection for most packages will come earlier
occurrence of support requests and problem reports in the first (if at all), which makes sense considering programming issues
N months since creation, then evaluated their accuracy on the are often identified early. Nevertheless, the long tail to the left of
subsequent months. As more data (months) are added, the pre- the support requests’ distribution means for some packages, the
dictions move closer to the correct value. On average, we found growth curve is monomolecular. These packages may be poorly
that (when trained on the first half of the data), our prediction written or documented (at least early in their life).
of the asymptote was only 6.2% and 8.7% away from the final
value, for support requests and problem reports respectively. As 5.4. Utility of problem reports and support requests (answer to RQ4)
a further independent evaluation, we compared our predictions
for problem reports against the BugDB database for Eclipse (Ye To answer this question, we sampled 10 problem reports (Ta-
et al., 2014). We observed some differences between the rate ble 5) and 10 support requests (Table 6) at random from AspectJ
of problem reports in Q&A forum data and software failures in and investigated them to assess their potential relationship to
10 M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652

Table 5
Problem reports sampled from AspectJ.
Paraphrased question #Comments Code Accepted Bug
/Answers Included? Answer? DB?
AspectJ Maven plugin not executed 6/3 Yes Yes No
IntelliJ not working with AspectJ 4/1 Yes No No
AspectJ not working in Kotlin 12/4 Yes No No
Problem using AspectJ in MinGW 0/1 Yes Yes No
Difficulty configuring Spring security 1/1 Yes Yes Yes
Error when using Spring Roo 2/1 No No No
Spring security mode not working 1/2 Yes Yes Yes
Lombok not working with AspectJ 1/1 Yes No No
JUnit not working with Spring 4/0 Yes No No
Pointcut not matching correctly 5/1 Yes Yes Yes

One support request10 provided a code excerpt (see below)


and asked why it was outputting ‘‘true’’ when they expected it
to output ‘‘false’’. They had assumed that since String.class is not
located inside the java.util package, it would not be found. How-
ever, the pointcut matches functions with the String.class type,
so returns ‘‘true’’. This behavior is documented, but the user who
submitted the question was new to AspectJ and became confused.
Issues such as this may be ameliorated through detailed tutorials
(or Vingettes in R), explaining how the software is expected to
behave.

public void test1() {


AspectJExpressionPointcut pointcut = new
AspectJExpressionPointcut();
pointcut.setExpression("execution(public *
java.util.*.*(..))");
System.out.println(pointcut.matches(String.class));
}

The following code excerpt was taken from a problem report


and it is intended to modify a function using two different advices
(modifying functions), but the expression did not match properly
due to a potential bug (submitted to the eclipse bug database)11 .
A workaround is provided, using two different pointcuts, but it is
unclear whether the bug has been fixed. Only two other problem
Fig. 9. Comparing the growth of predicted problem reports in eclipse from Q&A reports in the 10 we sampled are associated with an entry in the
forum data with recorded software failures from bug reports. bug database and these link to the same bug. As mentioned in
the Background and Related Work section, Q&A forum mining is
essential because many bugs are not reported in bug databases, so
software reuse. We also include the number of comments and cannot otherwise be taken into account for evaluation in software
answers, to give some indication of the response each question reuse.
triggered. Comments in StackOverflow are generally used to make Table 7 provides a summary of some key findings from the
remarks and request further clarification, whereas answers are study. It remains an open question whether projects with a large
intended to provide solutions. We also indicate if code was in- number of bug reports and support requests will be more difficult
cluded, and whether one of the answers was accepted by the to reuse. Such an investigation is outside the scope of this paper
original poster. In the case of problem reports, we also specify and is challenging to answer without subjectivity. However, we
if a link is given at some point in the thread to a record of this have shown how problem reports and support requests encapsu-
problem in a bug database. late issues which could make software reuse more difficult, and
in our sample, problem reports appear to be enriched in issues
Interestingly, all but one of the problem reports (Table 5) de-
that occur when combining multiple software together. It is not
scribe difficulties integrating AspectJ with other software. Spring
necessary for reusers to train a prediction model on the software
(a framework providing aspect-oriented programming as part
they are considering to reuse — they could instead make use
of its functionality) is frequently mentioned, along with Maven,
of the features we have shown to be associated with problem
IntelliJ, Kotlin, MinGW and Lombok. By contrast, only 2 out of the
reports and support requests; in future these may be included
10 support requests (Table 6) ask for help integrating AspectJ with in a tool for software analysis (e.g. as an Eclipse plugin). For
other software (Maven and WebLogic). Although this is a small researchers interested in repeating our work to further investi-
sample, it may suggest problem reports are more indicative of gate the issues concerning software reusability, we have found
potential issues related to software reuse. The support requests that a model trained on a small but diverse range of software is
were more often aimed at understanding a specific behavior effective at predicting problem reports and support requests in
or functionality of the software. For example, one seeks to un- other software.
derstand whether creating aspects will lead to the addition of
new classes (actually, the changes are ‘weaved’ into the existing 10 https://stackoverflow.com/questions/4468097/why-pointcut-
classes) and another asks how to find the list of classes which matchesstring-class-returns-true.
meet particular pointcut criteria (pointcuts are join points at 11 https://stackoverflow.com/questions/41129938/aspectj-pointcut-matching-
which AspectJ makes changes). arguments-args-is-not-matching-correctly.
M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652 11

@Before("(execution(public static * business.security.service.LoginManagerHelper.authenticateUser(..)) && args(


username, ..)) || "
+ "(execution(public static * webapp.util.LoginManagerAction.loginJAAS(..)) && args( *, *, username, ..))")
public void setUsername(JoinPoint jp, String username) {
// inject the username into the MDC
MDCUtils.setUsername(username);
}

Table 6 receiver operator curve (AUROC), and compared our approach


Support requests sampled from AspectJ. (LANLAN) to keyword-based alternatives. We trained and tested
Paraphrased question #Comments Code Accepted LANLAN on different programs and compared the growth of fail-
/Answers Included? Answer?
ures predicted by Q&A forum posts (StackOverflow) and bug
Do aspects create new classes? 2/2 No Yes reports (BugDB). It is possible future researchers may be able to
Sharing data with annotated method? 0/1 No No
Why does my pointcut give this result? 0/1 Yes No
improve upon our results by changing some of the settings. For
Can exception handling be nested? 1/1 Yes Yes example, we used the default window size of 15 words in GloVe,
How to find methods from pointcut? 1/1 Yes No as it has previously been shown to be effective, but adjusting this
Static initialization checks in AspectJ? 1/2 No No could increase the accuracy of prediction.
How to cancel a method execution? 1/1 Yes Yes
By focusing on the Eclipse and Bioconductor worked exam-
Help using AspectJ Maven plugin? 4/3 Yes Yes
How to use AspectJ on WebLogic? 0/2 Yes Yes ples, there is an external threat to validity that LANLAN may
What is scattering and tangling? 0/1 Yes Yes not work on other software. Indeed, we have found our model
to be sensitive to domain differences, since training it on one
software (AspectJ) and applying it to others (JDT and edgeR)
Table 7
Summary of some key findings and implications. resulted in reductions in recall (although precision and AUROC
Finding Implication Relevant to
were comparable to applying our model to the software on which
it was trained). By training our model on multiple software,
>0.9 AUROC using Effective at distinguishing Practitioners
default parameters forum posts particularly those from different domains, we found it to be more
robust when applying it to a new program (PROcess). However, a
Parameter tweaking may Researchers
suitable question for future research would be whether this strat-
have limited impact, as
performance already high egy would work if applying LANLAN to other domains. Eclipse
Robustness improved by Should train on programs Practitioners
is a widely used suite of programs for software development,
training on multiple related to subject area being the subject of previous research into defect prediction,
diverse programs and Bioconductor is a large bioinformatics project containing di-
Explore optimal training set Researchers
for general use
verse packages (from data processing to mathematical modeling
and graphical interfaces). Since our worked examples represent
>95% R2 maintained The features selected can Practitioners
using 5 features for help indicate important prominent projects from two completely different fields, we be-
problem reports/support aspects of reusability lieve they are likely to be representative of a wide range of other
requests software.
Possible to make effective Researchers
predictions using reduced
number of features 7. Conclusions
R.Comments significant Indicates the role Practitioners
for support requests but documentation plays in We illustrated an approach (LANLAN) to classify Question and
not problem reports reusability Answer (Q&A) forum posts (into support requests and problem
Growth rate higher for Risk of failure increases Practitioners reports), such that they can be used to reveal information per-
problem reports than quickly if not addressed tinent to reuse and reusability. We mined data from two large
support requests
open source projects (Eclipse and Bioconductor), chosen for their
differences in purpose as well as practices of reuse (systematic
vs. ad-hoc), increasing the likelihood LANLAN can be generalized
6. Threats to validity to a wide range of software, particularly where more traditional
resources (e.g. bug tracking databases) are unavailable. It is our
We addressed internal threats to validity using statistical met- belief that by integrating a greater variety of data and using
rics and tests: to evaluate the effectiveness of LANLAN for distin- sophisticated modeling techniques to analyze the results, the ac-
guishing problem reports and support requests, we used the area curacy of the features identified and used for analyzing software
under the receiver operator curve (AUROC), precision and recall; can be improved.
when comparing their growth, we used Mann Whitney tests, LANLAN achieved an AUROC of 0.930 in cross validation on
reporting W as an effect size in Table 4; and when exploring the a single program (AspectJ) and 0.970 AUROC when training the
features that contribute to this growth (through association anal- model on three programs and testing it on a fourth (PROcess).
ysis), we applied Bonferroni correction to avoid p-values being Growth curve analysis revealed the upper asymptote (κ ) to be
significant by random chance. We also checked the assumptions lower for problem reports (i.e. developers should expect more
of the models and tests we used (regarding the residuals and requests for clarification rather than issues with the code). How-
linearity) using plots produced by the stats package in R. ever, the β growth parameter was higher for problem reports,
Experiments were repeated multiple times to improve the confirming software reusability issues related to defects increase
robustness of our results. For example, each Q&A forum post was more quickly at the beginning of the software’s life. Cyclomatic
annotated three times and the label (problem report or support complexity was more associated with support requests, whereas
request) chosen by consensus. We evaluated the effectiveness the issues identified by Codetools were more relevant to problem
of multiple classification algorithms, using the area under the reports; complex software is more difficult to understand, but not
12 M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652

necessarily more likely to be incorrect, whereas poor coding style Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalic-
often produces defects. These findings illustrate the effectiveness chio, G., Jones, Z.M., 2016. mlr: Machine learning in r. Mach. Learn. Res. 17
(170), 1–5.
of LANLAN to classify Q&A forum posts into useful categories
Bowes, D., Hall, T., Petrić, J., 2017. Software defect prediction: Do different
(problem reports and support requests) for exploring potential classifiers find the same defects? Softw. Qual. J. 1–28.
software reusability metrics, revealing aspects of the various is- Brown, A.W., Booch, G., 2002. Reusing open-source software and practices:
sues that can make software reuse more difficult. By improving The impact of open-source on commercial vendors. In: Proc. 7th Int. Conf.
understanding of the features that affect reusability, our research Software Reuse. Springer, pp. 123–136.
Buse, R.P.L., Weimer, W.R., 2010. Learning a metric for code readability. IEEE
constitutes a first step towards the development of powerful new Trans. Softw. Eng. 36 (4), 546–558.
tools to assist software development. For example, the informa- Chidamber, S.R., Kemerer, C.F., 1994. A metrics suite for object oriented design.
tion gained from this study with regards to which metrics that are IEEE Trans. Softw. Eng. 20 (6), 476–493.
more indicative of problem reports or support requests could be Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R., 1990. Indexing
by latent semantic analysis. Expert Syst. Appl. 41 (6), 391–407.
used to automatically highlight potential reusability issues, and
Dunn, O.J., 1961. Multiple comparisons among means. J. Amer. Stat. Assoc. 56
growth models can predict how quickly problems are likely to (293), 52–64.
arise, thus guiding efficient management of focused interventions Endres, A., 1975. An analysis of errors and their causes in system programs. IEEE
to improve reusability. Nevertheless, this endeavor would require Trans. Softw. Eng. SE-1 (2), 140–149.
considerable effort and may need to be tailored to different Firth, J.R., 1957. A synopsis of linguistic theory, 1930-1955. Blackwell, Oxford.
Frakes, W.B., Kang, K., 2005. Software reuse research: Status and future. Trans.
software fields. Softw. Eng. 31 (7), 529–535.
Franco-Bedoya, O., Ameller, D., Costal, D., Franch, X., 2014. Queso: A quality
8. Code availability model for open source software ecosystems. In: Proc. 9th Int. Conf. Software
Technologies. IEEE, Washington, DC, pp. 39–62.
Gentleman, R.C., Carey, V.J., Huber, W., Irizarry, R., Dudoit, S., 2005. Bioinformat-
The following code was used in this paper and is available ics and Computational Biology Solutions Using R and Bioconductor. Springer,
from the links below: New York, NY.
GloVe: Word embedding Giatsoglou, M., Vozalis, M.G., Diamantaras, K., Vakali, A., Sarigiannidis, G.,
(https://github.com/stanfordnlp/GloVe) Chatzisavvas, K.C., 2017. Sentiment analysis leveraging emotions and word
embeddings. Expert Syst. Appl. 69, 214–224.
MLR: Machine learning
Goel, A.L., Okumoto, K., 1979. Time-dependent error-detection rate model for
(https://github.com/mlr-org/mlr/) software reliability and other performance measures. IEEE Trans. Softw. Eng.
lm: Association analysis R-28 (3), 206–211.
(https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.h Goth, G., 2016. Deep or shallow, NLP is breaking out. Commun. ACM 59 (3),
tml) 13–16.
Greenwald, A.G., 2017. An AI stereotype catcher. Science 356 (6334), 133–134.
nls: Least squares curve fitting Hall, T., Beecham, S., Bowes, D., Gary, D., Counsell, S., 2012. A systematic
(https://stat.ethz.ch/R-manual/R-devel/library/stats/html/nls.h literature review on fault prediction performance in software engineering.
tml) IEEE Trans. Softw. Eng. 38 (6), 1276–1304.
rjags: Bayesian curve fitting Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B., 2018.
Localizing moments in video with natural language. In: Proc. Int. Conf.
(https://cran.r-project.org/web/packages/rjags/index.html)
Computer Vision. pp. 1380–1390.
Herzig, K., Just, S., Zeller, A., 2013. It’s not a bug, it’s a feature: How misclassifi-
CRediT authorship contribution statement cation impacts bug prediction. In: Proc. Int. Conf. Software Engineering. pp.
392–401.
Hummel, O., Janjic, W., Atkinson, C., 2008. Code conjurer: Pulling reusable
Matthew T. Patrick: Conceptualization, Methodology, Valida-
software out of thin air. Software 25 (5), 45–52.
tion, Investigation, Data curation, Visualization. Jansen, S., 2014. Measuring the health of open source software ecosystems:
Beyond the scope of project health. Inf. Softw. Technol. 56, 1508–1519.
Declaration of competing interest Johnson, B., Soong, Y., Murphy-Hill, E., Bowdidge, R., 2013. Why don’t software
developers use static analysis tools to find bugs? In: Proc. 35th Int. Conf.
Software Engineering. ACM, New York, NY, pp. 672–681.
The authors declare that they have no known competing finan- Kagdi, H., Collard, M.L., Maletic, J.I., 2007. A survey and taxonomy of approaches
cial interests or personal relationships that could have appeared for mining software repositories in the context of software evolution. J.
to influence the work reported in this paper. Softw.: Evol. Process 19, 77–131.
Kenter, T., de Rijke, M., 2015. Short text similarity with word embeddings. In:
Proc. Int. Conf. Information and Knowledge Management.
References Lemley, M., O’Brien, D., 1997. Encouraging software reuse. Stanf. Law Rev. 49
(2), 255–304.
Abdalkareem, R., Shihab, E., Rilling, J., 2017. What do developers use the crowd Lotufo, R., Passos, L., Krzysztof, C., 2012. Towards improving bug tracking systems
for? A study using Stack Overflow. IEEE Softw. 34 (2), 53–60. with game mechanisms. In: Working Conf. Mining Software Repositories. pp.
Ammann, P., Offutt, J., 2016. Introduction to Software Testing. Cambridge 2–11.
University Press, Cambridge, United Kingdom. Manikas, K., 2016. Revisiting software ecosystems research: A longitudinal
Ampatzoglou, A., Bibi, S., Chatzigeorgiou, A., Avgeriou, P., Stamelos, I., 2018. literature study. J. Syst. Softw. 117, 84–103.
Reusability index: A measure for assessing software assets reusability. In: Martinez, J., Ziadi, T., Bissyandé, T.F., Klein, J., Traon, Y.L., 2017. Bottom-up
Proc. 17th Int. Conf. Software Reuse. Springer, pp. 43–58. technologies for reuse: Automated extractive adoption of software product
Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J., 2012. Discovering lines. In: Proc. 39th Int. Conf. Software Engineering. IEEE, pp. 67–70.
value from community activity on focused question answering sites: A case Martinez, J., Ziadi, T., Papadakis, M., Bissyandé, T.F., Klein, J., Traon, Y.L., 2002.
study of Stack Overflow. In: Proc. 18th ACM SIGKDD Int. Conf. Knowledge Feature location benchmark for software families using eclipse community
Discovery and Data Mining. ACM, New York, NY, pp. 850–858. releases. In: Proc. 15th Int. Conf. Software Reuse. Springer, pp. 267–283.
Antoniol, G., Gall, H., Di Penta, M., Pinzger, M., 2004. Mozilla: Closing the Circle. McCabe, T.J., 1976. A complexity measure. IEEE Trans. Softw. Eng. 2 (4), 308–320.
Tech. Rep. TUV-1841-2004-05, Technical University of Vienna. McIlroy, M., 1969. Mass produced software components. In: Naur, P., Randell, B.
Bachmann, A., Bird, C., Rahman, F., Devanbu, P., Bernstein, A., 2010. The missing (Eds.), Nato Science Committee NATO. Scientific Affairs Division NATO,
links: Bugs and bug-fix commits. In: Int. Symp. Foundations Software Belgium, pp. 1–136.
Engineering. pp. 97–106. Mens, T., Claes, M., Grosjean, P., 2014. ECOS: Ecological studies of open
Balding, D.J., 2006. A tutorial on statistical methods for population association source software ecosystems. In: Proc. IEEE Conf. Software Maintenance,
studies. Nature Rev. Genet. 7, 781–791. Reengineering, and Reverse Engineering. IEEE, Washington, DC, pp. 403–406.
Bird, C., Bachmann, A., Aune, E., Duffy, J., Bernstein, A., Filkov, V., Devanbu, P., Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J., 2013. Distributed
2009. Fair and balanced?: Bias in bug-fix datasets. In: Proc. Int. Conf. representations of words and phrases and their compositionality. In: Proc.
Foundations Software Engineering. 26th Int. Conf. Neural Information Processing Systems. pp. 3111–3119.
M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652 13

Mohagheghi, P., Conradi, R., 2007. Quality, productivity and economic benefits Sun, J., 2011. Why are Bug Reports Invalid?. In: Proc. Int. Conf. Software Testing,
of software reuse: A review of industrial studies. Empir. Softw. Eng. 12 (5), Verification and Validation. pp. 407–410.
471–516. Svahnberg, M., Gorschek, T., 2017. A model for assessing and reassessing the
Nasehi, S.M., Sillito, J., Maurer, F., Burns, C., 2012. What makes a good code value of software reuse. J. Softw.: Evol. Process 29 (4), e1806.
example?: A study of programming q&a in stackoverflow. In: Proc. 28th Int. Treude, C., Barzilay, O., Storey, M.-A., 2011. How do programmers ask and answer
Conf. Software Maintenance. IEEE, Washington, DC, pp. 1063–6773. questions on the web? In: Proc. 33rd Int. Conf. Software Engineering. ACM,
Nguyen, T.H.D., Adams, B., Hassan, A.E., 2010. A case study of bias in bug-fix New York, NY, pp. 804–807.
datasets. In: Working Conf. Reverse Engineering. pp. 259–268. Vasilescu, B., Filkov, V., Serebrenik, A., 2013. Stackoverflow and github: Asso-
Nguyen, T.V., Nguyen, A.T., Phan, H.D., Nguyen, T.D., Nguyen, T.N., 2017. Com- ciations between software development and crowdsourced knowledge. In:
bining Word2Vec with revised vector space model for better code retrieval. Proc. 6th Int. Conf. Social Computing. IEEE, pp. 1–54.
In: Proc. Int. Conf. Software Engineering Companion. Wang, D., Szymanski, B.K., Abdelzaher, T., Ji, H., Kaplan, L., 2019. Software reuse
Panik, M.J., 2014. Growth Curve Modeling: Theory and Applications. Wiley, research: Status and future. Computer 52 (1), 36–45.
Hoboken, NJ. Whittaker, S., Terveen, L., Hill, W., Cherny, L., 1998. The dynamics of mass
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for interaction. In: Proc. ACM Conference on Computer Supported Cooperative
word representation. In: Proc. Conf. Empirical Methods Natural Language Work. ACM, New York, NY, pp. 257–264.
Processing. pp. 1532–1543. Wittgenstein, L., 1953. Philosophical Investigations. Blackwell, Oxford.
Ponzanelli, L., Bavota, G., Penta, M.D., Oliveto, R., Lanza, M., 2014. Mining Yang, L., Bao, S., Lin, Q., Wu, X., 2011. Analyzing and predicting not-answered
StackOverflow to turn the IDE into a self-confident programming prompter. questions in community-based question answering services. In: Proc. 25th
In: Proc. 11th Working Conf. Mining Software Repositories. pp. 102–111. AAAI Conf. Artificial Intelligence. AAAI Press, Palo Alto, CA, pp. 1273–1278.
Radjenović, D., Heric̆ko, M., Torkar, R., Z̆ivkovic̆, A., 2013. Software fault pre- Ye, X., Bunescu, R., Liu, C., 2014. Learning to rank relevant files for bug
diction metrics: A systematic literature review. Inf. Softw. Technol. 55 (8), reports using domain knowledge. In: Proc. Int. Conf. Foundations Software
1397–1418. Engineering.
Rahman, F., Khatri, S., Barr, E.T., Devanbu, P., 2014. Comparing static bug finders Zanetti, M.S., Scholtes, I., Tessone, C.J., Schweitzer, F., 2013. Categorizing bugs
and statistical prediction. In: Proc. 36th Int. Conf. Software Engineering. ACM, with social networks: A case study on four open source software communi-
New York, NY, pp. 424–434. ties. In: Proc. 35th Int. Conf. Software Engineering. ACM, New York, NY, pp.
Ray, B., Hellendoorn, V., Godhane, S., Tu, Z., Bacchelli, A., Devanbu, P., 2016. 1032–1041.
On the ‘‘naturalness" of buggy code. In: Proc. 38th Int. Conf. Software Zeller, A., 2013. Can we trust software repositories? In: Münch, J., Schmid, K.
Engineering. ACM, New York, NY, pp. 428–439. (Eds.), Perspectives on the Future of Software Engineering. Springer-Verlag,
Ray, B., Posnett, D., Filkov, V., Devanbu, P., 2014. A large scale study of Heidelberg, pp. 209–2015.
programming languages and code quality in GitHub. In: Proc. 22nd Int. Symp Zhang, Y., Lo, D., Xia, X., Sun, J.-L., 2015a. Multi-factor duplicate question
Foundations Software Engineering. ACM, New York, NY, pp. 155–165. detection in Stack Overflow. J. Comput. Sci. Tech. 30 (5), 981–997.
Rong, X., Yan, S., Oney, S., Dontcheva, M., Adar, E., 2016. Codemend: Assisting Zhang, T., Yang, G., Lee, B., Lua, E.K., 2015b. A novel developer ranking algorithm
interactive programming with bimodal embedding. In: Proc. 29th User for automatic bug triage using topic model and developer relations. In: Proc.
Interface and Software Technology Symposium. ACM, New York, NY, pp. 21st Asia-Pacific Software Engineering Conference. IEEE, Washington, DC,
247–258. 223–230.
Schugerl, P., Rilling, J., Charland, P., 2008. Mining bug repositories–A quality
assessment. In: Proc. Int. Conf. Computational Intelligence Modelling Control
Automation. pp. 1105–1110.

You might also like