Exploring Software Reusability Metrics With Q&A Forum Data
Article history:
Received 22 September 2019
Received in revised form 10 May 2020
Accepted 18 May 2020
Available online 20 May 2020

Keywords:
Software reuse
Reusability
Text mining
StackOverflow
Machine learning

Abstract

Question and answer (Q&A) forums contain valuable information regarding software reuse, but they can be challenging to analyze due to their unstructured free text. Here we introduce a new approach (LANLAN), using word embeddings and machine learning, to harness information available in StackOverflow. Specifically, we consider two different kinds of user communication describing difficulties encountered in software reuse: 'problem reports' point to potential defects, while 'support requests' ask for clarification on software usage. Word embeddings were trained on 1.6 billion tokens from StackOverflow and applied to identify which Q&A forum messages (from two large open source projects: Eclipse and Bioconductor) correspond to problem reports or support requests. LANLAN achieved an area under the receiver operator curve (AUROC) of over 0.9; it can be used to explore the relationship between software reusability metrics and difficulties encountered by users, as well as predict the number of difficulties users will face in the future. Q&A forum data can help improve understanding of software reuse, and may be harnessed as an additional resource to evaluate software reusability metrics.

© 2020 Elsevier Inc. All rights reserved.
1. Introduction

Software reuse is an important strategy for decreasing development costs and increasing productivity, as well as avoiding defects and improving software quality (Mohagheghi and Conradi, 2007). It was originally envisaged as a way to make software development more efficient through modular components that can be used over and over again in mass production (McIlroy, 1969), rather than rewriting functionality that already exists, as was (and is) common practice. Nevertheless, there is a cost to software reuse, as it is necessary to develop and maintain 'glue code' that connects the reusable component with the software under development (Svahnberg and Gorschek, 2017). There is also a concern that software written by other people may contain unknown bugs, such that it is difficult to ensure the quality of applications constructed from reused components.

The potential for bringing existing software components and knowledge to a new project depends on their 'reusability' (Frakes and Kang, 2005). Various reusability metrics have been suggested (Ampatzoglou et al., 2018), based on factors ranging from the software's complexity (structural code quality, dependencies, size etc.) through to its understandability (interface complexity and documentation). Previous researchers (Lemley and O'Brien, 1997) have considered software reuse in terms of direct costs (integrating/adapting existing software components in the new application, versus rewriting them from scratch) and indirect costs (the potential for errors and bugs in reused versus newly developed software). We hypothesize the direct costs of software reuse are likely to depend on its understandability (i.e. the software interface), while the indirect costs may be associated with its complexity (under the assumption that more complex software is more likely to go wrong).

To investigate our hypothesis, we introduce a new approach (LANLAN: Lexical ANalysis for LAbelling iNquiries) that extracts information from question and answer (Q&A) forums. LANLAN classifies questions into 'problem reports' (indicating possible defects) and 'support requests' (asking for help in understanding how to use the software). Software that has a lot of support requests demonstrates direct costs, since users/reusers have difficulty applying it, while software that has many problem reports may be more likely to harbor bugs (i.e. indirect costs). By applying statistical techniques to test the association between Q&A messages and features derived from static analysis relating to complexity and understandability, we hope to be able to explore the relationship between problem reports/support requests and software reuse.

In early research, data about problems experienced during software development and reuse was expensive or difficult to obtain, being primarily extracted from corporate testing activities (Endres, 1975) or classified military records (Goel and Okumoto, 1979). By contrast, the rise of open source software has made data publicly available for mining (Antoniol et al., 2004):

E-mail address: [email protected].
https://doi.org/10.1016/j.jss.2020.110652
0164-1212/© 2020 Elsevier Inc. All rights reserved.
2 M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652
version control repositories (such as GitHub1) contain information about changes made and the reasons for making them, whilst bug tracking databases (e.g. Bugzilla2) record observed failures along with attempts to identify and address their cause. Researchers have applied various metrics (lines of code, coupling, churn etc.; Radjenović et al., 2013) to analyze this data, and machine learning algorithms (e.g. SVM or Random Forest; Bowes et al., 2017) have been used in an attempt to improve software quality (and hence reusability).

Techniques which aim to improve software quality include those which direct developers towards specific parts of their code more likely to contain defects (Bowes et al., 2017; Hall et al., 2012) or model the overall quality and health of a software project (Jansen, 2014; Franco-Bedoya et al., 2014), but evaluation of these techniques depends on the quality and size of the available data. Bug report and version control repositories are often affected by various biases (Nguyen et al., 2010). For example, experienced developers are more likely to submit bug reports, whilst novice users often feel discouraged from contributing for fear of condescension (Lotufo et al., 2012). Bug reports can sometimes contain contradictory claims or be impossible to reproduce (Schugerl et al., 2008; Sun, 2011). For example, in one study, 40% of files marked as defective in five open source projects never actually contained a bug (Herzig et al., 2013). Q&A forums have their own biases and accuracy issues, since they depend on how users express their questions. However, by combining multiple sources of data together, we should be able to improve the robustness of our analyses when evaluating effective metrics for software reusability.

Community-driven resources, such as mailing lists and Q&A forums, allow users to describe problems and work together to fix them (Abdalkareem et al., 2017). Issues are frequently described within these resources without being reported in any other database. For example, Bachmann et al. (2010) observed that 16% of defects in the Apache web server were addressed in the software's mailing list instead of its bug tracking system. Q&A forums also contain information about software developer/user communities and their interaction (Vasilescu et al., 2013), which might be helpful for understanding the social dynamics of software reuse. However, it can be difficult to derive meaningful categorizations from the unstructured text in social media, due to subtle nuances of communication and natural language (Wang et al., 2019). In this paper, we propose a new approach (LANLAN) to mine information directly from existing Q&A forums and classify posts automatically using statistical and machine learning techniques from the field of natural language processing.

We evaluate LANLAN on two large open source projects (Eclipse and Bioconductor) through cross-validation and by testing on software different from that on which the model was trained. We apply novel approaches (association analysis and growth curve modeling) to interpret the results and find key differences between the occurrence of problem reports and support requests that may be useful in improving the reusability of software.

The remainder of this paper is organized as follows: Section 2 introduces the background and related work, Section 3 explains our approach, Section 4 describes our evaluation procedures, Section 5 provides the results and discussion, Section 6 explores the threats to validity, Section 7 presents our conclusions, and Section 8 lists the code availability.

1 GitHub: https://github.com/.
2 Bugzilla: https://www.bugzilla.org.

2. Background and related work

Q&A forum mining has frequently been applied to analyze user behavior, from early research into Usenet (Whittaker et al., 1998) through to more recent investigations of contributor motivations (Treude et al., 2011), collective knowledge (Anderson et al., 2012) and the effectiveness of code examples (Nasehi et al., 2012) in StackOverflow. Machine learning techniques have also been applied to make predictions from this data. For example, Yang et al. (2011) applied various classifiers to predict which questions will remain unanswered, whereas Zhang et al. (2015a) used Latent Dirichlet Allocation (a topic modeling approach) to predict duplicate questions. Q&A forum mining has also been used to assist software developers in an IDE prompter for Java (Ponzanelli et al., 2014) and an interactive programming tool for Python (Rong et al., 2016). In common with these studies, we apply machine learning techniques to Q&A forum data. However, as far as we are aware, our paper represents the first attempt to use data mined from Q&A forums to predict difficulties faced during software reuse.

LANLAN extracts useful information by combining Q&A forum data with other features, e.g. from the GitHub repository. GitHub is often used in repository mining, due to its large size and accessibility through an open API (Kagdi et al., 2007). For example, Ray et al. (2014) used GitHub to explore the relationship between programming language and code quality, and Zanetti et al. (2013) applied network analysis and machine learning to predict the quality of bug reports. Zhang et al. (2015b) used topic models to predict the interest and experience of developers as related to specific bug reports, assigning the most appropriate developer to fix a particular bug. In software ecosystem research (Manikas, 2016; Mens et al., 2014), software projects are compared to natural ecosystems, modeling their development using techniques normally applied in ecology or evolutionary theory. We also adapt techniques typically used in the natural sciences (growth modeling and association analysis) to interpret the data we have collected.

Zeller (2013) discussed the challenges involved in mining software repositories further. For example, it can often be difficult to distinguish fixes from other changes, such as those that add new features or refactor the code. Linking repositories to a bug database can help identify which changes relate to bugs, but even when bug databases are used, a large proportion of fixes are not recorded in them. For the Eclipse project, less than half of fixes could be linked to an entry in the bug database (Bird et al., 2009). Zeller (2013) argues software repository mining is useful despite these issues, but that it should be augmented by seeking input from project insiders or using approaches such as keyword matching (to distinguish bug fixes from other changes). We augment repository mining with information from Q&A forums and show our machine learning approach to be more effective than simple keyword matching.

Central to our approach are numerical representations of words, known as embeddings (Goth, 2016), that take inspiration from ordinary language philosophy (Wittgenstein, 1953) and structuralist linguistics (Firth, 1957). Word embeddings capture the semantics of words from a corpus according to their context (i.e. the words that surround them) (Goth, 2016). Information is distributed among a small (fixed) number of weights, with the assignment of values to these weights providing a distinct vector (and therefore semantics) for each word. A key advantage of word embeddings (compared to other natural language processing techniques, such as named entity recognition or sentence parsing) is that they provide a uniform representation, which can easily be used to train advanced machine learning models (e.g. for sentiment analysis; Giatsoglou et al., 2017).
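To make the idea of a uniform vector representation concrete, the sketch below compares toy word vectors with cosine similarity. The four-dimensional vectors and their values are invented purely for illustration; real embeddings (such as those trained in Section 3.1) have hundreds of dimensions learned from a corpus.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional vectors (invented values; real embeddings are larger).
embeddings = {
    "error":     np.array([0.9, 0.1, 0.0, 0.2]),
    "exception": np.array([0.8, 0.2, 0.1, 0.3]),
    "tutorial":  np.array([0.1, 0.9, 0.8, 0.0]),
}

# Words used in similar contexts end up with similar vectors, so a
# classifier receives fixed-length inputs that already encode semantics.
sim_related = cosine_similarity(embeddings["error"], embeddings["exception"])
sim_unrelated = cosine_similarity(embeddings["error"], embeddings["tutorial"])
```

Because every word maps to a vector of the same length, downstream models (e.g. the classifiers used by LANLAN) can consume free text without bespoke feature engineering per word.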
The earliest word embedding approaches used global factorization. For example, Latent Semantic Analysis (LSA) (Deerwester et al., 1990) constructs a matrix of counts for the number of times words occur in each document, then applies Singular Value Decomposition (SVD) to factorize it into vectors for each word. Global factorization is a coarse-grained approach for modeling semantics, and is especially limited if the documents being analyzed are large. More detailed information can be obtained through the analysis of local context (i.e. words that occur near each other), for example with the skip-gram approach (Mikolov et al., 2013), which was previously applied to documentation from the Java Development Kit for code retrieval (Nguyen et al., 2017). However, there is a danger that predicting the context of one word at a time will miss information available through global statistics. We aim to find a middle ground between these two strategies using Global Vectors for Word Representation (GloVe) (Pennington et al., 2014) to incorporate data at both the global and local scale. To the best of our knowledge, this paper represents the first time GloVe has been applied to the field of software engineering.

Global Vectors for Word Representation (GloVe) (Pennington et al., 2014) has been used on tasks as diverse as annotating videos from free text descriptions (Hendricks et al., 2018) and identifying implicit human bias/stereotyping (Greenwald, 2017). It takes advantage of local information (by counting word co-occurrences in their local context) as well as global (i.e. aggregated) statistics. By contrast with the skip-gram technique, which predicts words from their context one at a time, GloVe uses a highly parallelizable matrix factorization approach. However, instead of factorizing a global document–word count matrix (as with LSA), GloVe factorizes a matrix of word–word co-occurrences (Xij), produced using a sliding window.

LANLAN identifies features that may be indicative of difficulties in software reuse, because they are associated with support requests or problem reports. Opinion differs as to the effectiveness of using features of the software to improve quality (i.e. static analysis): Rahman et al. (2014) suggested static analysis

3.1. Training word embeddings

We trained word embeddings on questions submitted to StackOverflow, consisting of 1.6 billion tokens, with a vocabulary of 0.6 million unique words. StackOverflow questions were downloaded from the Stack Exchange Data Dump3 and then parsed using HTMLParser in Python to remove the XML tags. Prior to training, each word was tokenized and transformed into lower case. We then removed all characters that were not in the roman alphabet or one of a few specific punctuation marks (full stops, question marks and exclamation marks). All numbers (regardless of length) were replaced by the token '0' (so as to avoid creating a separate embedding for each individual number, and to treat the presence of any number as the feature we wish to encode), and code blocks were replaced by the token '<code>'. We also transformed all types of exception and error (e.g. NullPointerException) into the words 'exception' and 'error', to ensure LANLAN can easily be transferred to other datasets (which may use different exception and error types).

GloVe generates two sets of word embeddings (w and w̃) as a result of the matrix factorization (see Fig. 2). The embeddings are optimized by learning bias terms (bi and b̃j) for each set, such that the difference between the log of the original word–word matrix (X) and the matrix reconstructed from the embeddings and bias terms is as small as possible, i.e. the error term (ϵ) is reduced. This approach can be represented as an optimization function (Eq. (1)), and once the word embedding sets (wi and w̃j) are optimized, they are added together to improve their accuracy. Furthermore, a weighting function [f(x) = (x/xmax)^α if x < xmax, 1 otherwise] is applied to avoid learning only from common word pairs (where xmax = 100 and α = 3/4); for more information see Pennington et al. (2014). In our experiments, we trained word embeddings as 200-dimensional vectors and used the default window size of 15 words, because these settings were found to be effective in previous research (Pennington et al., 2014).

∑_{i,j=1}^{V} f(Xij) (wiᵀ w̃j + bi + b̃j − log(Xij))² = ϵ    (1)
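The preprocessing steps described above, together with the weighting function f(x) from Eq. (1), might be sketched as follows. The regular expressions, the sentinel token and the sample question are our own illustrative choices, not LANLAN's exact implementation.

```python
import re

def preprocess(question_text):
    """Sketch of Section 3.1's preprocessing: lower-case, collapse code
    blocks, exceptions/errors and numbers into generic tokens, and keep
    only roman letters plus sentence punctuation."""
    text = question_text.lower()
    # Replace code blocks with a sentinel, restored as '<code>' below
    # (the sentinel survives the character filter; '<code>' would not).
    text = re.sub(r"<code>.*?</code>", " codeblocksentinel ", text,
                  flags=re.DOTALL)
    # Collapse named exceptions/errors (e.g. nullpointerexception).
    text = re.sub(r"[a-z]*exception[a-z]*", " exception ", text)
    text = re.sub(r"[a-z]*error[a-z]*", " error ", text)
    # Replace every number, regardless of length, with the token '0'.
    text = re.sub(r"[0-9]+", " 0 ", text)
    # Drop everything except roman letters, '0' and . ? ! punctuation.
    text = re.sub(r"[^a-z0.?! ]", " ", text)
    # Give each punctuation mark its own token.
    text = re.sub(r"([.?!])", r" \1 ", text)
    return ["<code>" if t == "codeblocksentinel" else t
            for t in text.split()]

def glove_weight(x, x_max=100, alpha=0.75):
    """The weighting function f(x) from Eq. (1): down-weights rare
    co-occurrence counts and caps the influence of very common pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

tokens = preprocess(
    "Got a NullPointerException on line 42: <code>x.foo()</code> Why?")
```

Applied to the sample question, this yields generic 'exception', '0' and '<code>' tokens in place of the dataset-specific strings, which is what allows a model trained on one project's vocabulary to transfer to another.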
on the remaining data. Stratification ensures the same proportion of class labels is included in each (randomly selected) partition, which is particularly important when class labels are imbalanced (i.e. most questions posted to Q&A forums do not indicate defects). We train LANLAN on manually annotated questions and, to evaluate whether our approach may be transferred to other programs and datasets, we also train classification models on one program and then test them on others.

3.3. Association analysis

Association analysis is a technique for identifying properties significantly correlated with a particular trait. For example, in bioinformatics it helps discover which genetic markers affect the observable characteristics of an organism (Balding, 2006). In our work, we are interested in finding program properties (potential software reusability metrics) which could lead to an increase or decrease in the number of support requests and problem reports. To achieve this, we fit a linear model to the data and test whether the regression coefficients for each property are equal to zero (using a t-statistic). The results can then be used to infer the probability that each property is significantly correlated with the number of questions that report potential defects (problem reports) or ask for help using the software (support requests).

Linear models assume each data point is independent (we ensured this by treating each thread in the Q&A forum as an individual sample); the residuals (i.e. the differences between the fitted model and the data) should follow a normal distribution (we tested this using a Q–Q plot); the variance of the residuals should be homogeneous; and there should be a linear relationship between the dependent and independent variables (this was tested using a plot of residuals against fitted values). It is important to ensure these assumptions are met for us to have confidence in our evaluation of the significance of each property.

We use Bonferroni correction (Dunn, 1961) to address the multiple comparisons problem (i.e. the more properties we test, the more likely p-values will be significant by chance). This involves dividing the standard significance threshold (0.05) by the number of comparisons (i.e. properties) to identify those which have a high likelihood of being significant. This is a conservative measure, since some program properties are likely to be correlated with each other. As well as applying association analysis to each property, we also identify subsets of properties that are almost as descriptive of the underlying factors as the entire set. We do this by evaluating the multiple r² value of all sets of properties of size five. This procedure is applied separately for problem reports and support requests, to identify the most important properties for understanding the factors behind the number of questions in each category.

3.4. Growth curve modeling

To illustrate how the classification models produced by LANLAN may be used to predict the rate at which support requests and problem reports occur, we analyze the resulting data using growth curve models. Growth curve modeling offers a way to understand and compare the dynamics of problem reports and support requests over the software's lifetime. Although this technique has rarely been applied in software engineering, it is popular in a variety of fields, such as economics, public health, ecology and social demography (Panik, 2014).
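As a concrete sketch of growth curve fitting, the snippet below fits a three-parameter logistic with scipy. The κ/β/δ parameterization (upper asymptote, growth rate, inflection point) is our assumption, chosen to match the parameter names reported later in Table 4; the paper's exact functional form may differ, and the monthly counts here are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, kappa, beta, delta):
    """Three-parameter logistic: kappa = upper asymptote,
    beta = growth rate, delta = inflection point (months)."""
    return kappa / (1.0 + np.exp(-beta * (t - delta)))

# Synthetic cumulative monthly question counts for one package
# (illustrative only; real counts come from the LANLAN classifier).
months = np.arange(60.0)
rng = np.random.default_rng(0)
observed = logistic(months, 90.0, 0.1, 12.0) + rng.normal(0.0, 2.0, months.size)

# Least-squares fit, as with R's nls(); starting values p0 are guesses.
(kappa_hat, beta_hat, delta_hat), _ = curve_fit(
    logistic, months, observed, p0=[50.0, 0.05, 10.0])
```

Fitting on only the first N months and then comparing the extrapolated `kappa_hat` against the eventual total mirrors the asymptote-prediction experiment described in the results.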
4. Evaluation

Table 2
Evaluating how well our machine learning approach (LANLAN) generalizes to new programs.

                              AspectJ^a   JDT     edgeR   PROcess^b
LANLAN            AUROC       0.930       0.889   0.921   0.970
                  Precision   0.810       0.659   0.842   0.919
                  Recall      0.720       0.577   0.330   0.752
Keyword features  AUROC       0.562       0.684   0.680   0.692
                  Precision   0.714       0.630   0.750   0.857
                  Recall      0.096       0.351   0.247   0.309
Keyword matching  AUROC       NA          NA      NA      NA
                  Precision   0.423       0.338   0.329   0.324
                  Recall      0.312       0.505   0.560   0.657

a Trained and tested on the same program.
b Trained on the previous 3 programs.
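The metrics in Table 2 can be computed without library support. The sketch below uses the rank-based (Mann–Whitney) formulation of AUROC, i.e. the probability that a randomly chosen positive example is scored above a randomly chosen negative one; it also makes clear why plain keyword matching, which produces hard labels rather than scores, has no AUROC (reported as NA above). The toy labels and scores are invented for illustration.

```python
def auroc(labels, scores):
    """AUROC via its rank-sum interpretation: the probability that a
    randomly chosen positive is scored above a randomly chosen
    negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def precision_recall(labels, predicted):
    """Precision and recall for binary predictions (e.g. 1 = problem
    report, 0 = other question)."""
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

# Toy example: two problem reports (1) and two other questions (0).
labels = [1, 1, 0, 0]
auc = auroc(labels, [0.9, 0.4, 0.5, 0.1])
precision, recall = precision_recall(labels, [1, 0, 1, 0])
```

An AUROC of 0.5 corresponds to chance, and 1.0 to a perfect ranking, which is why values over 0.9 in Table 2 indicate strong separation between the classes.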
Table 4
Mann–Whitney tests for curve fitting.

                         Mean     SD       W        P-value
κ   Problem reports      32.4     34.9     33960    1.17 × 10⁻²⁰
    Support requests     87.3     124
β   Problem reports      0.0414   0.0283   16785    9.89 × 10⁻⁵
    Support requests     0.0376   0.0365
δ   Problem reports      −22.0    15.9     19297    0.0925
    Support requests     −30.2    34.4
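Comparisons such as those in Table 4 can be reproduced with scipy's implementation of the Mann–Whitney U test. The fitted-parameter samples below are synthetic, drawn only to resemble the reported means and standard deviations for κ; they are not the real per-package fits.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Synthetic per-package asymptotes (kappa), illustrative only.
kappa_problem = rng.normal(32.4, 34.9, size=250).clip(min=0.1)
kappa_support = rng.normal(87.3, 124.0, size=250).clip(min=0.1)

# Two-sided rank test: no normality assumption on the fitted parameters.
stat, p_value = mannwhitneyu(kappa_support, kappa_problem,
                             alternative="two-sided")
```

The rank-based test is appropriate here because the fitted parameters are heavily skewed (note the large standard deviations relative to the means), so a t-test's normality assumption would be doubtful.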
new support requests are not needed. Alternatively, the rate of new support requests may decrease as other packages become more popular, but ArrayExpress is still actively used (particularly with the rise of single cell analysis), so this seems less likely.

We tested our predictions by training growth models on the occurrence of support requests and problem reports in the first N months since creation, then evaluated their accuracy on the subsequent months. As more data (months) are added, the predictions move closer to the correct value. On average, we found that (when trained on the first half of the data) our prediction of the asymptote was only 6.2% and 8.7% away from the final value, for support requests and problem reports respectively. As a further independent evaluation, we compared our predictions for problem reports against the BugDB database for Eclipse (Ye et al., 2014). We observed some differences between the rate of problem reports in Q&A forum data and software failures in BugDB (see Fig. 9), but the overall shape of the time series is similar (being suggestive of exponential growth) and pairwise tests of the area under the curve for each program showed no statistically significant differences (Student's t-test: p = 0.152; Wilcoxon signed rank test: p = 0.188). Neither the Q&A forum nor the bug tracking database provides complete information, and both are prone to random noise, but by combining them together we believe a more accurate estimation of the problems encountered can be achieved.

Fig. 8. Example growth curve fitting (on ArrayExpress).

We compared growth curves fitted for problem reports against those for support requests. Comparing fitted parameters across all the Bioconductor packages, we found the upper asymptote (κ) to be significantly lower for problem reports than support requests, but the growth rate (β) was significantly higher (see Table 4). This suggests that, although the number of problem reports will ultimately be smaller, they grow more quickly; many programming issues are identified early, whereas support requests grow as the number of users increases. Although the average difference in the delay parameter (δ) is small, the distributions differ considerably (symmetrical for problem reports, but skewed for support requests). Since the peak is further left for problem reports, the point of inflection for most packages will come earlier (if at all), which makes sense considering programming issues are often identified early. Nevertheless, the long tail to the left of the support requests' distribution means that, for some packages, the growth curve is monomolecular. These packages may be poorly written or documented (at least early in their life).

5.4. Utility of problem reports and support requests (answer to RQ4)

To answer this question, we sampled 10 problem reports (Table 5) and 10 support requests (Table 6) at random from AspectJ and investigated them to assess their potential relationship to
Table 5
Problem reports sampled from AspectJ.

Paraphrased question                     #Comments/Answers   Code included?   Accepted answer?   BugDB?
AspectJ Maven plugin not executed        6/3                 Yes              Yes                No
IntelliJ not working with AspectJ        4/1                 Yes              No                 No
AspectJ not working in Kotlin            12/4                Yes              No                 No
Problem using AspectJ in MinGW           0/1                 Yes              Yes                No
Difficulty configuring Spring security   1/1                 Yes              Yes                Yes
Error when using Spring Roo              2/1                 No               No                 No
Spring security mode not working         1/2                 Yes              Yes                Yes
Lombok not working with AspectJ          1/1                 Yes              No                 No
JUnit not working with Spring            4/0                 Yes              No                 No
Pointcut not matching correctly          5/1                 Yes              Yes                Yes
necessarily more likely to be incorrect, whereas poor coding style often produces defects. These findings illustrate the effectiveness of LANLAN in classifying Q&A forum posts into useful categories (problem reports and support requests) for exploring potential software reusability metrics, revealing aspects of the various issues that can make software reuse more difficult. By improving understanding of the features that affect reusability, our research constitutes a first step towards the development of powerful new tools to assist software development. For example, the information gained from this study regarding which metrics are more indicative of problem reports or support requests could be used to automatically highlight potential reusability issues, and growth models can predict how quickly problems are likely to arise, thus guiding efficient management of focused interventions to improve reusability. Nevertheless, this endeavor would require considerable effort and may need to be tailored to different software fields.

8. Code availability

The following code was used in this paper and is available from the links below:
GloVe: Word embedding (https://github.com/stanfordnlp/GloVe)
MLR: Machine learning (https://github.com/mlr-org/mlr/)
lm: Association analysis (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html)
nls: Least squares curve fitting (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/nls.html)
rjags: Bayesian curve fitting (https://cran.r-project.org/web/packages/rjags/index.html)

CRediT authorship contribution statement

Matthew T. Patrick: Conceptualization, Methodology, Validation, Investigation, Data curation, Visualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abdalkareem, R., Shihab, E., Rilling, J., 2017. What do developers use the crowd for? A study using Stack Overflow. IEEE Softw. 34 (2), 53–60.
Ammann, P., Offutt, J., 2016. Introduction to Software Testing. Cambridge University Press, Cambridge, United Kingdom.
Ampatzoglou, A., Bibi, S., Chatzigeorgiou, A., Avgeriou, P., Stamelos, I., 2018. Reusability index: A measure for assessing software assets reusability. In: Proc. 17th Int. Conf. Software Reuse. Springer, pp. 43–58.
Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J., 2012. Discovering value from community activity on focused question answering sites: A case study of Stack Overflow. In: Proc. 18th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining. ACM, New York, NY, pp. 850–858.
Antoniol, G., Gall, H., Di Penta, M., Pinzger, M., 2004. Mozilla: Closing the Circle. Tech. Rep. TUV-1841-2004-05, Technical University of Vienna.
Bachmann, A., Bird, C., Rahman, F., Devanbu, P., Bernstein, A., 2010. The missing links: Bugs and bug-fix commits. In: Int. Symp. Foundations Software Engineering. pp. 97–106.
Balding, D.J., 2006. A tutorial on statistical methods for population association studies. Nature Rev. Genet. 7, 781–791.
Bird, C., Bachmann, A., Aune, E., Duffy, J., Bernstein, A., Filkov, V., Devanbu, P., 2009. Fair and balanced?: Bias in bug-fix datasets. In: Proc. Int. Conf. Foundations Software Engineering.
Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., Jones, Z.M., 2016. mlr: Machine learning in R. J. Mach. Learn. Res. 17 (170), 1–5.
Bowes, D., Hall, T., Petrić, J., 2017. Software defect prediction: Do different classifiers find the same defects? Softw. Qual. J. 1–28.
Brown, A.W., Booch, G., 2002. Reusing open-source software and practices: The impact of open-source on commercial vendors. In: Proc. 7th Int. Conf. Software Reuse. Springer, pp. 123–136.
Buse, R.P.L., Weimer, W.R., 2010. Learning a metric for code readability. IEEE Trans. Softw. Eng. 36 (4), 546–558.
Chidamber, S.R., Kemerer, C.F., 1994. A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20 (6), 476–493.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R., 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inf. Sci. 41 (6), 391–407.
Dunn, O.J., 1961. Multiple comparisons among means. J. Amer. Stat. Assoc. 56 (293), 52–64.
Endres, A., 1975. An analysis of errors and their causes in system programs. IEEE Trans. Softw. Eng. SE-1 (2), 140–149.
Firth, J.R., 1957. A synopsis of linguistic theory, 1930–1955. Blackwell, Oxford.
Frakes, W.B., Kang, K., 2005. Software reuse research: Status and future. IEEE Trans. Softw. Eng. 31 (7), 529–535.
Franco-Bedoya, O., Ameller, D., Costal, D., Franch, X., 2014. QuESo: A quality model for open source software ecosystems. In: Proc. 9th Int. Conf. Software Technologies. IEEE, Washington, DC, pp. 39–62.
Gentleman, R.C., Carey, V.J., Huber, W., Irizarry, R., Dudoit, S., 2005. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York, NY.
Giatsoglou, M., Vozalis, M.G., Diamantaras, K., Vakali, A., Sarigiannidis, G., Chatzisavvas, K.C., 2017. Sentiment analysis leveraging emotions and word embeddings. Expert Syst. Appl. 69, 214–224.
Goel, A.L., Okumoto, K., 1979. Time-dependent error-detection rate model for software reliability and other performance measures. IEEE Trans. Reliab. R-28 (3), 206–211.
Goth, G., 2016. Deep or shallow, NLP is breaking out. Commun. ACM 59 (3), 13–16.
Greenwald, A.G., 2017. An AI stereotype catcher. Science 356 (6334), 133–134.
Hall, T., Beecham, S., Bowes, D., Gray, D., Counsell, S., 2012. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 38 (6), 1276–1304.
Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B., 2018. Localizing moments in video with natural language. In: Proc. Int. Conf. Computer Vision. pp. 1380–1390.
Herzig, K., Just, S., Zeller, A., 2013. It's not a bug, it's a feature: How misclassification impacts bug prediction. In: Proc. Int. Conf. Software Engineering. pp. 392–401.
Hummel, O., Janjic, W., Atkinson, C., 2008. Code Conjurer: Pulling reusable software out of thin air. IEEE Softw. 25 (5), 45–52.
Jansen, S., 2014. Measuring the health of open source software ecosystems: Beyond the scope of project health. Inf. Softw. Technol. 56, 1508–1519.
Johnson, B., Song, Y., Murphy-Hill, E., Bowdidge, R., 2013. Why don't software developers use static analysis tools to find bugs? In: Proc. 35th Int. Conf. Software Engineering. ACM, New York, NY, pp. 672–681.
Kagdi, H., Collard, M.L., Maletic, J.I., 2007. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. J. Softw.: Evol. Process 19, 77–131.
Kenter, T., de Rijke, M., 2015. Short text similarity with word embeddings. In: Proc. Int. Conf. Information and Knowledge Management.
Lemley, M., O'Brien, D., 1997. Encouraging software reuse. Stanf. Law Rev. 49 (2), 255–304.
Lotufo, R., Passos, L., Czarnecki, K., 2012. Towards improving bug tracking systems with game mechanisms. In: Working Conf. Mining Software Repositories. pp. 2–11.
Manikas, K., 2016. Revisiting software ecosystems research: A longitudinal literature study. J. Syst. Softw. 117, 84–103.
Martinez, J., Ziadi, T., Bissyandé, T.F., Klein, J., Traon, Y.L., 2017. Bottom-up technologies for reuse: Automated extractive adoption of software product lines. In: Proc. 39th Int. Conf. Software Engineering. IEEE, pp. 67–70.
Martinez, J., Ziadi, T., Papadakis, M., Bissyandé, T.F., Klein, J., Traon, Y.L., 2002. Feature location benchmark for software families using Eclipse community releases. In: Proc. 15th Int. Conf. Software Reuse. Springer, pp. 267–283.
McCabe, T.J., 1976. A complexity measure. IEEE Trans. Softw. Eng. 2 (4), 308–320.
McIlroy, M., 1969. Mass produced software components. In: Naur, P., Randell, B. (Eds.), NATO Science Committee Report. NATO Scientific Affairs Division, Belgium, pp. 1–136.
Mens, T., Claes, M., Grosjean, P., 2014. ECOS: Ecological studies of open source software ecosystems. In: Proc. IEEE Conf. Software Maintenance, Reengineering, and Reverse Engineering. IEEE, Washington, DC, pp. 403–406.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In: Proc. 26th Int. Conf. Neural Information Processing Systems. pp. 3111–3119.
M.T. Patrick / The Journal of Systems & Software 168 (2020) 110652 13
Mohagheghi, P., Conradi, R., 2007. Quality, productivity and economic benefits of software reuse: A review of industrial studies. Empir. Softw. Eng. 12 (5), 471–516.
Nasehi, S.M., Sillito, J., Maurer, F., Burns, C., 2012. What makes a good code example?: A study of programming Q&A in StackOverflow. In: Proc. 28th Int. Conf. Software Maintenance. IEEE, Washington, DC.
Nguyen, T.H.D., Adams, B., Hassan, A.E., 2010. A case study of bias in bug-fix datasets. In: Working Conf. Reverse Engineering. pp. 259–268.
Nguyen, T.V., Nguyen, A.T., Phan, H.D., Nguyen, T.D., Nguyen, T.N., 2017. Combining Word2Vec with revised vector space model for better code retrieval. In: Proc. Int. Conf. Software Engineering Companion.
Panik, M.J., 2014. Growth Curve Modeling: Theory and Applications. Wiley, Hoboken, NJ.
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for word representation. In: Proc. Conf. Empirical Methods Natural Language Processing. pp. 1532–1543.
Ponzanelli, L., Bavota, G., Di Penta, M., Oliveto, R., Lanza, M., 2014. Mining StackOverflow to turn the IDE into a self-confident programming prompter. In: Proc. 11th Working Conf. Mining Software Repositories. pp. 102–111.
Radjenović, D., Heričko, M., Torkar, R., Živković, A., 2013. Software fault prediction metrics: A systematic literature review. Inf. Softw. Technol. 55 (8), 1397–1418.
Rahman, F., Khatri, S., Barr, E.T., Devanbu, P., 2014. Comparing static bug finders and statistical prediction. In: Proc. 36th Int. Conf. Software Engineering. ACM, New York, NY, pp. 424–434.
Ray, B., Hellendoorn, V., Godhane, S., Tu, Z., Bacchelli, A., Devanbu, P., 2016. On the ‘‘naturalness’’ of buggy code. In: Proc. 38th Int. Conf. Software Engineering. ACM, New York, NY, pp. 428–439.
Ray, B., Posnett, D., Filkov, V., Devanbu, P., 2014. A large scale study of programming languages and code quality in GitHub. In: Proc. 22nd Int. Symp. Foundations Software Engineering. ACM, New York, NY, pp. 155–165.
Rong, X., Yan, S., Oney, S., Dontcheva, M., Adar, E., 2016. CodeMend: Assisting interactive programming with bimodal embedding. In: Proc. 29th User Interface and Software Technology Symposium. ACM, New York, NY, pp. 247–258.
Schugerl, P., Rilling, J., Charland, P., 2008. Mining bug repositories – A quality assessment. In: Proc. Int. Conf. Computational Intelligence Modelling Control Automation. pp. 1105–1110.
Sun, J., 2011. Why are bug reports invalid? In: Proc. Int. Conf. Software Testing, Verification and Validation. pp. 407–410.
Svahnberg, M., Gorschek, T., 2017. A model for assessing and reassessing the value of software reuse. J. Softw.: Evol. Process 29 (4), e1806.
Treude, C., Barzilay, O., Storey, M.-A., 2011. How do programmers ask and answer questions on the web? In: Proc. 33rd Int. Conf. Software Engineering. ACM, New York, NY, pp. 804–807.
Vasilescu, B., Filkov, V., Serebrenik, A., 2013. StackOverflow and GitHub: Associations between software development and crowdsourced knowledge. In: Proc. 6th Int. Conf. Social Computing. IEEE, pp. 1–54.
Wang, D., Szymanski, B.K., Abdelzaher, T., Ji, H., Kaplan, L., 2019. Software reuse research: Status and future. Computer 52 (1), 36–45.
Whittaker, S., Terveen, L., Hill, W., Cherny, L., 1998. The dynamics of mass interaction. In: Proc. ACM Conf. Computer Supported Cooperative Work. ACM, New York, NY, pp. 257–264.
Wittgenstein, L., 1953. Philosophical Investigations. Blackwell, Oxford.
Yang, L., Bao, S., Lin, Q., Wu, X., 2011. Analyzing and predicting not-answered questions in community-based question answering services. In: Proc. 25th AAAI Conf. Artificial Intelligence. AAAI Press, Palo Alto, CA, pp. 1273–1278.
Ye, X., Bunescu, R., Liu, C., 2014. Learning to rank relevant files for bug reports using domain knowledge. In: Proc. Int. Conf. Foundations Software Engineering.
Zanetti, M.S., Scholtes, I., Tessone, C.J., Schweitzer, F., 2013. Categorizing bugs with social networks: A case study on four open source software communities. In: Proc. 35th Int. Conf. Software Engineering. ACM, New York, NY, pp. 1032–1041.
Zeller, A., 2013. Can we trust software repositories? In: Münch, J., Schmid, K. (Eds.), Perspectives on the Future of Software Engineering. Springer-Verlag, Heidelberg, pp. 209–215.
Zhang, Y., Lo, D., Xia, X., Sun, J.-L., 2015a. Multi-factor duplicate question detection in Stack Overflow. J. Comput. Sci. Tech. 30 (5), 981–997.
Zhang, T., Yang, G., Lee, B., Lua, E.K., 2015b. A novel developer ranking algorithm for automatic bug triage using topic model and developer relations. In: Proc. 21st Asia-Pacific Software Engineering Conference. IEEE, Washington, DC, pp. 223–230.