A Deep Learning Model For Estimating Story Points
Abstract—Although there has been substantial research in software analytics for effort estimation in traditional software projects, little
work has been done for estimation in agile projects, especially estimating the effort required for completing user stories or issues. Story
points are the most common unit of measure used for estimating the effort involved in completing a user story or resolving an issue. In
this paper, we propose a prediction model for estimating story points based on a novel combination of two powerful deep learning
architectures: long short-term memory and recurrent highway network. Our prediction system is end-to-end trainable from raw input
data to prediction outcomes without any manual feature engineering. We offer a comprehensive dataset for story points-based
estimation that contains 23,313 issues from 16 open source projects. An empirical evaluation demonstrates that our approach
consistently outperforms three common baselines (Random Guessing, Mean, and Median methods) and six alternatives (e.g. using
Doc2Vec and Random Forests) in Mean Absolute Error, Median Absolute Error, and the Standardized Accuracy.
Index Terms—software analytics, effort estimation, story point estimation, deep learning.
knowledge, this is the largest dataset (in terms of number of data points) for story point estimation where the focus is at the issue/user story level rather than at the project level as in traditional effort estimation datasets.

We also propose a prediction model which supports a team by recommending a story-point estimate for a given user story. Our model learns from the team's previous story point estimates to predict the size of new issues. This prediction system will be used in conjunction with (instead of as a replacement for) existing estimation techniques practiced by the team. It can be used in a completely automated manner, i.e. the team will use the story points given by the prediction system. Alternatively, it could be used as a decision support system and take part in the estimation process. This is similar to the notion of combination-based effort estimation, in which estimates come from different sources, e.g. a combination of expert and formal model-based estimates [10].

The key novelty of our approach resides in the combination of two powerful deep learning architectures: long short-term memory (LSTM) and recurrent highway network (RHN). LSTM allows us to model the long-term context in the textual description of an issue, while RHN provides us with a deep representation of that model. We named this approach the Deep learning model for Story point Estimation (Deep-SE).

Our Deep-SE model is a fully end-to-end system where raw data signals (i.e. words) are passed from input nodes up to the final output node for estimating story points, and the prediction errors are propagated from the output node all the way back to the word layer. Deep-SE automatically learns semantic features which represent the meaning of user stories or issue reports, thus liberating users from manually designing and extracting features. Feature engineering usually relies on domain experts who use their specific knowledge of the data to create features for machine learners to work on. For example, the performance of most existing defect prediction models heavily relies on the careful design of good features (e.g. size of code, number of dependencies, cyclomatic complexity, and code churn metrics) which can discriminate between defective and non-defective code [26]. Coming up with good features is difficult, time-consuming, and requires domain-specific knowledge, and hence poses a major challenge. In many situations, manually designed features do not generalize well: features that work well in a certain software project may not perform well in other projects [27]. Bag-of-Words (BoW) is a traditional technique to "engineer" features representing textual data like issue descriptions. However, the BoW approach has two major weaknesses: it ignores the semantics of words, e.g. it fails to recognize the semantic relation between "she" and "he", and it ignores the sequential nature of text. In our approach, features are automatically learned, thus obviating the need for designing them manually.

Although our Deep-SE is a deep model of multiple layers, it is recurrent and thus model parameters are shared across layers. Hence, the number of parameters does not grow with the depth, which helps avoid overfitting. We also employ a number of common techniques such as dropout and early stopping to combat overfitting. Our approach consistently outperforms three common baseline estimators, namely the Random Guessing, Mean, and Median methods, and six alternatives (e.g. using Doc2Vec and Random Forests) in Mean Absolute Error, Median Absolute Error, and the Standardized Accuracy. These claims have also been tested using a non-parametric Wilcoxon test and Vargha and Delaney's statistic to demonstrate the statistical significance and the effect size.

The remainder of this paper is organized as follows. Section 2 provides a background on story point estimation and deep neural networks. We then present the Deep-SE model and explain how it can be trained in Section 3 and Section 4 respectively. Section 5 reports on the experimental evaluation of our approach. Related work is discussed in Section 6 before we conclude and outline future work in Section 7.

2 BACKGROUND

2.1 Story point estimation

When a team estimates with story points, it assigns a point value (i.e. story points) to each user story. A story point estimate reflects the relative amount of effort involved in resolving or completing the user story: a user story that is assigned two story points should take twice as much effort as a user story assigned one story point. Many projects have now adopted this story point estimation approach [25]. Projects that use issue tracking systems (e.g. JIRA [28]) record their user stories as issues. Figure 1 shows an example of issue XD-2970 in the Spring XD project [29] which is recorded in JIRA. An issue typically has a title (e.g. "Standardize XD logging to align with Spring Boot") and a description. Projects that use JIRA Agile also record story points. For example, the issue in Figure 1 has 8 story points.

Fig. 1. An example of an issue with estimated story points

Story points are usually estimated by the whole team within a project. For example, the widely-used Planning Poker [30] method suggests that each team member provides an estimate and a consensus estimate is reached after a few rounds of discussion and (re-)estimation. This practice is different from traditional approaches (e.g. function points) in several aspects. Both story points and function points reflect the effort for resolving an issue. However, function points can be determined by an external estimator based on
a standard set of rules (e.g. counting inputs, outputs, and inquiries) that can be applied consistently by any trained practitioner. On the other hand, story points are developed by a specific team based on the team's cumulative knowledge and biases, and thus may not be useful outside the team (e.g. in comparing performance across teams). Since story points represent the effort required for completing a user story, an estimate should cover the different factors which can affect that effort. These factors include how much work needs to be done, the complexity of the work, and any uncertainty involved in the work [24].

In agile development, user stories or issues are commonly viewed as the first-class entity of a project since they describe what has to be built in the software project, forming the basis for design, implementation and testing. Story point sizes are used for measuring a team's progress rate, prioritizing user stories, planning and scheduling future iterations and releases, and even costing and allocating resources. Story points are also the basis for other effort-related estimation. For example, in our recent work [31], they are used for predicting delivery capability for an ongoing iteration. Specifically, we predict the amount of work delivered at the end of an iteration, relative to the amount of work which the team originally committed to. The amount of work done in an iteration is then quantified in terms of story points from the issues completed within that iteration. To enable such a prediction, we take into account both the information of an iteration and the user stories or issues involved in the iteration. Interactions between user stories, and between user stories and resources, are captured by extracting information related to the dependencies between user stories and the assignment of user stories to developers.

Velocity is the sum of the story-point estimates of the issues that the team resolved during an iteration. For example, if the team resolves four stories each estimated at three story points, their velocity is twelve. Velocity is used for planning and predicting when a software product (or a release) should be completed. For example, if the team estimates the next release to include 100 story points and the team's current velocity is 20 points per 2-week iteration, then it would take 5 iterations (or 10 weeks) to complete the project. Hence, it is important that the team is consistent in their story point estimates to avoid reducing the predictability in planning and managing their project. A machine learner can help the team maintain this consistency, especially in coping with increasingly large numbers of issues. It does so by learning insights from past issues and estimations to make future estimations.

2.2 Long Short-Term Memory

Long Short-Term Memory (LSTM) [32, 33] is a special variant of recurrent neural networks [34]. While a feedforward neural network maps an input vector into an output vector, an LSTM network uses a loop that allows information to persist, and it can map a sequence into a sequence (see Figure 2). Let w_1, ..., w_n be the input sequence (e.g. words in a sentence) and y_1, ..., y_n be the sequence of corresponding labels (e.g. the next words). At time step t, an LSTM unit reads the input w_t, the previous hidden state h_{t-1}, and the previous memory c_{t-1} in order to compute the hidden state h_t. The hidden state is used to produce an output at each step t. For example, the output for predicting the next word k in a sentence would be a vector of probabilities across our vocabulary, i.e. softmax(V_k h_t), where V_k is a row in the output parameter matrix W_out.

Fig. 2. An LSTM network

The most important element of LSTM is a short-term memory cell: a vector that stores accumulated information over time. The information stored in the memory is refreshed at each time step through partially forgetting old, irrelevant information and accepting fresh new input. An LSTM unit uses the forget gate f_t to control how much information from the memory of the previous context (i.e. c_{t-1}) should be removed from the memory cell. The forget gate looks at the previous output state h_{t-1} and the current word w_t, and outputs a number between 0 and 1. A value of 1 indicates that all the past memory is preserved, while a value of 0 means "completely forget everything". The next step is updating the memory with new information obtained from the current word w_t. The input gate i_t is used to control which new information will be stored in the memory. Information stored in the memory cell is then used to produce an output h_t. The output gate o_t looks at the current word w_t and the previous hidden state h_{t-1}, and determines which parts of the memory should be output.

Fig. 3. The internal structure of an LSTM unit
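The paper's own gate equations are not reproduced in this excerpt. For reference, the standard LSTM formulation, which matches the description above (w_t is the current word vector, h_{t-1} the previous hidden state, c_{t-1} the previous memory, sigma the logistic sigmoid, and * element-wise multiplication), is commonly written as:

f_t = sigma(W_f w_t + U_f h_{t-1} + b_f)        (forget gate)
i_t = sigma(W_i w_t + U_i h_{t-1} + b_i)        (input gate)
o_t = sigma(W_o w_t + U_o h_{t-1} + b_o)        (output gate)
c~_t = tanh(W_c w_t + U_c h_{t-1} + b_c)        (candidate memory)
c_t = f_t * c_{t-1} + i_t * c~_t                (memory update)
h_t = o_t * tanh(c_t)                           (output state)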
This mechanism allows LSTM to effectively learn long-term dependencies in text. Consider trying to predict the last word in the following text extracted from the description of issue XD-2970 in Figure 1: "Boot uses slf4j APIs backed by logback. This causes some build incompatibilities ... An additional step is to replace log4j with ...". Recent information suggests that the next word is probably the name of a logging library,
but if we want to narrow down to a specific library, we need to remember from the earlier text that "logback" and "log4j" are logging libraries. There could be a big gap between relevant information and the point where it is needed, but LSTM is capable of learning to connect the information. In fact, LSTM has demonstrated ground-breaking results in many applications such as language models [35], speech recognition [36] and video analysis [37].

The reading of the new input, writing of the output, and the forgetting (i.e. all those gates) are all learnable. As a recurrent network, an LSTM network shares the same parameters across all steps since the same task is performed at each step, just with different inputs. Thus, compared to traditional feedforward networks, using an LSTM network significantly reduces the total number of parameters which we need to learn. An LSTM model is trained using many input sequences with known actual output sequences. Learning is done by minimizing the error between the actual output and the predicted output by adjusting the model parameters. Learning involves computing the gradient of the loss L(theta) during the backpropagation phase, and parameters are updated using stochastic gradient descent, which means that parameters are updated after seeing only a small random subset of sequences. We refer the readers to the seminal paper [32] for more details about LSTM.

3 APPROACH

Our overall research goal is to build a prediction system that takes as input the title and description of an issue and produces a story-point estimate for the issue. Title and description are required information for any issue tracking system. Hence, our prediction system is applicable to a wide range of issue tracking systems, and can be used at any time, even when an issue has just been created.

We combine the title and description of an issue report into a single text document where the title is followed by the description. Our approach computes vector representations for these documents. These representations are then used as features to predict the story points of each issue. It is important to note that these features are automatically learned from raw text, hence freeing us from manually engineering the features.

Figure 4 shows the Deep learning model for Story point Estimation (Deep-SE) that we have designed for the story point prediction system. It is composed of four components arranged sequentially: (i) word embedding, (ii) document representation using Long Short-Term Memory (LSTM) [32], (iii) deep representation using a Recurrent Highway Net (RHWN) [38], and (iv) differentiable regression.

Fig. 4. Deep learning model for Story point Estimation (Deep-SE). The input layer (bottom) is a sequence of words (represented as filled circles). Words are first embedded into a continuous space, then fed into the LSTM layer. The LSTM outputs a sequence of state vectors, which are then pooled to form a document-level vector. This global vector is then fed into a Recurrent Highway Net for multiple transformations (see Eq. (1) for detail). Finally, a regressor predicts an outcome (story-point).

Consider a document which consists of a sequence of words s = (w_1, w_2, ..., w_n), e.g. the word sequence (Standardize, XD, logging, to, align, with, ...) in the title and description of issue XD-2970 in Figure 1. We model a document's semantics based on the principle of compositionality: the meaning of a document is determined by the meanings of its constituents (e.g. words) and the rules used to combine them (e.g. one word followed by another). Hence, our approach models document representation in two stages. It first converts each word in a document into a fixed-length vector (i.e. word embedding). These word vectors then serve as an input sequence to the Long Short-Term Memory (LSTM) layer which computes a vector representation for the whole document. After that, the document vector is fed into the Recurrent Highway Network (RHWN), which transforms the document vector multiple times, before outputting a final vector which represents the text. That vector serves as input for the regressor which predicts the output story-point. While many existing regressors can be employed, we are mainly interested in regressors that are differentiable with respect to the training signal and the input vector. In our implementation, we use a simple linear regression that outputs the story-point estimate.

Our entire system is trainable from end to end: (a) data signals are passed from the words in issue reports to the final output node; and (b) the prediction error is propagated from the output node all the way back to the word layer.

3.1 Word embedding

We represent each word as a low-dimensional, continuous and real-valued vector, also known as a word embedding. Here we maintain a look-up table, which is a word embedding matrix M in R^{d x |V|}, where d is the dimension of a word vector and |V| is the vocabulary size. These word vectors are pre-trained from corpora of issue reports, which will be described in detail in Section 4.1.
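For readers who prefer code, the pipeline described above (embedding, LSTM, pooling, recurrent highway transformation, linear regression) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class name, layer sizes, the mean-pooling choice and the use of PyTorch are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

class DeepSESketch(nn.Module):
    """Illustrative sketch of the Deep-SE pipeline (not the original code)."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=100, n_highway=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # word embedding matrix M
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # one highway transformation whose weights are reused at every "layer",
        # mirroring the parameter sharing described in the paper
        self.transform = nn.Linear(hidden_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim, hidden_dim)
        self.n_highway = n_highway
        self.regressor = nn.Linear(hidden_dim, 1)                  # differentiable regression

    def forward(self, tokens):                 # tokens: (batch, seq_len) word ids
        x = self.embed(tokens)                 # (batch, seq_len, embed_dim)
        states, _ = self.lstm(x)               # per-word state vectors
        doc = states.mean(dim=1)               # pooling -> document-level vector (assumed mean)
        h = doc
        for _ in range(self.n_highway):        # repeated (recurrent) highway transformation
            t = torch.sigmoid(self.gate(h))    # transform/carry gate
            h = t * torch.tanh(self.transform(h)) + (1 - t) * h
        return self.regressor(h).squeeze(-1)   # story-point estimate
```

Because every component is differentiable, the mean absolute error on the training issues can be back-propagated from the regression output all the way to the embedding layer, which is what makes the system end-to-end trainable.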
3.2 Document representation using LSTM

Since an issue document consists of a sequence of words, we model the document by accumulating information from the start to the end of the sequence. A powerful accumulator is a Recurrent Neural Network (RNN) [34], which can be seen as multiple copies of the same single-hidden-layer network, each passing information to a successor; recurrent networks thus allow information to be accumulated. While RNNs are theoretically powerful, they are difficult to train for long sequences [34], which are often seen in issue reports (e.g. see the description of issue XD-2970 in Figure 1). Hence, our approach employs Long Short-Term Memory (LSTM), a special variant of RNN (see Section 2 for more details of how LSTM works).

Training a network with many stacked layers is notoriously difficult due to two main problems: (i) the number of parameters grows with the number of layers, leading to overfitting; and (ii) stacking many non-linear functions makes it difficult for the information and the gradients to pass through. To address these problems, we designed a deep representation that performs multiple non-linear transformations using the idea from Highway Networks, a recent idea that enables efficient learning through many non-linear layers [50]. A Highway Net is a special type of feedforward neural network with a modification to the transformation taking place at a hidden unit to let information from lower layers pass linearly through. Specifically, the hidden state at layer l is defined as:
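Equation (1) itself did not survive in this copy of the text. As a reference point only, the standard Highway Network layer, which is consistent with the description above (a gated mix of a non-linear transformation and the untouched lower-layer state), is commonly written as:

h_{l+1} = T(h_l) * H(h_l) + (1 - T(h_l)) * h_l

where T(h_l) = sigma(W_T h_l + b_T) is the transform (carry) gate, H(h_l) is the usual non-linear hidden-layer transformation (e.g. tanh(W_H h_l + b_H)), and * denotes element-wise multiplication. In the recurrent setting used by Deep-SE, the same W_T and W_H are shared across all layers l, which is why the number of parameters does not grow with depth.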
Random Guessing, Mean Effort, and Median Effort. Random guessing is a naive benchmark used to assess whether an estimation model is useful [55]. Random guessing performs random sampling (with equal probability) over the set of issues with known story points, chooses randomly one issue from the sample, and uses the story point value of that issue as the estimate of the target issue. Random guessing does not use any information associated with the target issue, so any useful estimation model should outperform it. Mean and Median Effort estimations are commonly used as baseline benchmarks for effort estimation [19]. They use the mean or median story points of the past issues to estimate the story points of the target issue. Note that the samples used for all the naive baselines (i.e. Random Guessing, Mean Effort, and Median Effort) were taken from the training set.

• RQ2. Benefits of deep representation: Does the use of Recurrent Highway Nets provide more accurate story point estimates than using a traditional regression technique?
To answer this question, we replaced the Recurrent Highway Net component with a regressor for immediate prediction. Here, we compare our approach against four common regressors: Random Forests (RF), Support Vector Machine (SVM), Automatically Transformed Linear Model (ATLM), and Linear Regression (LR). We chose RF over other baselines since ensemble methods like RF, which combine the estimates from multiple estimators, are an effective method for effort estimation [20]. RF achieves a significant improvement over the decision tree approach by generating many classification and regression trees, each of which is built on a random resampling of the data, with a random subset of variables at each node split; tree predictions are then aggregated through averaging. We used the issues in the validation set to fine-tune its parameters (i.e. the number of trees, the maximum depth of a tree, and the minimum number of samples per split). SVM has been widely used in software analytics (e.g. defect prediction) and document classification (e.g. sentiment analysis) [56]; for regression problems it is known as Support Vector Regression (SVR). We also used the issues in the validation set to find the kernel type (e.g. linear, polynomial) used for testing. We used the Automatically Transformed Linear Model (ATLM) [57], recently proposed as a baseline model for software effort estimation. Although ATLM is simple and requires no parameter tuning, it performs well over a range of project types in traditional effort estimation [57]. Since LR is the top layer of our approach, we also used LR as the immediate regressor after the LSTM layers to assess whether RHWN improves the predictive performance. We then compare the performance of these alternatives, namely LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR, against our Deep-SE model.

• RQ3. Benefits of LSTM document representation: Does the use of LSTM for modeling issue reports provide more accurate results than the traditional Doc2Vec and Bag-of-Words (BoW) approaches?
The most popular text representation is Bag-of-Words (BoW) [58], where a text is represented as a vector of word counts. For example, the title and description of issue XD-2970 in Figure 1 would be converted into a sparse binary vector of vocabulary size, whose elements are mostly zeros, except for those at the positions designated to "standardize", "XD", "logging" and so on. However, BoW has two major weaknesses: it loses the order of the words and it ignores their semantics. For example, "Python", "Java", and "logging" are equally distant, while semantically "Python" should be closer to "Java" than to "logging". To address this issue, Doc2vec [59] (also known as paragraph2vec) is an unsupervised algorithm that learns fixed-length feature representations from texts (e.g. the title and description of issues); each document is represented by a dense vector which is trained to predict the next words in the document.
Both the BoW and Doc2vec representations, however, effectively destroy the sequential nature of text. This question aims to explore whether LSTM, with its capability of modeling this sequential structure, would improve story point estimation. To answer this question, we feed three different feature vectors, one learned by LSTM and the other two derived from the BoW technique and Doc2vec, into the same Random Forests regressor, and compare the predictive performance of the former (i.e. LSTM+RF) against that of the latter (i.e. BoW+RF and Doc2vec+RF). We used Gensim (https://radimrehurek.com/gensim/models/doc2vec.html), a well-known implementation of Doc2vec, in our experiments.

• RQ4. Cross-project estimation: Is the proposed approach suitable for cross-project estimation?
Story point estimation in new projects is often difficult due to a lack of training data. One common technique to address this issue is training a model using data from a (source) project and applying it to the new (target) project. Since our approach requires only the title and description of issues in the source and target projects, it is readily applicable to both within-project estimation and cross-project estimation. In practice, however, story point estimation is known to be specific to teams and projects. Hence, this question aims to investigate whether our approach is suitable for cross-project estimation. We implemented Analogy-based estimation (ABE0), which was proposed in previous work [60–63] for cross-project estimation, and used it as a benchmark. ABE0 estimation is based on the distances between individual issues. Specifically, the story point of an issue in the target project is the mean of the story points of its k nearest issues from the source project. We used the Euclidean distance as a distance measure, Bag-of-Words of the title and the description as the features of an issue, and k = 3.

• RQ5. Normalizing/adjusting story points: Does our
approach still perform well with normalized/adjusted story points?
We ran our experiments again using new labels (i.e. the normalized story points) to address the concern of whether our approach still performs well on adjusted ground-truths. We adjusted the story points of each issue using a range of information, including the number of days from creation to resolution time, the development time, the number of comments, the number of users who commented on the issue, the number of times that an issue had its attributes changed, the number of users who changed the issue's attributes, the number of issue links, the number of affected versions, and the number of fix versions. This information reflects the actual effort, and we thus refer to these attributes as effort indicators. The values of these indicators were extracted after the issue was completed. The normalized story point (SP_normalized) is then computed as follows:

SP_normalized = 0.5 * SP_original + 0.5 * SP_nearest

where SP_original is the original story point and SP_nearest is the mean of the story points of the 10 nearest issues based on their actual effort indicators. Note that we use K-Nearest Neighbour (KNN) to find the nearest issues and the Euclidean metric to measure the distance (a short illustrative sketch of this normalization is given after this list). We ran the experiment on the new labels (i.e. SP_normalized) using our proposed approach against all the other baseline benchmark methods.

• RQ6. Comparison against the existing approach: How does our approach perform against existing approaches in story point estimation?
Recently, Porru et al. [64] also proposed an estimation model for story points. Their approach uses the type of an issue, the component(s) assigned to it, and the TF-IDF derived from its summary and description as features representing the issue. They also performed univariate feature selection to choose a subset of features for building a classifier. By contrast, our approach automatically learns semantic features which represent the actual meaning of the issue report, thus potentially providing more accurate estimates. To answer this research question, we ran Deep-SE on the dataset used by Porru et al., re-implemented their approach, and compared the results produced by the two approaches.
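As referenced in RQ5 above, the normalization blends each issue's original estimate with the mean estimate of its 10 nearest issues under the effort indicators. The snippet below is a minimal sketch of that procedure, not the authors' code: it assumes the effort indicators have already been assembled into a numeric matrix, and the function name and the use of scikit-learn are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def normalize_story_points(effort_indicators, story_points, k=10):
    """effort_indicators: (n_issues, n_indicators) matrix; story_points: 1-D numpy array."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(effort_indicators)
    _, idx = nn.kneighbors(effort_indicators)           # idx[:, 0] is (typically) the issue itself
    sp_nearest = story_points[idx[:, 1:]].mean(axis=1)  # mean SP of the k nearest issues
    return 0.5 * story_points + 0.5 * sp_nearest        # SP_normalized
```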
5.1 Story point datasets

To collect data for our dataset, we looked for issues that were estimated with story points. JIRA is one of the few widely-used issue tracking systems that support agile development (and thus story point estimation) with its JIRA Agile plugin. Hence, we selected a diverse collection of nine major open source repositories that use the JIRA issue tracking system: Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsstcorp, MuleSoft, Spring, and Talendforge. We then used the Representational State Transfer (REST) API provided by JIRA to query and collect those issue reports. We collected all the issues which were assigned a story point measure from the nine open source repositories up until August 8, 2016, and extracted the story point, title and description from the collected issue reports. Each repository contains a number of projects, and we chose to include in our dataset only projects that had more than 300 issues with story points. Issues that were assigned a story point of zero (e.g. a non-reproducible bug), as well as issues with a negative or unrealistically large story point (e.g. greater than 100), were filtered out. Ultimately, about 2.66% of the collected issues were filtered out in this fashion. In total, our dataset has 23,313 issues with story points from 16 different projects: Apache Mesos (ME), Apache Usergrid (UG), Appcelerator Studio (AS), Aptana Studio (AP), Titanium SDK/CLI (TI), DuraCloud (DC), Bamboo (BB), Clover (CV), JIRA Software (JI), Moodle (MD), Data Management (DM), Mule (MU), Mule Studio (MS), Spring XD (XD), Talend Data Quality (TD), and Talend ESB (TE). Table 1 summarizes the descriptive statistics of all the projects in terms of the minimum, maximum, mean, median, mode, variance, and standard deviation of the story points assigned, and the average length of the title and description of issues in each project. These sixteen projects bring diversity to our dataset in terms of both application domains and project characteristics. Specifically, they differ in the following aspects: number of observations (from 352 to 4,667 issues), technical characteristics (different programming languages and different application domains), sizes (from 88 KLOC to 18 million LOC), and team characteristics (different team structures and participants from different regions).

Since story points rate the relative effort of work between user stories, they are usually measured on a certain scale (e.g. 1, 2, 4, 8, etc.) to facilitate comparison (e.g. one user story is double the effort of another) [25]. The story points used in planning poker typically follow a Fibonacci scale, i.e. 1, 2, 3, 5, 8, 13, 21, and so on [24]. Among the projects we studied, only seven (i.e. Usergrid, Talend ESB, Talend Data Quality, Mule Studio, Mule, Appcelerator Studio, and Aptana Studio) followed the Fibonacci scale, while the other nine projects did not use any scale. When our prediction system gives an estimate, we did not round it to the nearest story point value on the Fibonacci scale. An alternative approach (for those projects which follow a Fibonacci scale) is to treat this as a classification problem: each value on the Fibonacci scale represents a class. The limitations of that approach are that the number of classes must be pre-determined and that it is not applicable to projects that do not follow this scale. We note, however, that the Fibonacci scale is only a guidance for estimating story points. In practice, teams may follow other common scales, define their own scales or not follow any scale at all. Our approach does not rely on these specific scales, thus making it applicable to a wider range of projects. It predicts a scalar value (regression) rather than a class (classification).

5.2 Experimental setting

We performed experiments on the sixteen projects in our dataset – see Table 1 for their details. To mimic a real deployment scenario, in which the prediction for a current issue is made using knowledge from estimations of past issues, the issues in each project were split into training set
TABLE 1
Descriptive statistics of our story point dataset
Repo. Project Abb. # issues min SP max SP mean SP median SP mode SP var SP std SP mean TD length LOC
Apache Mesos ME 1,680 1 40 3.09 3 3 5.87 2.42 181.12 247,542+
Usergrid UG 482 1 8 2.85 3 3 1.97 1.40 108.60 639,110+
Appcelerator Appcelerator Studio AS 2,919 1 40 5.64 5 5 11.07 3.33 124.61 2,941,856#
Aptana Studio AP 829 1 40 8.02 8 8 35.46 5.95 124.61 6,536,521+
Titanium SDK/CLI TI 2,251 1 34 6.32 5 5 25.97 5.10 205.90 882,986+
DuraSpace DuraCloud DC 666 1 16 2.13 1 1 4.12 2.03 70.91 88,978+
Atlassian Bamboo BB 521 1 20 2.42 2 1 4.60 2.14 133.28 6,230,465#
Clover CV 384 1 40 4.59 2 1 42.95 6.55 124.48 890,020#
JIRA Software JI 352 1 20 4.43 3 5 12.35 3.51 114.57 7,070,022#
Moodle Moodle MD 1,166 1 100 15.54 8 5 468.53 21.65 88.86 2,976,645+
Lsstcorp Data Management DM 4,667 1 100 9.57 4 1 275.71 16.61 69.41 125,651*
Mulesoft Mule MU 889 1 21 5.08 5 5 12.24 3.50 81.16 589,212+
Mule Studio MS 732 1 34 6.40 5 5 29.01 5.39 70.99 16,140,452#
Spring Spring XD XD 3,526 1 40 3.70 3 1 10.42 3.23 78.47 107,916+
Talendforge Talend Data Quality TD 1,381 1 40 5.92 5 8 26.96 5.19 104.86 1,753,463#
Talend ESB TE 868 1 13 2.16 2 1 2.24 1.50 128.97 18,571,052#
Total 23,313
SP: story points, TD length: the number of words in the title and description of an issue, LOC: line of code
(+: LOC obtained from www.openhub.net, *: LOC from GitHub, and #: LOC from the reverse engineering)
(60% of the issues), development/validation set (i.e. 20%), and test set (i.e. 20%) based on their creation time. The issues in the training set and the validation set were created before the issues in the test set, and the issues in the training set were also created before the issues in the validation set.

5.3 Performance measures

There is a range of measures used in evaluating the accuracy of an effort estimation model. Most of them are based on the Absolute Error, i.e. |ActualSP - EstimatedSP|, where ActualSP is the real story point assigned to an issue and EstimatedSP is the outcome given by an estimation model. The Mean of Magnitude of Relative Error (MRE), or Mean Percentage Error, and Prediction at level l [65], i.e. Pred(l), have also been used in effort estimation. However, a number of studies [66–69] have found that those measures bias towards underestimation and are not stable when comparing effort estimation models. Thus, the Mean Absolute Error (MAE), Median Absolute Error (MdAE), and Standardized Accuracy (SA) have recently been recommended for comparing the performance of effort estimation models [19, 70]. MAE is defined as:

MAE = (1/N) * sum_{i=1}^{N} |ActualSP_i - EstimatedSP_i|

where N is the number of issues used for evaluating the performance (i.e. the test set), ActualSP_i is the actual story point, and EstimatedSP_i is the estimated story point, for issue i. We also report the Median Absolute Error (MdAE), since it is more robust to large outliers. MdAE is defined as:

against random guessing. Predictive performance can be improved by decreasing MAE or increasing SA.

We assess the story point estimates produced by the estimation models using MAE, MdAE and SA. To compare the performance of two estimation models, we tested the statistical significance of the absolute errors achieved with the two models using the Wilcoxon Signed Rank Test [71]. The Wilcoxon test is a safe test since it makes no assumptions about the underlying data distributions. The null hypothesis here is: "the absolute errors provided by an estimation model are not different to those provided by another estimation model". We set the confidence limit at 0.05 and also applied Bonferroni correction [72] (0.05/K, where K is the number of statistical tests) when multiple tests were performed.

In addition, we also employed a non-parametric effect size measure, the correlated samples case of the Vargha and Delaney A_XY statistic [73], to assess whether the effect size is interesting. The A_XY measure is chosen since it is agnostic to the underlying distribution of the data, and is suitable for assessing randomized algorithms in software engineering generally [74] and effort estimation in particular [19]. Specifically, given a performance measure (e.g. the Absolute Error from each estimation in our case), A_XY measures the probability that estimation model X achieves better results (with respect to the performance measure) than estimation model Y. We note that this falls into the correlated samples case of Vargha and Delaney [73], where the Absolute Error is derived by applying different estimation methods on the same data (i.e. the same issues). We thus use the following formula to calculate the stochastic superiority value between two estimation methods:
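The MdAE and SA formulas, and the pairwise A_XY formula referred to just above, did not survive in this copy of the text. Their standard definitions in the effort-estimation literature [19, 70, 73], which appear to be what the surrounding text relies on, are:

MdAE = median_{i=1..N} |ActualSP_i - EstimatedSP_i|

SA = (1 - MAE / MAE_rguess) x 100, where MAE_rguess is the MAE obtained by a large number of random guesses (so SA expresses how much better a model is than random guessing)

A_XY = P(X beats Y) + 0.5 * P(X ties Y), i.e. the probability that a randomly chosen estimate from model X has a smaller absolute error than one from model Y, counting ties as half.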
against each of the others using the following formula:

A_iu = ( sum_{k != i} A_ik ) / (l - 1)

where A_ik is the pairwise stochastic superiority value (A_XY) for the pair (i, k) of estimation methods, k = 1, ..., l, and l is the number of estimation methods; e.g. variable i refers to Deep-SE and l = 4 when comparing Deep-SE against the Random, Mean and Median methods.

5.4 Hyper-parameter settings for training a Deep-SE model

We focused on tuning two important hyper-parameters: the number of word embedding dimensions and the number of hidden layers.

Fig. 6. Story point estimation performance with different parameters.

The top-500 words in Figure 7 are divided into 9 clusters (using K-means clustering) based on their embedding, which was learned through the pre-training process. We used t-distributed stochastic neighbor embedding (t-SNE) [75] to display the high-dimensional vectors in two dimensions. We show here some representative words from some clusters for a brief illustration. Words that are semantically related are grouped in the same cluster. For example, words related to networking like soap, configuration, tcp, and load are in one cluster. This indicates that, to some extent, the learned vectors effectively capture the semantic relations between words, which is useful for the story-point estimation task we perform later.

Fig. 7. Top-500 word clusters used in the Apache's issue reports

The pre-training step is known to effectively deal with limited labelled data [76–78]. Here, pre-training does not require story-point labels since the model is trained by predicting the next words; hence the number of data points equals the number of words. Since for each project repository we used 50,000 issues for pre-training, we had approximately 5 million data points per repository for pre-training.
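Since the pre-training step described above only requires predicting the next word, it can be sketched as a small language model over the issue text. The code below is an illustrative sketch under that description, not the authors' implementation; the class name, dimensions and use of PyTorch are assumptions.

```python
import torch
import torch.nn as nn

class WordLM(nn.Module):
    """Next-word language model used only to pre-train embedding/LSTM weights;
    no story-point labels are needed for this step."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len) word ids
        states, _ = self.lstm(self.embed(tokens))
        return self.out(states)                 # logits over the vocabulary at each step

def lm_loss(model, tokens):
    # Each position predicts the following word, so every word in the
    # pre-training issues contributes one data point, as described above.
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```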
RQ2: Benefits of deep representation

Table 5 shows the MAE, MdAE, and SA achieved by Deep-SE, which uses Recurrent Highway Networks (RHWN) for deep representation of issue reports, against Random Forests, Support Vector Machine, Automatically Transformed Linear Model, and Linear Regression coupled with LSTM (i.e. LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR). The distribution of the Absolute Error is reported in Appendix A.2. When we use MAE, MdAE, and SA as evaluation criteria, Deep-SE is still the best approach, consistently outperforming LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR across all sixteen projects. Using RHWN improved over RF by between 0.91% (in MU) and 39.45% (in MD) in MAE, 5.88% (in UG) and 71.12% (in CV) in MdAE, and 0.58% (in DC) and 181.58% (in MD) in SA. The improvements of RHWN over SVM are between 1.50% (in TI) and 32.35% (in JI) in MAE, 9.38% (in MD) and 65.52% (in CV) in MdAE, and 1.30% (in TI) and 48.61% (in JI) in SA. In terms of ATLM, RHWN improved over it by between 5.56% (in MS) and 62.44% (in BB) in MAE, 8.70% (in AP) and 67.87% (in CV) in MdAE, and 3.89% (in ME) and 200.59% (in BB) in SA. Overall, averaging across all projects, RHWN improved MAE by 9.63% over SVM, 13.96% over RF, 21.84% over ATLM, and 23.24% over LR.

In addition, the results of the Wilcoxon test comparing our approach (Deep-SE) against LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR are shown in Table 6. The improvement of our approach over LSTM+RF, LSTM+SVM, and LSTM+ATLM is still significant after applying p-value correction, with an effect size greater than 0.5 in 59/64 cases. In most cases, when comparing the proposed model against LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR, the effect sizes are small (between 0.5 and 0.6). A major part of that improvement was brought by our use of the deep learning LSTM architecture to model the textual description of an issue. The use of highway recurrent networks (on top of LSTM) has also improved the predictive performance, though not with as large an effect as the LSTM itself (especially for those projects which have a very small number of issues). However, our approach, Deep-SE, achieved an Aiu greater than 0.6 in most cases.

TABLE 5
Evaluation results of Deep-SE, LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR (the best results are highlighted in bold). MAE and MdAE - the lower the better, SA - the higher the better.

Proj Method    MAE  MdAE SA    | Proj Method    MAE  MdAE SA
ME   Deep-SE   1.02 0.73 59.84 | JI   Deep-SE   1.38 1.09 59.52
     lstm+rf   1.08 0.90 57.57 |      lstm+rf   1.71 1.27 49.71
     lstm+svm  1.07 0.90 58.02 |      lstm+svm  2.04 1.89 40.05
     lstm+atlm 1.08 0.95 57.60 |      lstm+atlm 2.10 1.95 38.26
     lstm+lr   1.10 0.96 56.94 |      lstm+lr   2.10 1.95 38.26
UG   Deep-SE   1.03 0.80 52.66 | MD   Deep-SE   5.97 4.93 50.29
     lstm+rf   1.07 0.85 50.70 |      lstm+rf   9.86 9.69 17.86
     lstm+svm  1.06 1.04 51.23 |      lstm+svm  6.70 5.44 44.19
     lstm+atlm 1.40 1.20 35.55 |      lstm+atlm 9.97 9.61 16.92
     lstm+lr   1.40 1.20 35.55 |      lstm+lr   9.97 9.61 16.92
AS   Deep-SE   1.36 0.58 60.26 | DM   Deep-SE   3.77 2.22 47.87
     lstm+rf   1.62 1.40 52.38 |      lstm+rf   4.51 3.69 37.71
     lstm+svm  1.46 1.42 57.20 |      lstm+svm  4.20 2.87 41.93
     lstm+atlm 1.59 1.30 53.29 |      lstm+atlm 4.70 3.74 35.01
     lstm+lr   1.68 1.46 50.78 |      lstm+lr   5.30 3.66 26.68
AP   Deep-SE   2.71 2.52 42.58 | MU   Deep-SE   2.18 1.96 40.09
     lstm+rf   2.96 2.80 37.34 |      lstm+rf   2.20 2.21 38.73
     lstm+svm  3.06 2.90 35.26 |      lstm+svm  2.28 2.89 37.44
     lstm+atlm 3.06 2.76 35.21 |      lstm+atlm 2.46 2.39 32.51
     lstm+lr   3.75 3.66 20.63 |      lstm+lr   2.46 2.39 32.51
TI   Deep-SE   1.97 1.34 55.92 | MS   Deep-SE   3.23 1.99 17.17
     lstm+rf   2.32 1.97 48.02 |      lstm+rf   3.30 2.77 15.30
     lstm+svm  2.00 2.10 55.20 |      lstm+svm  3.31 3.09 15.10
     lstm+atlm 2.51 2.03 43.87 |      lstm+atlm 3.42 2.75 12.21
     lstm+lr   2.71 2.31 39.32 |      lstm+lr   3.42 2.75 12.21
DC   Deep-SE   0.68 0.53 69.92 | XD   Deep-SE   1.63 1.31 46.82
     lstm+rf   0.69 0.62 69.52 |      lstm+rf   1.81 1.63 40.99
     lstm+svm  0.75 0.90 67.02 |      lstm+svm  1.80 1.77 41.33
     lstm+atlm 0.87 0.59 61.57 |      lstm+atlm 1.83 1.65 40.45
     lstm+lr   0.80 0.67 64.96 |      lstm+lr   1.85 1.72 39.63
BB   Deep-SE   0.74 0.61 71.24 | TD   Deep-SE   2.97 2.92 48.28
     lstm+rf   1.01 1.00 60.95 |      lstm+rf   3.89 4.37 32.14
     lstm+svm  0.81 1.00 68.55 |      lstm+svm  3.49 3.37 39.13
     lstm+atlm 1.97 1.78 23.70 |      lstm+atlm 3.86 4.11 32.71
     lstm+lr   1.26 1.16 51.24 |      lstm+lr   3.79 3.67 33.88
CV   Deep-SE   2.11 0.80 50.45 | TE   Deep-SE   0.64 0.59 69.67
     lstm+rf   3.08 2.77 27.58 |      lstm+rf   0.66 0.65 68.51
     lstm+svm  2.50 2.32 41.22 |      lstm+svm  0.70 0.90 66.61
     lstm+atlm 3.11 2.49 26.90 |      lstm+atlm 0.70 0.72 66.51
     lstm+lr   3.36 2.76 21.07 |      lstm+lr   0.77 0.71 63.20

TABLE 6
Comparison of the Recurrent Highway Net against Random Forests, Support Vector Machine, Automatically Transformed Linear Model, and Linear Regression using the Wilcoxon test and the A12 effect size (in brackets)

Deep-SE vs LSTM+RF       LSTM+SVM      LSTM+ATLM     LSTM+LR       Aiu
ME         <0.001 [0.57] <0.001 [0.54] <0.001 [0.59] <0.001 [0.59] 0.57
UG         0.004 [0.59]  0.010 [0.55]  <0.001 [1.00] <0.001 [0.73] 0.72
AS         <0.001 [0.69] <0.001 [0.51] <0.001 [0.71] <0.001 [0.75] 0.67
AP         <0.001 [0.60] <0.001 [0.52] <0.001 [0.62] <0.001 [0.64] 0.60
TI         <0.001 [0.65] 0.007 [0.51]  <0.001 [0.69] <0.001 [0.71] 0.64
DC         0.406 [0.55]  0.015 [0.60]  <0.001 [0.97] 0.024 [0.58]  0.68
BB         <0.001 [0.73] 0.007 [0.60]  <0.001 [0.84] <0.001 [0.75] 0.73
CV         <0.001 [0.70] 0.140 [0.63]  <0.001 [0.82] 0.001 [0.70]  0.71
JI         0.006 [0.71]  0.001 [0.67]  0.002 [0.89]  <0.001 [0.79] 0.77
MD         <0.001 [0.76] <0.001 [0.57] <0.001 [0.74] <0.001 [0.69] 0.69
DM         <0.001 [0.62] <0.001 [0.56] <0.001 [0.61] <0.001 [0.62] 0.60
MU         0.846 [0.53]  0.005 [0.62]  0.009 [0.67]  0.003 [0.64]  0.62
MS         0.502 [0.53]  0.054 [0.50]  <0.001 [0.82] 0.195 [0.56]  0.60
XD         <0.001 [0.63] <0.001 [0.57] <0.001 [0.65] <0.001 [0.60] 0.61
TD         <0.001 [0.78] <0.001 [0.68] <0.001 [0.70] <0.001 [0.70] 0.72
TE         0.020 [0.53]  0.002 [0.59]  <0.001 [0.66] 0.006 [0.65]  0.61

The proposed approach of using Recurrent Highway Networks is effective in building a deep representation of issue reports and consequently improving story point estimation.

RQ3: Benefits of LSTM document representation

To study the benefits of using LSTM in representing issue reports, we compared the accuracy achieved by Random Forests using the features derived from LSTM against that achieved using the features derived from BoW and Doc2vec. For a fair comparison, we used Random Forests as the regressor in all settings; the results are reported in Table 7 (see the distribution of the Absolute Error in Appendix A.3). LSTM performs better than BoW and Doc2vec with respect to the MAE, MdAE, and SA measures in twelve of the sixteen projects (e.g. ME, UG, and AS). Averaging across all projects, LSTM improved MAE by 4.16% and 11.05% over Doc2vec and BoW, respectively.

Among those twelve projects, LSTM improved over BoW by between 0.30% (in MS) and 28.13% (in DC) in terms of MAE, 1.06% (in AP) and 45.96% (in JI) in terms of MdAE, and 0.67% (in AP) and 47.77% (in TD) in terms of SA. It also improved over Doc2vec by between 0.45% (in MU) and 18.57% (in JI) in terms of MAE, 0.71% (in AS) and 40.65% (in JI) in terms of MdAE, and 2.85% (in TE) and 31.29% (in TD) in terms of SA.

We acknowledge that BoW and Doc2vec perform better than LSTM in some cases. For example, in the Moodle project (MD), D2V+RF performed better than LSTM+RF
in MAE and SA – it achieved 8.02 MAE and 33.19 SA. This could reflect that it is the combination of LSTM and RHWN which significantly improves the accuracy of the estimations. The improvement of LSTM over BoW and Doc2vec is significant after applying Bonferroni correction, with an effect size greater than 0.5 in 24/32 cases and an Aiu greater than 0.5 in all projects (see Table 8).

TABLE 7
Evaluation results of LSTM+RF, BoW+RF, and Doc2vec+RF (the best results are highlighted in bold). MAE and MdAE - the lower the better, SA - the higher the better.

Proj Method  MAE  MdAE SA    | Proj Method  MAE   MdAE  SA
ME   lstm+rf 1.08 0.90 57.57 | JI   lstm+rf 1.71  1.27  49.71
     bow+rf  1.31 1.34 48.66 |      bow+rf  2.10  2.35  38.34
     d2v+rf  1.14 0.98 55.28 |      d2v+rf  2.10  2.14  38.29
UG   lstm+rf 1.07 0.85 50.70 | MD   lstm+rf 9.86  9.69  17.86
     bow+rf  1.19 1.28 45.24 |      bow+rf  10.20 10.22 15.07
     d2v+rf  1.12 0.92 48.47 |      d2v+rf  8.02  9.87  33.19
AS   lstm+rf 1.62 1.40 52.38 | DM   lstm+rf 4.51  3.69  37.71
     bow+rf  1.83 1.53 46.34 |      bow+rf  4.78  3.98  33.84
     d2v+rf  1.62 1.41 52.38 |      d2v+rf  4.71  3.99  34.87
AP   lstm+rf 2.96 2.80 37.34 | MU   lstm+rf 2.20  2.21  38.73
     bow+rf  2.97 2.83 37.09 |      bow+rf  2.31  2.54  36.64
     d2v+rf  3.20 2.91 32.29 |      d2v+rf  2.21  2.69  39.36
TI   lstm+rf 2.32 1.97 48.02 | MS   lstm+rf 3.30  2.77  15.30
     bow+rf  2.58 2.30 42.15 |      bow+rf  3.31  2.57  15.58
     d2v+rf  2.41 2.16 46.02 |      d2v+rf  3.40  2.93  12.79
DC   lstm+rf 0.69 0.62 69.52 | XD   lstm+rf 1.81  1.63  40.99
     bow+rf  0.96 1.11 57.78 |      bow+rf  1.98  1.72  35.56
     d2v+rf  0.77 0.77 66.14 |      d2v+rf  1.88  1.73  38.72
BB   lstm+rf 1.01 1.00 60.95 | TD   lstm+rf 3.89  4.37  32.14
     bow+rf  1.34 1.26 48.06 |      bow+rf  4.49  5.05  21.75
     d2v+rf  1.12 1.16 56.51 |      d2v+rf  4.33  4.80  24.48
CV   lstm+rf 3.08 2.77 27.58 | TE   lstm+rf 0.66  0.65  68.51
     bow+rf  2.98 2.93 29.91 |      bow+rf  0.86  0.69  58.89
     d2v+rf  3.16 2.79 25.70 |      d2v+rf  0.70  0.89  66.61

TABLE 8
Comparison of Random Forests with LSTM, Random Forests with BoW, and Random Forests with Doc2vec using the Wilcoxon test and the A_XY effect size (in brackets)

LSTM vs BoW           Doc2Vec       Aiu
ME      <0.001 [0.70] 0.142 [0.53]  0.62
UG      <0.001 [0.71] 0.135 [0.60]  0.66
AS      <0.001 [0.66] <0.001 [0.51] 0.59
AP      0.093 [0.51]  0.144 [0.52]  0.52
TI      <0.001 [0.67] <0.001 [0.55] 0.61
DC      <0.001 [0.73] 0.008 [0.59]  0.66
BB      <0.001 [0.77] 0.002 [0.66]  0.72
CV      0.109 [0.61]  0.581 [0.57]  0.59
JI      0.009 [0.67]  0.011 [0.62]  0.65
MD      0.022 [0.63]  0.301 [0.51]  0.57
DM      <0.001 [0.60] <0.001 [0.55] 0.58
MU      0.006 [0.59]  0.011 [0.57]  0.58
MS      0.780 [0.54]  0.006 [0.57]  0.56
XD      <0.001 [0.60] 0.005 [0.55]  0.58
TD      <0.001 [0.73] <0.001 [0.67] 0.70
TE      <0.001 [0.69] 0.005 [0.61]  0.65

The proposed LSTM-based approach is effective in automatically learning semantic features representing issue description, which improves story-point estimation.

RQ4: Cross-project estimation

We performed sixteen sets of cross-project estimation experiments to test two settings: (i) within-repository, where both the source and target projects (e.g. Apache Mesos and Apache Usergrid) were from the same repository, and pre-training was done using only the source projects, not the target projects; and (ii) cross-repository, where the source project (e.g. Appcelerator Studio) was in a different repository from the target project (e.g. Apache Usergrid), and pre-training was done using only the source project.

TABLE 9
Mean Absolute Error (MAE) on cross-project estimation and comparison of Deep-SE and ABE0 using the Wilcoxon test and the A_XY effect size (in brackets)

Source Target Deep-SE ABE0 Deep-SE vs ABE0
(i) within-repository
ME     UG     1.07    1.23 <0.001 [0.78]
UG     ME     1.14    1.22 0.012 [0.52]
AS     AP     2.75    3.08 <0.001 [0.67]
AS     TI     1.99    2.56 <0.001 [0.70]
AP     AS     2.85    3.00 0.051 [0.55]
AP     TI     3.41    3.53 0.003 [0.56]
MU     MS     3.14    3.55 0.041 [0.55]
MS     MU     2.31    2.64 0.030 [0.56]
Avg           2.33    2.60
(ii) cross-repository
AS     UG     1.57    2.04 0.004 [0.61]
AS     ME     2.08    2.14 0.022 [0.51]
MD     AP     5.37    6.95 <0.001 [0.58]
MD     TI     6.36    7.10 0.097 [0.54]
MD     AS     5.55    6.77 <0.001 [0.61]
DM     TI     2.67    3.94 <0.001 [0.64]
UG     MS     4.24    4.45 0.005 [0.54]
ME     MU     2.70    2.97 0.015 [0.53]
Avg           3.82    4.55

Table 9 shows the performance of our Deep-SE model and ABE0 for cross-project estimation (see the distribution of the Absolute Error in Appendix A.4). We also used a benchmark of within-project estimation where older issues of the target project were used for training (see Table 3). In all cases, the proposed approach, when used for cross-project estimation, performed worse than when used for within-project estimation (e.g. on average a 20.75% reduction in performance for within-repository and 97.92% for cross-repository). However, our approach outperformed the cross-project baseline (i.e. ABE0) in all cases – it achieved 2.33 and 3.82 MAE in the within- and cross-repository settings, while ABE0 achieved 2.60 and 4.55 MAE. The improvement of our approach over ABE0 is still significant after applying p-value correction, with an effect size greater than 0.5 in 14/16 cases.

These results confirm a universal understanding [25] in agile development that story point estimation is specific to teams and projects. Since story points are measured relatively, it is not uncommon that two different same-sized teams could give different estimates for the same user story. For example, team A may estimate 5 story points for user story UC1 while team B gives 10 story points. However, it does not necessarily mean that team B would do more work for completing UC1 than team A. It more likely means
that team B's scale is twice team A's: for a "baseline" user story that requires one fifth of the effort of UC1, team A would give 1 story point while team B would give 2 story points. Hence, historical estimates are more valuable for within-project estimation, which is demonstrated by this result.

Given the specificity of story points to teams and projects, our proposed approach is more effective for within-project estimation.
RQ5: Adjusted/normalized story points

Table 10 shows the results of Deep-SE and the other baseline methods in predicting the normalized story points. Deep-SE performs well across all projects: it improved MAE between 2.13% and 93.40% over the Mean method, 9.45% to 93.27% over the Median method, 7.02% to 53.33% over LSTM+LR, 1.20% to 61.96% over LSTM+ATLM, 1.20% to 53.33% over LSTM+SVM, 4.00% to 30.00% over Doc2vec+RF, 2.04% to 36.36% over BoW+RF, and 0.86% to 25.80% over LSTM+RF. The best result is obtained in the Usergrid project (UG): 0.07 MAE, 0.01 MdAE, and 93.50 SA. We note, however, that the adjusted story points benefit all methods, since the adjustment narrows the gap between the minimum and maximum values and the distribution of the story points.

Our proposed approach still outperformed the other techniques in estimating the new adjusted story points.

TABLE 10
Evaluation results on the adjusted story points (the best results are highlighted in bold). MAE and MdAE: the lower the better; SA: the higher the better.

Proj  Method      MAE   MdAE     SA  |  Proj  Method      MAE   MdAE     SA
ME    Deep-SE    0.27   0.03  76.58  |  JI    Deep-SE    0.60   0.51  63.20
      lstm+rf    0.34   0.15  70.43  |        lstm+rf    0.74   0.79  54.42
      bow+rf     0.36   0.16  68.82  |        bow+rf     0.66   0.53  58.99
      d2v+rf     0.35   0.15  69.87  |        d2v+rf     0.70   0.53  56.99
      lstm+svm   0.33   0.10  71.20  |        lstm+svm   0.94   0.89  41.97
      lstm+atlm  0.33   0.14  70.97  |        lstm+atlm  0.89   0.89  45.18
      lstm+lr    0.37   0.21  67.68  |        lstm+lr    0.89   0.89  45.18
      mean       1.12   1.07   3.06  |        mean       1.31   1.71  18.95
      median     1.05   1.00   8.87  |        median     1.60   2.00   1.29
UG    Deep-SE    0.07   0.01  93.50  |  MD    Deep-SE    2.56   2.29  31.83
      lstm+rf    0.08   0.00  92.59  |        lstm+rf    3.45   3.55   8.24
      bow+rf     0.11   0.01  90.31  |        bow+rf     3.32   3.27  11.54
      d2v+rf     0.10   0.01  91.22  |        d2v+rf     3.39   3.48   9.70
      lstm+svm   0.15   0.10  86.38  |        lstm+svm   3.12   3.07  16.94
      lstm+atlm  0.15   0.08  86.25  |        lstm+atlm  3.48   3.49   7.41
      lstm+lr    0.15   0.08  86.25  |        lstm+lr    3.57   3.28   4.98
      mean       1.04   0.98   4.79  |        mean       3.60   3.67   4.18
      median     1.06   1.00   2.64  |        median     2.95   3.00  21.48
AS    Deep-SE    0.53   0.20  69.16  |  DM    Deep-SE    2.30   1.43  31.99
      lstm+rf    0.56   0.45  67.49  |        lstm+rf    2.83   2.59  16.23
      bow+rf     0.56   0.49  67.39  |        bow+rf     2.83   2.63  16.33
      d2v+rf     0.56   0.46  67.37  |        d2v+rf     2.92   2.80  13.80
      lstm+svm   0.55   0.32  68.34  |        lstm+svm   2.45   1.78  27.56
      lstm+atlm  0.57   0.46  66.87  |        lstm+atlm  2.83   2.57  16.28
      lstm+lr    0.57   0.49  67.12  |        lstm+lr    2.83   2.57  16.28
      mean       1.18   0.79  31.89  |        mean       3.27   3.41   3.25
      median     1.35   1.00  21.54  |        median     2.61   2.00  22.94
AP    Deep-SE    0.92   0.86  21.95  |  MU    Deep-SE    0.68   0.59  63.83
      lstm+rf    0.99   0.87  16.23  |        lstm+rf    0.70   0.55  63.01
      bow+rf     1.00   0.87  15.33  |        bow+rf     0.70   0.57  62.79
      d2v+rf     0.99   0.86  15.94  |        d2v+rf     0.71   0.57  62.17
      lstm+svm   1.12   0.92   5.26  |        lstm+svm   0.70   0.62  62.62
      lstm+atlm  1.03   0.84  12.63  |        lstm+atlm  0.93   0.74  50.77
      lstm+lr    1.17   1.05   1.14  |        lstm+lr    0.79   0.61  58.00
      mean       1.15   0.64   2.49  |        mean       1.21   1.51  35.86
      median     0.94   1.00  20.29  |        median     1.64   2.00  12.80
TI    Deep-SE    0.59   0.17  56.53  |  MS    Deep-SE    0.86   0.65  56.82
      lstm+rf    0.72   0.56  46.22  |        lstm+rf    0.91   0.76  54.37
      bow+rf     0.73   0.58  46.10  |        bow+rf     0.89   0.93  55.48
      d2v+rf     0.72   0.56  46.17  |        d2v+rf     0.90   0.69  54.66
      lstm+svm   0.73   0.62  45.74  |        lstm+svm   0.94   0.78  52.91
      lstm+atlm  0.73   0.57  45.86  |        lstm+atlm  0.99   0.87  50.45
      lstm+lr    0.73   0.56  45.77  |        lstm+lr    0.99   0.87  50.45
      mean       1.32   1.56   1.57  |        mean       1.23   0.62  38.49
      median     0.86   1.00  36.04  |        median     1.44   1.00  27.83
DC    Deep-SE    0.48   0.48  55.77  |  XD    Deep-SE    0.35   0.08  80.66
      lstm+rf    0.49   0.49  55.02  |        lstm+rf    0.44   0.37  75.78
      bow+rf     0.49   0.48  54.76  |        bow+rf     0.45   0.38  75.33
      d2v+rf     0.50   0.50  53.59  |        d2v+rf     0.45   0.32  75.31
      lstm+svm   0.49   0.43  55.24  |        lstm+svm   0.38   0.20  79.16
      lstm+atlm  0.53   0.47  51.02  |        lstm+atlm  0.92   0.76  49.05
      lstm+lr    0.53   0.47  51.02  |        lstm+lr    0.45   0.40  75.33
      mean       1.07   1.49   1.29  |        mean       1.03   1.28  43.06
      median     0.58   1.00  46.76  |        median     0.75   1.00  58.74
BB    Deep-SE    0.41   0.12  72.00  |  TD    Deep-SE    0.82   0.64  53.36
      lstm+rf    0.43   0.38  70.37  |        lstm+rf    0.84   0.68  52.65
      bow+rf     0.45   0.40  69.33  |        bow+rf     0.88   0.65  50.30
      d2v+rf     0.49   0.45  66.34  |        d2v+rf     0.86   0.70  51.46
      lstm+svm   0.42   0.21  71.21  |        lstm+svm   0.83   0.62  53.24
      lstm+atlm  0.47   0.41  67.53  |        lstm+atlm  0.83   0.58  52.82
      lstm+lr    0.47   0.41  67.53  |        lstm+lr    0.90   0.74  48.88
      mean       1.15   0.76  20.92  |        mean       1.29   1.42  27.20
      median     1.39   1.00   4.50  |        median     0.99   1.00  44.17
CV    Deep-SE    1.15   0.79  23.29  |  TE    Deep-SE    0.40   0.05  74.58
      lstm+rf    1.16   1.05  22.55  |        lstm+rf    0.47   0.46  70.39
      bow+rf     1.22   1.10  18.95  |        bow+rf     0.48   0.48  69.52
      d2v+rf     1.20   1.09  20.30  |        d2v+rf     0.48   0.48  69.41
      lstm+svm   1.22   1.15  18.77  |        lstm+svm   0.45   0.41  71.77
      lstm+atlm  1.47   1.28   2.22  |        lstm+atlm  0.49   0.48  69.14
      lstm+lr    1.47   1.28   2.22  |        lstm+lr    0.49   0.48  69.14
      mean       1.27   1.11  15.18  |        mean       0.99   0.60  37.28
      median     1.29   1.00  13.92  |        median     1.39   1.00  12.09
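The relative improvements quoted for RQ5 follow the usual formula for MAE reduction; a one-line sketch with made-up values (not taken from Table 10) is:

mae_deep_se, mae_baseline = 0.50, 1.00                              # hypothetical MAE values
improvement = (mae_baseline - mae_deep_se) / mae_baseline * 100
print(f"{improvement:.2f}% MAE improvement over the baseline")      # prints 50.00%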
RQ6: Compare Deep-SE against the existing approach

We applied our approach, Deep-SE, and Porru et al.'s approach to their dataset, which consists of eight projects. Table 11 shows the evaluation results in MAE and the comparison of Deep-SE against Porru et al.'s approach; the distribution of the Absolute Error is reported in Appendix A.5. Deep-SE outperforms the existing approach in all cases, improving MAE by between 18.18% (in TIMOB) and 56.48% (in DNN). In addition, the improvement of our approach over Porru et al.'s approach remains significant after applying p-value correction, with an effect size greater than 0.5 in all cases. In particular, a large effect size (ÂXY > 0.7) is obtained in the DNN project.

Our proposed approach outperformed the existing technique using TF-IDF in estimating the story points.

5.8 Training/testing time

Deep learning models are known for taking a long time to train. This is an important factor when considering adopting our approach, especially in an agile development setting. If training takes longer than the duration of a sprint (e.g. one or two weeks), the prediction system would not be useful in practice. We have found that the training time
more research on estimation at the issue or user story level. Our work opens a new research area for the use of software analytics in estimating story points. The assertion demonstrated by our results is that our current method works, and no other method has been demonstrated to work at this scale of over 23,000 data points. Existing work in software effort estimation has dealt with a much smaller number of observations (i.e. data points) than our work did. For example, the China dataset has only 499 data points, Desharnais has 77, and Finnish has 38 (see the datasets for effort estimation in the PROMISE repository); these are commonly used in existing effort estimation work (e.g. [19, 84]). By contrast, in this work we deal with a scale of thousands of data points. Since we make our dataset publicly available, further research (e.g. modeling the codebase and adding team-specific features into the estimation model) can be advanced on this topic, and our current results can serve as the baseline.
Should we adopt deep learning? To the best of our knowledge, our work is the first major research in using deep learning for effort estimation. The use of deep learning has allowed us to automatically learn a good representation of an issue report and use it for estimating the effort of resolving the issue. The evaluation results demonstrate the significant improvement that our deep learning approach has brought in terms of predictive performance. This is a powerful result, since it helps software practitioners move away from the manual feature engineering process. Feature engineering usually relies on domain experts who use their specific knowledge of the data to create features for machine learners to work on. In our approach, features are automatically learned from the textual description of an issue, thus obviating the need to design them manually. We of course need to collect the labels (i.e. the story points assigned to issues) as the ground truth used for learning and testing. Hence, we believe that the wide adoption of software analytics in industry crucially depends on the ability to automatically derive (learn) features from raw software engineering data. In our context of story point estimation, if the number of new words is large, transfer learning is needed, e.g. by using the existing model as a strong prior for the new model. However, this can be mitigated by pre-training on a large corpus so that most of the terms are covered. After pre-training, our model is able to automatically learn semantic relations between words. For example, words related to networking such as "soap", "configuration", "tcp", and "load" are in one cluster (see Figure 7). Hence, even when a user story has several unique terms (that were already pre-trained), retraining the main model is not necessary. Pre-training may however take time and effort. One potential research direction is therefore building up a community for sharing pre-trained networks, which can be used for initialization, thus reducing training times (similar to the Model Zoo [85]). As a first step in this direction, we make our pre-trained models publicly available for the research community.
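To make the idea of sharing pre-trained networks concrete, the following PyTorch sketch (an illustration only, independent of our Theano-based implementation; the class, file name and sizes are hypothetical) shows a pre-trained issue encoder being saved and then reused to initialise a fresh model whose task-specific head is trained separately:

import torch
import torch.nn as nn

class IssueEncoder(nn.Module):
    # A toy stand-in for the document-level LSTM encoder; sizes are illustrative.
    def __init__(self, vocab_size=5000, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        output, _ = self.lstm(self.embed(token_ids))
        return output[:, -1, :]                 # last hidden state as the issue vector

pretrained = IssueEncoder()                     # imagine this was pre-trained on a large corpus
torch.save(pretrained.state_dict(), "issue_encoder_pretrained.pt")   # the shared checkpoint

encoder = IssueEncoder()                        # a new model for a new team or project
encoder.load_state_dict(torch.load("issue_encoder_pretrained.pt"))   # initialise, do not retrain
head = nn.Linear(64, 1)                         # task-specific story-point regressor
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)             # fine-tune only the head here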
We acknowledge that the explainability of a model is important for the full adoption of machine learning techniques. This is not a problem unique to recurrent neural networks (RNNs); it also applies to many powerful modern machine learning techniques (e.g. Random Forests and SVM). However, an RNN is not as much of a black box as it may seem (e.g. see [86]). For example, word importance can be credited using various techniques (e.g., using the gradient of the prediction with respect to the word values). Alternatively, there are model-agnostic techniques to explain any prediction [87]. Even with a partly interpretable RNN, if the prediction is accurate, then we can still expect a high level of adoption.
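As an example of the gradient-based word-importance idea mentioned above, the following PyTorch sketch (again only an illustration, not our implementation; all names are hypothetical) scores each token of an issue description by the norm of the gradient of the predicted story points with respect to that token's embedding:

import torch
import torch.nn as nn

class TinyEstimator(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        emb = self.embed(token_ids)
        emb.retain_grad()                        # keep the gradient for saliency
        output, _ = self.lstm(emb)
        return self.head(output[:, -1, :]), emb

model = TinyEstimator()
tokens = torch.randint(0, 1000, (1, 12))         # a fake 12-token issue description
prediction, emb = model(tokens)
prediction.sum().backward()                      # d(estimate)/d(embeddings)
word_importance = emb.grad.norm(dim=-1).squeeze(0)
print(word_importance)                           # one saliency score per token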
What do the results mean for project managers and developers? Our positive results indicate that it is possible to build a prediction system to support project managers and developers in estimating story points. Our proposal enables teams to be consistent in their estimation of story points. Achieving this consistency is central to effectively leveraging story points for project planning. The machine learner learns from past estimates made by the specific team which it is deployed to assist. The insights that the learner acquires are therefore team-specific. The intent is not to have the machine learner supplant existing agile estimation practices. The intent, instead, is to deploy the machine learner to complement these practices by playing the role of a decision support system. Teams would still meet, discuss user stories and generate estimates as per current practice, but would have the added benefit of access to the insights acquired by the machine learner. Teams would be free to reject the suggestions of the machine learner, as is the case with any decision support system. In every such estimation exercise, the actual estimates generated are recorded as data to be fed to the machine learner, independent of whether these estimates are based on the recommendations of the machine learner or not. This estimation process helps the team not only understand sufficient details about what it will take to resolve those issues, but also stay aligned with their previous estimations.

6 RELATED WORK

Existing estimation methods can generally be classified into three major groups: expert-based, model-based, and hybrid approaches. Expert-based methods rely on human expertise to make estimations, and are the most popular technique in practice [88, 89]. Expert-based estimation, however, tends to require large overheads and the availability of experts each time an estimation needs to be made. Model-based approaches use data from past projects, but they also vary between building customized models and using fixed models. The well-known constructive cost model (COCOMO) [11] is an example of a fixed model where the factors and their relationships are already defined. Such estimation models were built based on data from a range of past projects. Hence, they tend to be suitable only for the kinds of projects that were used to build the model. The customized model building approach requires context-specific data and uses various methods such as regression (e.g. [12, 13]), Neural Networks (e.g. [14, 90]), Fuzzy Logic (e.g. [15]), Bayesian Belief Networks (e.g. [16]), analogy-based methods (e.g. [17, 18]), and multi-objective evolutionary approaches (e.g. [19]).

It is however likely that no single method will be the best performer for all project types [10, 20, 91]. Hence, some recent work (e.g. [20]) proposes to combine the estimates from multiple estimators. Hybrid approaches (e.g. [21, 22])
combine expert judgements with the available data, similarly to the notions of our proposal.

While most existing work focuses on estimating a whole project, little work has been done on building models specifically for agile projects. Today's agile, dynamic and change-driven projects require different approaches to planning and estimating [24]. Some recent approaches leverage machine learning techniques to support effort estimation for agile projects. Recently, the work in [64] proposed an approach which extracts TF-IDF features from issue descriptions to develop a story-point estimation model. A univariate feature selection technique is then applied to the extracted features, which are fed into classifiers (e.g. SVM). In addition, the work in [92] applied COSMIC Function Points (CFP) [93] to estimate the effort for completing an agile project. The work in [94] developed an effort prediction model for iterative software development settings using regression models and neural networks. Differing from traditional effort estimation models, this model is built after each iteration (rather than at the end of a project) to estimate effort for the next iteration. The work in [95] built a Bayesian network model for effort prediction in software projects which adhere to the agile Extreme Programming method. Their model however relies on several parameters (e.g. process effectiveness and process improvement) that require learning and extensive fine tuning. Bayesian networks are also used in [96] to model dependencies between different factors (e.g. how sprint progress and sprint planning quality influence product quality) in Scrum-based software development projects, in order to detect problems in the project. Our work specifically focuses on estimating issues with story points, using deep learning techniques to automatically learn semantic features representing the actual meaning of issue descriptions, which is the key difference from previous work. Previous research (e.g. [97–100]) has also been done on predicting the elapsed time for fixing a bug or the delay risk of resolving an issue. However, effort estimation using story points is the preferred practice in agile development.

LSTM has shown success in many applications such as language modeling [35], speech recognition [36] and video analysis [37]. Our Deep-SE is generic in that it maps text to a numerical score or a class, and can be used for other tasks, e.g. mapping a movie review to a score, assigning scores to essays, or sentiment analysis. Deep learning has recently attracted increasing interest in software engineering. Our previous work [101] proposed a generic deep learning framework based on LSTM for modeling software and its development process. White et al. [102] employed recurrent neural networks (RNNs) to build a language model for source code. Their later work [103] extended these RNN models for detecting code clones. The work in [104] also used RNNs to build a statistical model for code completion. Our recent work [105] used LSTM to build a language model for code and demonstrated the improvement of this model compared to the one using RNNs. Gu et al. [106] used a special RNN Encoder–Decoder, which consists of an encoder RNN to process the input sequence and a decoder RNN with attention to generate the output sequence. This model takes as input a given API-related natural language query and returns API usage sequences. The work in [107] also uses an RNN Encoder–Decoder, but for fixing common errors in C programs. The Deep Belief Network [108] is another common deep learning model which has been used in software engineering, e.g. for building defect prediction models [109, 110].

7 CONCLUSION

In this paper, we have contributed to the research community a dataset for story point estimation, sourced from 16 large and diverse software projects. We have also proposed a deep learning-based, fully end-to-end prediction system for estimating story points, which frees users from manually designing features from the textual description of issues. A key novelty of our approach is the combination of two powerful deep learning architectures: Long Short-Term Memory (to learn a vector representation for issue reports) and Recurrent Highway Network (for building a deep representation).

The proposed approach has consistently outperformed three common baselines and four alternatives according to our evaluation results. Compared against the Mean and Median techniques, the proposed approach has improved MAE by 34.06% and 26.77% respectively, averaging across the 16 projects we studied. Compared against the BoW and Doc2Vec techniques, our approach has improved MAE by 23.68% and 17.90%. These are significant results in the effort estimation literature. A major part of those improvements was brought by our use of the LSTM architecture to model the textual description of an issue. The use of recurrent highway networks (on top of LSTM) has also improved the predictive performance, although not as significantly as the LSTM itself (especially for those projects which have a very small number of issues).

Our future work involves expanding our study to commercial software projects and other large open source projects to further evaluate our proposed method. We also consider performing team analytics (e.g. features characterizing a team) to model team changes over time and feed them into our prediction model. We also plan to investigate how to learn a semantic representation of the codebase and use it as another input to our model. Furthermore, we will look into experimenting with a sliding-window setting to explore incremental learning. In addition, we will also investigate how to best use an issue's metadata (e.g. priority and type) while still maintaining the end-to-end nature of our entire model. Our future work also involves comparing our use of the LSTM model against other state-of-the-art models of natural language such as paragraph2vec [59] or Convolutional Neural Networks [111]. We have discussed (informally) our work with several software developers who have been practising agile and estimating story points. They all agreed that our prediction system could be useful in practice. However, to make such a claim, we need to implement it in a tool and perform a user study. Hence, we would like to empirically evaluate the impact of our prediction system for story point estimation in practice by project managers and/or software developers. This would involve developing the model into a tool (e.g. a JIRA plugin) and then organising trial use in practice. This is an important part of our future work to confirm the ultimate benefits of our approach in general.
overview,” Neural Networks, vol. 61, pp. 85–117, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.neunet.2014.09.003
[52] M. U. Gutmann and A. Hyvärinen, “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics,” Journal of Machine Learning Research, vol. 13, pp. 307–361, Feb. 2012.
[53] The Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.0, 2016. [Online]. Available: http://deeplearning.net/software/theano
[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[55] M. Shepperd and S. MacDonell, “Evaluating prediction systems in software project estimation,” Information and Software Technology, vol. 54, no. 8, pp. 820–827, 2012. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.12.008
[56] R. Moraes, J. F. Valiati, and W. P. Gavião Neto, “Document-level sentiment classification: An empirical comparison between SVM and ANN,” Expert Systems with Applications, vol. 40, no. 2, pp. 621–633, 2013.
[57] P. A. Whigham, C. A. Owen, and S. G. Macdonell, “A Baseline Model for Software Effort Estimation,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 24, no. 3, p. 20, 2015.
[58] P. Tirilly, V. Claveau, and P. Gros, “Language modeling for bag-of-visual words image categorization,” in Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, 2008, pp. 249–258.
[59] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML), vol. 32, 2014, pp. 1188–1196.
[60] E. Kocaguneli and T. Menzies, “Exploiting the Essential Assumptions of Analogy-Based Effort Estimation,” vol. 38, no. 2, pp. 425–438, 2012.
[61] E. Kocaguneli, T. Menzies, and E. Mendes, “Transfer learning in effort estimation,” Empirical Software Engineering, vol. 20, no. 3, pp. 813–843, 2015.
[62] E. Mendes, I. Watson, and C. Triggs, “A Comparative Study of Cost Estimation Models for Web Hypermedia Applications,” Empirical Software Engineering, vol. 8, pp. 163–196, 2003.
[63] Y. F. Li, M. Xie, and T. N. Goh, “A Study of Project Selection and Feature Weighting for Analogy Based Software Cost Estimation,” J. Syst. Softw., vol. 82, no. 2, pp. 241–252, Feb. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2008.06.001
[64] S. Porru, A. Murgia, S. Demeyer, and M. Marchesi, “Estimating Story Points from Issue Reports,” 2016.
[65] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc., 1986.
[66] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, “A simulation study of the model evaluation criterion MMRE,” IEEE Transactions on Software Engineering, vol. 29, no. 11, pp. 985–995, 2003.
[67] B. Kitchenham, L. Pickard, S. MacDonell, and M. Shepperd, “What accuracy statistics really measure,” IEE Proceedings - Software, vol. 148, no. 3, p. 81, 2001.
[68] M. Korte and D. Port, “Confidence in software cost estimation results based on MMRE and PRED,” in Proceedings of the 4th International Workshop on Predictor Models in Software Engineering (PROMISE), 2008, pp. 63–70.
[69] D. Port and M. Korte, “Comparative studies of the model evaluation criterions MMRE and PRED in software cost estimation research,” in Proceedings of the 2nd ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 2008, pp. 51–60.
[70] T. Menzies, E. Kocaguneli, B. Turhan, L. Minku, and F. Peters, Sharing Data and Models in Software Engineering. Morgan Kaufmann, 2014.
[71] K. Muller, “Statistical power analysis for the behavioral sciences,” Technometrics, vol. 31, no. 4, pp. 499–500, 1989.
[72] H. Abdi, “The Bonferroni and Šidák Corrections for Multiple Comparisons,” Encyclopedia of Measurement and Statistics, vol. 1, pp. 1–9, 2007. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.8747&rep=rep1&type=pdf
[73] A. Vargha and H. D. Delaney, “A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000. [Online]. Available: http://jeb.sagepub.com/cgi/doi/10.3102/10769986025002101
[74] A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in Proceedings of the 33rd International Conference on Software Engineering (ICSE), 2011, pp. 1–10.
[75] L. van der Maaten and G. Hinton, “Visualizing high-dimensional data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, Nov. 2008.
[76] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” J. Mach. Learn. Res., vol. 11, pp. 625–660, Mar. 2010.
[77] J. Weston, F. Ratle, and R. Collobert, “Deep learning via semi-supervised embedding,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 1168–1175. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390303
[78] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2078186
[79] D. Zwillinger and S. Kokoska, CRC Standard Probability and Statistics Tables and Formulae. CRC Press, 1999.
[80] J. McCarthy, “From here to human-level AI,” Artificial Intelligence, vol. 171, no. 18, pp. 1174–1182, 2007.
[81] A. Arcuri and L. Briand, “A Hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering,” Software Testing, Verification and Reliability, vol. 24, no. 3, pp. 219–250, 2014.
[82] T. Menzies and M. Shepperd, “Special issue on repeatable results in software engineering prediction,” Empirical Software Engineering, vol. 17, no. 1-2, pp. 1–17, 2012.
[83] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, “The PROMISE Repository of empirical software engineering data,” 2012.
[84] P. L. Braga, A. L. I. Oliveira, and S. R. L. Meira, “Software Effort Estimation using Machine Learning Techniques with Robust Confidence Intervals,” in Proceedings of the 7th International Conference on Hybrid Intelligent Systems (HIS), 2007, pp. 352–357.
[85] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv preprint arXiv:1408.5093, 2014.
[86] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and Understanding Recurrent Networks,” arXiv preprint arXiv:1506.02078, 2015, pp. 1–12. [Online]. Available: http://arxiv.org/abs/1506.02078
[87] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144. [Online]. Available: http://arxiv.org/abs/1602.04938
[88] M. Jorgensen, “A review of studies on expert estimation of software development effort,” Journal of Systems and Software, vol. 70, no. 1-2, pp. 37–60, 2004.
[89] M. Jorgensen and T. M. Gruschke, “The impact of lessons-learned sessions on effort estimation and uncertainty assessments,” IEEE Transactions on Software Engineering, vol. 35, no. 3, pp. 368–383, 2009.
[90] A. Panda, S. M. Satapathy, and S. K. Rath, “Empirical Validation of Neural Network Models for Agile Software Effort Estimation based on Story Points,” Procedia Computer Science, vol. 57, pp. 772–781, 2015.
[91] F. Collopy, “Difficulty and complexity as factors in software effort estimation,” International Journal of Forecasting, vol. 23, no. 3, pp. 469–471, 2007.
[92] C. Commeyne, A. Abran, and R. Djouab, “Effort Estimation with Story Points and COSMIC Function Points - An Industry Case Study,” pp. 25–36, 2008. [Online]. Available: http://cosmic-sizing.org/wp-content/uploads/2016/03/Estimation-model-v-Print-Format-adapter.pdf
[93] C. Group, International Standard ISO/IEC – Software engineering – COSMIC: a functional size measurement method, 2011.
[94] P. Abrahamsson, R. Moser, W. Pedrycz, A. Sillitti, and G. Succi, “Effort prediction in iterative software development processes – incremental versus global prediction models,” 1st International Symposium on Empirical Software Engineering and Measurement (ESEM), 2007.