

A deep learning model for estimating story points

Morakot Choetkiertikul, Hoa Khanh Dam, Truyen Tran, Trang Pham, Aditya Ghose, and Tim Menzies

• M. Choetkiertikul, H. Dam, and A. Ghose are with the School of Computing and Information Technology, Faculty of Engineering and Information Sciences, University of Wollongong, NSW, Australia, 2522. E-mail: {mc650, hoa, aditya}@uow.edu.au
• T. Tran and T. Pham are with the School of Information Technology, Deakin University, Victoria, Australia, 3216. E-mail: {truyen.tran, phtra}@deakin.edu.au
• T. Menzies is with North Carolina State University, USA. E-mail: [email protected]

Abstract—Although there has been substantial research in software analytics for effort estimation in traditional software projects, little
work has been done for estimation in agile projects, especially estimating the effort required for completing user stories or issues. Story
points are the most common unit of measure used for estimating the effort involved in completing a user story or resolving an issue. In
this paper, we propose a prediction model for estimating story points based on a novel combination of two powerful deep learning
architectures: long short-term memory and recurrent highway network. Our prediction system is end-to-end trainable from raw input
data to prediction outcomes without any manual feature engineering. We offer a comprehensive dataset for story points-based
estimation that contains 23,313 issues from 16 open source projects. An empirical evaluation demonstrates that our approach
consistently outperforms three common baselines (Random Guessing, Mean, and Median methods) and six alternatives (e.g. using
Doc2Vec and Random Forests) in Mean Absolute Error, Median Absolute Error, and the Standardized Accuracy.

Index Terms—software analytics, effort estimation, story point estimation, deep learning.

1 INTRODUCTION

Effort estimation is an important part of software project management, particularly for planning and monitoring a software project. Cost and schedule overruns have been a common risk in software projects. McKinsey and the University of Oxford conducted a study of 5,400 large-scale IT projects and found that on average large software projects run 66% over budget and 33% over time [1]. A different study of 1,471 software projects [2] revealed similar findings: one in six software projects has a budget overrun of 200% and a schedule overrun of almost 70%. Activities involving effort estimation form a critical part of planning and managing a software project to ensure it completes on time and within budget [3–5]. Effort estimates may be used by different stakeholders as input for developing project plans, scheduling iteration or release plans, budgeting, and costing [6]. Hence, incorrect estimates may have an adverse impact on project outcomes [3, 7–9]. Research in software effort estimation dates back several decades, and existing methods can generally be divided into model-based methods, expert-based methods, and hybrid methods which combine model-based and expert-based methods [10]. Model-based approaches leverage data from old projects to make predictions about new projects. Expert-based methods rely on human expertise to make such judgements. Most of the existing work (e.g. [11–22]) focuses on the effort required for completing a whole project (as opposed to user stories or issues). These approaches estimate the effort required for developing a complete software system, relying on a set of features manually designed for characterizing a software project.

In modern agile development settings, software is developed through repeated cycles (iterative) and in smaller parts at a time (incremental), allowing for adaptation to changing requirements at any point during a project's life. A project has a number of iterations (e.g. sprints in Scrum [23]). An iteration is usually a short (2–4 weeks) period in which the development team designs, implements, tests and delivers a distinct product increment, e.g. a working milestone version or a working release. Each iteration requires the completion of a number of user stories, which are a common way for agile teams to express user requirements. This is a shift from a model where all functionalities are delivered together (in a single delivery) to a model involving a series of incremental deliveries.

There is thus a need to focus on estimating the effort of completing a single user story at a time rather than the entire project. In fact, it has now become a common practice for agile teams to go through each user story and estimate the effort required for completing it. Story points are commonly used as a unit of effort measure for a user story [24]. Currently, most agile teams rely heavily on experts' subjective assessment (e.g. planning poker, analogy, and expert judgment) to arrive at an estimate. This may lead to inaccuracy and, more importantly, inconsistencies between estimates [25].

To facilitate research in effort estimation for agile development, we have developed a new dataset for story point effort estimation. This dataset contains 23,313 user stories or issues with ground-truth story points. Note that ground-truth story points refer to the actual story points that were assigned to an issue by the team. We collected these issues from 16 large open source projects in 9 repositories, namely Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsstcorp, Mulesoft, Spring, and Talendforge. To the best of our


knowledge, this is the largest dataset (in terms of number of data points) for story point estimation where the focus is at the issue/user story level rather than at the project level as in traditional effort estimation datasets.

We also propose a prediction model which supports a team by recommending a story-point estimate for a given user story. Our model learns from the team's previous story point estimates to predict the size of new issues. This prediction system is intended to be used in conjunction with (instead of as a replacement for) existing estimation techniques practiced by the team. It can be used in a completely automated manner, i.e. the team uses the story points given by the prediction system. Alternatively, it could be used as a decision support system that takes part in the estimation process. This is similar to the notion of combination-based effort estimation, in which estimates come from different sources, e.g. a combination of expert and formal model-based estimates [10]. The key novelty of our approach resides in the combination of two powerful deep learning architectures: long short-term memory (LSTM) and recurrent highway network (RHN). LSTM allows us to model the long-term context in the textual description of an issue, while RHN provides us with a deep representation of that model. We name this approach Deep learning model for Story point Estimation (Deep-SE).

Our Deep-SE model is a fully end-to-end system where raw data signals (i.e. words) are passed from input nodes up to the final output node for estimating story points, and the prediction errors are propagated from the output node all the way back to the word layer. Deep-SE automatically learns semantic features which represent the meaning of user stories or issue reports, thus liberating the users from manually designing and extracting features. Feature engineering usually relies on domain experts who use their specific knowledge of the data to create features for machine learners to work with. For example, the performance of most existing defect prediction models relies heavily on the careful design of good features (e.g. size of code, number of dependencies, cyclomatic complexity, and code churn metrics) which can discriminate between defective and non-defective code [26]. Coming up with good features is difficult, time-consuming, and requires domain-specific knowledge, and hence poses a major challenge. In many situations, manually designed features do not generalize well: features that work well in a certain software project may not perform well in other projects [27]. Bag-of-Words (BoW) is a traditional technique to "engineer" features representing textual data like issue descriptions. However, the BoW approach has two major weaknesses: it ignores the semantics of words, e.g. it fails to recognize the semantic relation between "she" and "he", and it ignores the sequential nature of text. In our approach, features are automatically learned, thus obviating the need for designing them manually.

Although our Deep-SE is a deep model of multiple layers, it is recurrent and thus model parameters are shared across layers. Hence, the number of parameters does not grow with the depth, which consequently avoids overfitting. We also employ a number of common techniques, such as dropout and early stopping, to combat overfitting. Our approach consistently outperforms three common baseline estimators (Random Guessing, Mean, and Median methods) and six alternatives (e.g. using Doc2Vec and Random Forests) in Mean Absolute Error, Median Absolute Error, and the Standardized Accuracy. These claims have also been tested using a non-parametric Wilcoxon test and Vargha and Delaney's statistic to demonstrate the statistical significance and the effect size.

The remainder of this paper is organized as follows. Section 2 provides a background on story point estimation and deep neural networks. We then present the Deep-SE model and explain how it can be trained in Section 3 and Section 4, respectively. Section 5 reports on the experimental evaluation of our approach. Related work is discussed in Section 6 before we conclude and outline future work in Section 7.

2 BACKGROUND

2.1 Story point estimation

When a team estimates with story points, it assigns a point value (i.e. story points) to each user story. A story point estimate reflects the relative amount of effort involved in resolving or completing the user story: a user story that is assigned two story points should take twice as much effort as a user story assigned one story point. Many projects have now adopted this story point estimation approach [25]. Projects that use issue tracking systems (e.g. JIRA [28]) record their user stories as issues. Figure 1 shows an example of issue XD-2970 in the Spring XD project [29] which is recorded in JIRA. An issue typically has a title (e.g. "Standardize XD logging to align with Spring Boot") and a description. Projects that use JIRA Agile also record story points. For example, the issue in Figure 1 has 8 story points.

Fig. 1. An example of an issue with estimated story points

Story points are usually estimated by the whole team within a project. For example, the widely-used Planning Poker [30] method suggests that each team member provides an estimate and a consensus estimate is reached after a few rounds of discussion and (re-)estimation. This practice differs from traditional approaches (e.g. function points) in several aspects. Both story points and function points reflect the effort for resolving an issue. However, function points can be determined by an external estimator based on


a standard set of rules (e.g. counting inputs, outputs, and inquiries) that can be applied consistently by any trained practitioner. On the other hand, story points are developed by a specific team based on the team's cumulative knowledge and biases, and thus may not be useful outside the team (e.g. in comparing performance across teams). Since story points represent the effort required for completing a user story, an estimate should cover the different factors which can affect that effort. These factors include how much work needs to be done, the complexity of the work, and any uncertainty involved in the work [24].

In agile development, user stories or issues are commonly viewed as the first-class entity of a project since they describe what has to be built in the software project, forming the basis for design, implementation and testing. Story point sizes are used for measuring a team's progress rate, prioritizing user stories, planning and scheduling future iterations and releases, and even costing and allocating resources. Story points are also the basis for other effort-related estimation. For example, in our recent work [31], they are used for predicting delivery capability for an ongoing iteration. Specifically, we predict the amount of work delivered at the end of an iteration, relative to the amount of work which the team originally committed to. The amount of work done in an iteration is then quantified in terms of story points from the issues completed within that iteration. To enable such a prediction, we take into account both the information of an iteration and the user stories or issues involved in the iteration. Interactions between user stories, and between user stories and resources, are captured by extracting information related to the dependencies between user stories and the assignment of user stories to developers.

Velocity is the sum of the story-point estimates of the issues that the team resolved during an iteration. For example, if the team resolves four stories each estimated at three story points, their velocity is twelve. Velocity is used for planning and predicting when a software product (or a release) should be completed. For example, if the team estimates the next release to include 100 story points and the team's current velocity is 20 points per 2-week iteration, then it would take 5 iterations (or 10 weeks) to complete the project. Hence, it is important that the team is consistent in their story point estimates to avoid reducing the predictability in planning and managing their project. A machine learner can help the team maintain this consistency, especially in coping with increasingly large numbers of issues. It does so by learning insight from past issues and estimations to make future estimations.

2.2 Long Short-Term Memory

Long Short-Term Memory (LSTM) [32, 33] is a special variant of recurrent neural networks [34]. While a feedforward neural network maps an input vector into an output vector, an LSTM network uses a loop that allows information to persist, and it can map a sequence into a sequence (see Figure 2). Let w_1, ..., w_n be the input sequence (e.g. words in a sentence) and y_1, ..., y_n be the sequence of corresponding labels (e.g. the next words). At time step t, an LSTM unit reads the input w_t, the previous hidden state h_{t-1}, and the previous memory c_{t-1} in order to compute the hidden state h_t. The hidden state is used to produce an output at each step t. For example, the output of predicting the next word k in a sentence would be a vector of probabilities across our vocabulary, i.e. softmax(V_k h_t) where V_k is a row in the output parameter matrix W_out.

Fig. 2. An LSTM network

The most important element of LSTM is a short-term memory cell – a vector that stores accumulated information over time. The information stored in the memory is refreshed at each time step through partially forgetting old, irrelevant information and accepting fresh new input. An LSTM unit uses the forget gate f_t to control how much information from the memory of the previous context (i.e. c_{t-1}) should be removed from the memory cell. The forget gate looks at the previous output state h_{t-1} and the current word w_t, and outputs a number between 0 and 1. A value of 1 indicates that all the past memory is preserved, while a value of 0 means "completely forget everything". The next step is updating the memory with new information obtained from the current word w_t. The input gate i_t is used to control which new information will be stored in the memory. Information stored in the memory cell is used to produce an output h_t. The output gate o_t looks at the current word w_t and the previous hidden state h_{t-1}, and determines which parts of the memory should be output.

Fig. 3. The internal structure of an LSTM unit
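To make the gating mechanism concrete, a single LSTM step can be sketched in code as follows. This is a simplified NumPy sketch for illustration only (not the implementation used in Deep-SE); the weight matrices and biases in the parameter dictionary p are assumed to have been learned already.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(w_t, h_prev, c_prev, p):
        """One LSTM step: read the embedding of the current word w_t, the previous
        output state h_prev and the previous memory c_prev, and return the new state."""
        f_t = sigmoid(p["Wf"] @ w_t + p["Uf"] @ h_prev + p["bf"])    # forget gate: 1 keeps old memory, 0 erases it
        i_t = sigmoid(p["Wi"] @ w_t + p["Ui"] @ h_prev + p["bi"])    # input gate: which new information to store
        o_t = sigmoid(p["Wo"] @ w_t + p["Uo"] @ h_prev + p["bo"])    # output gate: which parts of memory to expose
        c_hat = np.tanh(p["Wc"] @ w_t + p["Uc"] @ h_prev + p["bc"])  # candidate memory from the current word
        c_t = f_t * c_prev + i_t * c_hat                             # refresh the memory cell
        h_t = o_t * np.tanh(c_t)                                     # new hidden (output) state
        return h_t, c_t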


This mechanism allows LSTM to effectively learn long-term dependencies in text. Consider trying to predict the last word in the following text extracted from the description of issue XD-2970 in Figure 1: "Boot uses slf4j APIs backed by logback. This causes some build incompatibilities .... An additional step is to replace log4j with .". Recent information suggests that the next word is probably the name of a logging library, but if we want to narrow down to a specific library, we need to remember that "logback" and "log4j" are logging libraries from the earlier text. There could be a big gap between relevant information and the point where it is needed, but LSTM is capable of learning to connect the information. In fact, LSTM has demonstrated ground-breaking results in many applications such as language models [35], speech recognition [36] and video analysis [37].

The reading of the new input, writing of the output, and the forgetting (i.e. all those gates) are all learnable. As a recurrent network, an LSTM network shares the same parameters across all steps since the same task is performed at each step, just with different inputs. Thus, compared to traditional feedforward networks, using an LSTM network significantly reduces the total number of parameters which we need to learn. An LSTM model is trained using many input sequences with known actual output sequences. Learning is done by minimizing the error between the actual output and the predicted output by adjusting the model parameters. Learning involves computing the gradient of the loss L(θ) during the backpropagation phase, and parameters are updated using stochastic gradient descent. This means that parameters are updated after seeing only a small random subset of sequences. We refer the readers to the seminal paper [32] for more details about LSTM.

3 APPROACH

Our overall research goal is to build a prediction system that takes as input the title and description of an issue and produces a story-point estimate for the issue. Title and description are required information for any issue tracking system. Hence, our prediction system is applicable to a wide range of issue tracking systems, and can be used at any time, even when an issue is created.

We combine the title and description of an issue report into a single text document where the title is followed by the description. Our approach computes vector representations for these documents. These representations are then used as features to predict the story points of each issue. It is important to note that these features are automatically learned from raw text, hence relieving us from manually engineering the features.

Figure 4 shows the Deep learning model for Story point Estimation (Deep-SE) that we have designed for the story point prediction system. It is composed of four components arranged sequentially: (i) word embedding, (ii) document representation using Long Short-Term Memory (LSTM) [32], (iii) deep representation using Recurrent Highway Net (RHWN) [38]; and (iv) differentiable regression. Given a document which consists of a sequence of words s = (w_1, w_2, ..., w_n), e.g. the word sequence (Standardize, XD, logging, to, align, with, ...) in the title and description of issue XD-2970 in Figure 1, we model the document's semantics based on the principle of compositionality: the meaning of a document is determined by the meanings of its constituents (e.g. words) and the rules used to combine them (e.g. one word followed by another). Hence, our approach models document representation in two stages. It first converts each word in a document into a fixed-length vector (i.e. word embedding).

Fig. 4. Deep learning model for Story point Estimation (Deep-SE). The input layer (bottom) is a sequence of words (represented as filled circles). Words are first embedded into a continuous space, then fed into the LSTM layer. The LSTM outputs a sequence of state vectors, which are then pooled to form a document-level vector. This global vector is then fed into a Recurrent Highway Net for multiple transformations (see Eq. (1) for detail). Finally, a regressor predicts an outcome (story point).

These word vectors then serve as an input sequence to the Long Short-Term Memory (LSTM) layer, which computes a vector representation for the whole document. After that, the document vector is fed into the Recurrent Highway Network (RHWN), which transforms the document vector multiple times before outputting a final vector which represents the text. This vector serves as input for the regressor which predicts the output story point. While many existing regressors can be employed, we are mainly interested in regressors that are differentiable with respect to the training signals and the input vector. In our implementation, we use simple linear regression to output the story-point estimate.

Our entire system is trainable from end to end: (a) data signals are passed from the words in issue reports to the final output node; and (b) the prediction error is propagated from the output node all the way back to the word layer.

3.1 Word embedding

We represent each word as a low-dimensional, continuous and real-valued vector, also known as a word embedding. Here we maintain a look-up table, which is a word embedding matrix M ∈ R^{d×|V|} where d is the dimension of the word vector and |V| is the vocabulary size. These word vectors are pre-trained from corpora of issue reports, which will be described in detail in Section 4.1.


3.2 Document representation using LSTM

Since an issue document consists of a sequence of words, we model the document by accumulating information from the start to the end of the sequence. A powerful accumulator is a Recurrent Neural Network (RNN) [34], which can be seen as multiple copies of the same single-hidden-layer network, each passing information to a successor. Thus, recurrent networks allow information to be accumulated. While RNNs are theoretically powerful, they are difficult to train for long sequences [34], which are often seen in issue reports (e.g. see the description of issue XD-2970 in Figure 1). Hence, our approach employs Long Short-Term Memory (LSTM), a special variant of RNN (see Section 2 for more details of how LSTM works).

Fig. 5. An example of how a vector representation is obtained for issue reports

After the vector output state has been computed for every word in the input sequence, the next step is aggregating those vectors into a single vector representing the whole document. The aggregation operation is known as pooling. There are multiple ways to perform pooling, but the main requirement is that pooling must be length invariant; in other words, pooling is not sensitive to the variable length of the document. For example, the simplest statistical pooling method is mean-pooling, where we take the sum of the state vectors and divide it by the number of vectors. Other pooling methods include max pooling (e.g. choosing the maximum value in each dimension), min pooling and sum pooling. From our experience in other settings, a simple but often effective pooling method is averaging, which we also employ here [39].

3.3 Deep representation using Recurrent Highway Network

Given that a vector representation of an issue report has been extracted by the LSTM layer, we could use a differentiable regressor for immediate prediction. However, this may be sub-optimal since the network is rather shallow. Deep neural networks have become a popular method with many ground-breaking successes in vision [40], speech recognition [41] and NLP [42, 43]. Deep nets represent complex data more efficiently than shallow ones [44]. Deep models can be expressive while staying compact, as theoretically analysed by recent work [45–49]. This has been empirically validated by recent record-breaking results in vision, speech recognition and machine translation. However, learning standard feedforward networks with many hidden layers is notoriously difficult due to two main problems: (i) the number of parameters grows with the number of layers, leading to overfitting; and (ii) stacking many non-linear functions makes it difficult for the information and the gradients to pass through.

To address these problems, we designed a deep representation that performs multiple non-linear transformations using the idea from Highway Networks. Highway Nets are a recent idea that enables efficient learning through many non-linear layers [50]. A Highway Net is a special type of feedforward neural network with a modification to the transformation taking place at a hidden unit to let information from lower layers pass linearly through. Specifically, the hidden state at layer l is defined as:

    h_{l+1} = α_l * h_l + (1 − α_l) * σ_l(h_l)    (1)

where σ_l is a non-linear transform (e.g., a logistic or a tanh) and α_l = logit(h_l) is a linear logistic transform of h_l. Here α_l plays the role of a highway gate that lets information pass from layer l to layer l+1 without loss of information. For example, α_l → 1 enables simple copying.

We need to learn a mapping from the raw words in an issue description to the story points. A deep feedforward neural network like a Highway Net effectively breaks the mapping into a series of nested simple mappings, each described by a different layer of the network. The first layer provides a (rough) estimate, and subsequent layers iteratively refine that estimate. As the number of layers increases, further refinement can be achieved. Compared to traditional feedforward networks, the special gating scheme in a Highway Net is highly effective in letting the information and the gradients pass through while stacking many non-linear functions. In fact, earlier work has demonstrated that Highway Nets can have up to a thousand layers [50], while traditional deep neural nets cannot go beyond several layers [51].

We have also modified the standard Highway Network by sharing parameters between layers, i.e. all the hidden layers have the same hidden units. This is similar to the notion of a recurrent network, and thus we call it a Recurrent Highway Network. Our previous work [38] has demonstrated the effectiveness of this approach in pattern recognition. This key novelty allows us to create a very compact version of the Recurrent Highway Network with only one set of parameters in α_l and σ_l, which gives a clear advantage in avoiding overfitting. We note that the number of layers here refers to the number of hidden layers of the Recurrent Highway Network, not the number of LSTM layers. The number of LSTM layers is the same as the number of words in an issue's description.

3.4 Regression

At the top layer of Deep-SE, we employ a linear activation function in a feedforward neural network as the final regressor (see Figure 4) to produce a story-point estimate. This function can be defined as follows:

    y = b_0 + Σ_{i=1}^{n} b_i x_i    (2)

where y is the output story point, x_i is an input signal from the RHWN layer, b_i is a trained coefficient (weight), and n is the size of the embedding dimension.
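Putting the four components together, the forward pass of Deep-SE can be sketched as below. This is a simplified NumPy sketch, not the actual Theano implementation: lstm_step refers to the single-step function sketched in Section 2.2, the word vectors come from the embedding look-up of Section 3.1, and the recurrent highway layers reuse one shared set of weights (Wg, bg, Wt, bt), with tanh chosen here as the non-linear transform σ.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def deep_se_forward(word_vectors, lstm_params, Wg, bg, Wt, bt, w_out, b_out, n_layers=10):
        """word_vectors: list of d-dimensional embeddings of the words in one issue
        (title followed by description). Returns a scalar story-point estimate."""
        d = word_vectors[0].shape[0]
        h, c = np.zeros(d), np.zeros(d)
        states = []
        for w_t in word_vectors:                     # LSTM accumulates context word by word
            h, c = lstm_step(w_t, h, c, lstm_params)
            states.append(h)
        doc = np.mean(states, axis=0)                # mean pooling -> document-level vector
        h_l = doc
        for _ in range(n_layers):                    # recurrent highway layers share one parameter set
            alpha = sigmoid(Wg @ h_l + bg)           # highway gate
            h_l = alpha * h_l + (1.0 - alpha) * np.tanh(Wt @ h_l + bt)   # Eq. (1)
        return float(b_out + w_out @ h_l)            # linear regressor, Eq. (2)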


4 MODEL TRAINING

4.1 Pre-training

Pre-training is a way to come up with a good parameter initialization without using the labels (i.e. ground-truth story points). We pre-train the lower layers of Deep-SE (i.e. embedding and LSTM), which operate at the word level. Pre-training is effective when labels are not abundant. During pre-training, we do not use the ground-truth story points, but instead leverage two sources of information: the strong predictiveness of natural language, and the availability of free text without labels (e.g. issue reports without story points). The first source comes from the property of language that the next word can be predicted using previous words, thanks to grammar and common expressions. Thus, at each time step t, we can predict the next word w_{t+1} from the state h_t using the softmax function:

    P(w_{t+1} = k | w_{1:t}) = exp(U_k h_t) / Σ_{k'} exp(U_{k'} h_t)    (3)

where U_k is a free parameter. Essentially we are building a language model, i.e., P(s) = P(w_{1:n}), which can be factorized using the chain rule as: P(w_1) ∏_{t=1}^{n−1} P(w_{t+1} | w_{1:t}). We note that the probability of the first word, P(w_1), is estimated from the number of sequences in the corpus which start with the word w_1. At step t, h_t is computed by feeding h_{t−1} and w_t to the LSTM unit (see Figure 2). Since w_t is a word embedding vector, Eq. (3) indirectly refers to the embedding matrix M.

The language model can be learned by optimizing the log-loss −log P(s). However, the main bottleneck is computational: Equation (3) costs |V| time to evaluate, where |V| is the vocabulary size, which can be hundreds of thousands for a big corpus. For that reason, we implemented an approximate but very fast alternative based on Noise-Contrastive Estimation [52], which reduces the time to M ≪ |V|, where M can be as small as 100. We also ran the pre-training multiple times against a validation set to choose the best model. We use perplexity, a common intrinsic evaluation metric based on the log-loss, as a criterion for choosing the best model and for early stopping. A smaller perplexity implies a better language model. The word embedding matrix M ∈ R^{d×|V|} (which is first randomly initialized) and the initialization for the LSTM parameters are learned through this pre-training process.

4.2 Training Deep-SE

We have implemented the Deep-SE model in Python using Theano [53]. To simplify our model, we set the size of the memory cell in an LSTM unit and the size of a recurrent layer in RHWN to be the same as the embedding size. We tuned some important hyper-parameters (e.g. embedding size and the number of hidden layers) by conducting experiments with different values, while for some other hyper-parameters we used the default values. This will be discussed in more detail in the evaluation section.

Recall that the entire network can be reduced to a parameterized function which maps sequences of raw words (in issue reports) to story points. Let θ be the set of all parameters in the model. We define a loss function L(θ) that measures the quality of a particular set of parameters based on the difference between the predicted story points and the ground-truth story points in the training data. A setting of the parameters θ that produces a prediction for an issue in the training data consistent with its ground-truth story points would have a very low loss L. Hence, learning is achieved through the optimization process of finding the set of parameters θ that minimizes the loss function.

Since every component in the model is differentiable, we use the popular stochastic gradient descent to perform optimization: through backpropagation, the model parameters θ are updated in the opposite direction of the gradient of the loss function L(θ). In this search, a learning rate η is used to control how large a step we take to reach a (local) minimum. We use RMSprop, an adaptive stochastic gradient method (unpublished note by Geoffrey Hinton), which is known to work best for recurrent models. We tuned RMSprop by partitioning the data into mutually exclusive training, validation, and test sets and running the training multiple times. Specifically, the training set is used to learn a useful model. After each training epoch, the learned model was evaluated on the validation set and its performance was used to assess hyperparameters (e.g. the learning rate in gradient searches). Note that the validation set was not used to learn any of the model's parameters. The best performing model on the validation set was chosen to be evaluated on the test set. We also employed an early stopping strategy (see Section 5.4), i.e. monitoring the model's performance during the validation phase and stopping when the performance got worse: if the log-loss does not improve for ten consecutive runs, we terminate the training.

To prevent overfitting in our neural network, we have implemented an effective solution called dropout in our model [54], where the elements of input and output states are randomly set to zero during training. During testing, parameter averaging is used. In effect, dropout implicitly trains many models in parallel, all of which share the same parameter set. The final model parameters represent the average of the parameters across these models. Typically, the dropout rate is set at 0.5.

An important step prior to optimization is parameter initialization. Typically the parameters are initialized randomly, but our experience shows that a good initialization (through pre-training of the embedding and LSTM layers) helps learning converge faster to good solutions.
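The optimization procedure described in this section (mini-batch updates with RMSprop, validation-based model selection, and stopping after ten epochs without improvement) is illustrated by the toy sketch below. It trains a plain linear model on synthetic data rather than the real Deep-SE network, and it omits dropout; it is only meant to show the shape of the training loop.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    y = X @ rng.normal(size=50) + rng.normal(scale=0.1, size=200)    # synthetic stand-in for issue vectors
    X_tr, y_tr, X_va, y_va = X[:160], y[:160], X[160:], y[160:]

    w = np.zeros(50)
    cache = np.zeros(50)                            # RMSprop running average of squared gradients
    lr, decay, eps, patience = 0.01, 0.9, 1e-8, 10
    best_w, best_loss, bad_epochs = w.copy(), np.inf, 0

    for epoch in range(1000):
        order = rng.permutation(len(X_tr))
        for start in range(0, len(order), 32):      # mini-batches: small random subsets of the training data
            b = order[start:start + 32]
            grad = 2.0 * X_tr[b].T @ (X_tr[b] @ w - y_tr[b]) / len(b)
            cache = decay * cache + (1.0 - decay) * grad ** 2
            w -= lr * grad / (np.sqrt(cache) + eps)  # per-parameter adaptive step (RMSprop)
        val_loss = np.mean(np.abs(X_va @ w - y_va))  # monitor performance on the validation set
        if val_loss < best_loss:
            best_loss, best_w, bad_epochs = val_loss, w.copy(), 0   # keep the best model seen so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:               # early stopping after ten bad epochs
                break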


5 EVALUATION

The empirical evaluation we carried out aimed to answer the following research questions:

• RQ1. Sanity Check: Is the proposed approach suitable for estimating story points?
This sanity check requires us to compare our Deep-SE prediction model with the three common baseline benchmarks used in the context of effort estimation: Random Guessing, Mean Effort, and Median Effort. Random guessing is a naive benchmark used to assess if an estimation model is useful [55]. Random guessing performs random sampling (with equal probability) over the set of issues with known story points, randomly chooses one issue from the sample, and uses the story point value of that issue as the estimate of the target issue. Random guessing does not use any information associated with the target issue; thus any useful estimation model should outperform random guessing. Mean and Median Effort estimations are commonly used as baseline benchmarks for effort estimation [19]. They use the mean or median story points of the past issues to estimate the story points of the target issue. Note that the samples used for all the naive baselines (i.e. Random Guessing, Mean Effort, and Median Effort) were drawn from the training set.

• RQ2. Benefits of deep representation: Does the use of Recurrent Highway Nets provide more accurate story point estimates than using a traditional regression technique?
To answer this question, we replaced the Recurrent Highway Net component with a regressor for immediate prediction. Here, we compare our approach against four common regressors: Random Forests (RF), Support Vector Machine (SVM), Automatically Transformed Linear Model (ATLM), and Linear Regression (LR). We choose RF over other baselines since ensemble methods like RF, which combine the estimates from multiple estimators, are an effective method for effort estimation [20]. RF achieves a significant improvement over the decision tree approach by generating many classification and regression trees, each of which is built on a random resampling of the data, with a random subset of variables at each node split. Tree predictions are then aggregated through averaging. We used the issues in the validation set to fine-tune the parameters (i.e. the number of trees, the maximum depth of a tree, and the minimum number of samples). SVM has been widely used in software analytics (e.g. defect prediction) and document classification (e.g. sentiment analysis) [56]; for regression problems it is known as Support Vector Regression (SVR). We also used the issues in the validation set to find the kernel type (e.g. linear, polynomial) for testing. We used the Automatically Transformed Linear Model (ATLM) [57], recently proposed as a baseline model for software effort estimation. Although ATLM is simple and requires no parameter tuning, it performs well over a range of project types in traditional effort estimation [57]. Since LR is the top layer of our approach, we also used LR as the immediate regressor after the LSTM layers to assess whether RHWN improves the predictive performance. We then compare the performance of these alternatives, namely LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR, against our Deep-SE model.

• RQ3. Benefits of LSTM document representation: Does the use of LSTM for modeling issue reports provide more accurate results than the traditional Doc2Vec and Bag-of-Words (BoW) approaches?
The most popular text representation is Bag-of-Words (BoW) [58], where a text is represented as a vector of word counts. For example, the title and description of issue XD-2970 in Figure 1 would be converted into a sparse binary vector of vocabulary size, whose elements are mostly zeros, except for those at the positions designated to "standardize", "XD", "logging" and so on. However, BoW has two major weaknesses: it loses the sequence of the words and it ignores the semantics of the words. For example, "Python", "Java", and "logging" are equally distant, while semantically "Python" should be closer to "Java" than to "logging". To address this issue, Doc2vec [59] (alternatively known as paragraph2vec) is an unsupervised algorithm that learns fixed-length feature representations from texts (e.g. the title and description of issues). Each document is represented by a dense vector which is trained to predict the next words in the document. Both the BoW and Doc2vec representations, however, effectively destroy the sequential nature of text. This question aims to explore whether LSTM, with its capability of modeling this sequential structure, would improve story point estimation. To answer this question, we feed three different feature vectors, one learned by LSTM and the other two derived from the BoW technique and Doc2vec, to the same Random Forests regressor, and compare the predictive performance of the former (i.e. LSTM+RF) against that of the latter (i.e. BoW+RF and Doc2vec+RF). We used Gensim (https://radimrehurek.com/gensim/models/doc2vec.html), a well-known implementation of Doc2vec, in our experiments.

• RQ4. Cross-project estimation: Is the proposed approach suitable for cross-project estimation?
Story point estimation in new projects is often difficult due to the lack of training data. One common technique to address this issue is training a model using data from a (source) project and applying it to the new (target) project. Since our approach requires only the title and description of issues in the source and target projects, it is readily applicable to both within-project estimation and cross-project estimation. In practice, however, story point estimation is known to be specific to teams and projects. Hence, this question aims to investigate whether our approach is suitable for cross-project estimation. We have implemented the Analogy-based estimation method ABE0, which was proposed in previous work [60–63] for cross-project estimation, and used it as a benchmark. ABE0 estimation is based on the distances between individual issues. Specifically, the story point of an issue in the target project is the mean of the story points of the k nearest issues from the source project. We used the Euclidean distance as a distance measure, Bag-of-Words of the title and the description as the features of an issue, and k = 3.

• RQ5. Normalizing/adjusting story points: Does our


approach still perform well with normalized/adjusted story points?
We ran our experiments again using new labels (i.e. normalized story points) to address the concern of whether our approach still performs well on adjusted ground truths. We adjusted the story points of each issue using a range of information, including the number of days from creation to resolution time, the development time, the number of comments, the number of users who commented on the issue, the number of times that an issue had its attributes changed, the number of users who changed the issue's attributes, the number of issue links, the number of affected versions, and the number of fix versions. These pieces of information reflect the actual effort, and we thus refer to them as effort indicators. The values of these indicators were extracted after the issue was completed. The normalized story point (SP_normalized) is then computed as follows:

    SP_normalized = 0.5 × SP_original + 0.5 × SP_nearest

where SP_original is the original story point, and SP_nearest is the mean of the story points of the 10 nearest issues based on their actual effort indicators. Note that we use K-Nearest Neighbour (KNN) to find the nearest issues and the Euclidean metric to measure the distance. We ran the experiment on the new labels (i.e. SP_normalized) using our proposed approach against all other baseline benchmark methods.

• RQ6. Comparison against the existing approach: How does our approach perform against existing approaches in story point estimation?
Recently, Porru et al. [64] also proposed an estimation model for story points. Their approach uses the type of an issue, the component(s) assigned to it, and the TF-IDF derived from its summary and description as features representing the issue. They also performed univariate feature selection to choose a subset of features for building a classifier. By contrast, our approach automatically learns semantic features which represent the actual meaning of the issue's report, thus potentially providing more accurate estimates. To answer this research question, we ran Deep-SE on the dataset used by Porru et al., re-implemented their approach, and compared the results produced by the two approaches.

5.1 Story point datasets

To collect data for our dataset, we looked for issues that were estimated with story points. JIRA is one of the few widely-used issue tracking systems that support agile development (and thus story point estimation) with its JIRA Agile plugin. Hence, we selected a diverse collection of nine major open source repositories that use the JIRA issue tracking system: Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsstcorp, MuleSoft, Spring, and Talendforge. We then used the Representational State Transfer (REST) API provided by JIRA to query and collect those issue reports. We collected all the issues which were assigned a story point measure from the nine open source repositories up until August 8, 2016. We then extracted the story point, title and description from the collected issue reports. Each repository contains a number of projects, and we chose to include in our dataset only projects that had more than 300 issues with story points. Issues that were assigned a story point of zero (e.g., a non-reproducible bug), as well as issues with a negative or unrealistically large story point (e.g. greater than 100), were filtered out. Ultimately, about 2.66% of the collected issues were filtered out in this fashion. In total, our dataset has 23,313 issues with story points from 16 different projects: Apache Mesos (ME), Apache Usergrid (UG), Appcelerator Studio (AS), Aptana Studio (AP), Titanium SDK/CLI (TI), DuraCloud (DC), Bamboo (BB), Clover (CV), JIRA Software (JI), Moodle (MD), Data Management (DM), Mule (MU), Mule Studio (MS), Spring XD (XD), Talend Data Quality (TD), and Talend ESB (TE). Table 1 summarizes the descriptive statistics of all the projects in terms of the minimum, maximum, mean, median, mode, variance, and standard deviation of the story points assigned, and the average length of the title and description of issues in each project. These sixteen projects bring diversity to our dataset in terms of both application domains and project characteristics. Specifically, they differ in the following aspects: number of observations (from 352 to 4,667 issues), technical characteristics (different programming languages and different application domains), sizes (from 88 KLOC to 18 million LOC), and team characteristics (different team structures and participants from different regions).

Since story points rate the relative effort of work between user stories, they are usually measured on a certain scale (e.g. 1, 2, 4, 8, etc.) to facilitate comparison (e.g. one user story being double the effort of another) [25]. The story points used in planning poker typically follow a Fibonacci scale, i.e. 1, 2, 3, 5, 8, 13, 21, and so on [24]. Among the projects we studied, only seven (i.e. Usergrid, Talend ESB, Talend Data Quality, Mule Studio, Mule, Appcelerator Studio, and Aptana Studio) followed the Fibonacci scale, while the other nine projects did not use any scale. When our prediction system gives an estimate, we did not round it to the nearest story point value on the Fibonacci scale. An alternative approach (for those projects which follow a Fibonacci scale) is to treat this as a classification problem: each value on the Fibonacci scale represents a class. The limitations of this approach are that the number of classes must be pre-determined and that it is not applicable to projects that do not follow this scale. We however note that the Fibonacci scale is only a guidance for estimating story points. In practice, teams may follow other common scales, define their own scales, or not follow any scale at all. Our approach does not rely on these specific scales, thus making it applicable to a wider range of projects. It predicts a scalar value (regression) rather than a class (classification).


TABLE 1
Descriptive statistics of our story point dataset

Repo. Project Abb. # issues min SP max SP mean SP median SP mode SP var SP std SP mean TD length LOC
Apache Mesos ME 1,680 1 40 3.09 3 3 5.87 2.42 181.12 247,542+
Usergrid UG 482 1 8 2.85 3 3 1.97 1.40 108.60 639,110+
Appcelerator Appcelerator Studio AS 2,919 1 40 5.64 5 5 11.07 3.33 124.61 2,941,856#
Aptana Studio AP 829 1 40 8.02 8 8 35.46 5.95 124.61 6,536,521+
Titanium SDK/CLI TI 2,251 1 34 6.32 5 5 25.97 5.10 205.90 882,986+
DuraSpace DuraCloud DC 666 1 16 2.13 1 1 4.12 2.03 70.91 88,978+
Atlassian Bamboo BB 521 1 20 2.42 2 1 4.60 2.14 133.28 6,230,465#
Clover CV 384 1 40 4.59 2 1 42.95 6.55 124.48 890,020#
JIRA Software JI 352 1 20 4.43 3 5 12.35 3.51 114.57 7,070,022#
Moodle Moodle MD 1,166 1 100 15.54 8 5 468.53 21.65 88.86 2,976,645+
Lsstcorp Data Management DM 4,667 1 100 9.57 4 1 275.71 16.61 69.41 125,651*
Mulesoft Mule MU 889 1 21 5.08 5 5 12.24 3.50 81.16 589,212+
Mule Studio MS 732 1 34 6.40 5 5 29.01 5.39 70.99 16,140,452#
Spring Spring XD XD 3,526 1 40 3.70 3 1 10.42 3.23 78.47 107,916+
Talendforge Talend Data Quality TD 1,381 1 40 5.92 5 8 26.96 5.19 104.86 1,753,463#
Talend ESB TE 868 1 13 2.16 2 1 2.24 1.50 128.97 18,571,052#
Total 23,313
SP: story points, TD length: the number of words in the title and description of an issue, LOC: line of code
(+: LOC obtained from www.openhub.net, *: LOC from GitHub, and #: LOC from the reverse engineering)

5.2 Experimental setting

We performed experiments on the sixteen projects in our dataset – see Table 1 for their details. To mimic a real deployment scenario, in which the prediction for a current issue is made using knowledge from estimations of past issues, the issues in each project were split into a training set (60% of the issues), a development/validation set (20%), and a test set (20%) based on their creation time. The issues in the training set and the validation set were created before the issues in the test set, and the issues in the training set were also created before the issues in the validation set.

5.3 Performance measures

A range of measures is used in evaluating the accuracy of an effort estimation model. Most of them are based on the Absolute Error, i.e. |ActualSP − EstimatedSP|, where ActualSP is the real story points assigned to an issue and EstimatedSP is the outcome given by an estimation model. The Mean of Magnitude of Relative Error (MRE), or Mean Percentage Error, and Prediction at level l [65], i.e. Pred(l), have also been used in effort estimation. However, a number of studies [66–69] have found that those measures are biased towards underestimation and are not stable when comparing effort estimation models. Thus, the Mean Absolute Error (MAE), Median Absolute Error (MdAE), and the Standardized Accuracy (SA) have recently been recommended for comparing the performance of effort estimation models [19, 70]. MAE is defined as:

    MAE = (1/N) Σ_{i=1}^{N} |ActualSP_i − EstimatedSP_i|

where N is the number of issues used for evaluating the performance (i.e. the test set), ActualSP_i is the actual story point, and EstimatedSP_i is the estimated story point, for issue i.

We also report the Median Absolute Error (MdAE) since it is more robust to large outliers. MdAE is defined as:

    MdAE = Median{|ActualSP_i − EstimatedSP_i|}

where 1 ≤ i ≤ N.

SA is based on MAE and is defined as:

    SA = (1 − MAE / MAE_rguess) × 100

where MAE_rguess is the MAE of a large number (e.g. 1,000 runs) of random guesses. SA measures the comparison against random guessing. Predictive performance can be improved by decreasing MAE or increasing SA.

We assess the story point estimates produced by the estimation models using MAE, MdAE and SA. To compare the performance of two estimation models, we tested the statistical significance of the absolute errors achieved by the two models using the Wilcoxon Signed Rank Test [71]. The Wilcoxon test is a safe test since it makes no assumptions about the underlying data distributions. The null hypothesis here is: "the absolute errors provided by an estimation model are not different from those provided by another estimation model". We set the confidence limit at 0.05 and also applied Bonferroni correction [72] (0.05/K, where K is the number of statistical tests) when multiple tests were performed.

In addition, we also employed a non-parametric effect size measure, the correlated samples case of Vargha and Delaney's Â_XY statistic [73], to assess whether the effect size is interesting. The Â_XY measure is chosen since it is agnostic to the underlying distribution of the data, and is suitable for assessing randomized algorithms in software engineering generally [74] and effort estimation in particular [19]. Specifically, given a performance measure (e.g. the Absolute Error from each estimation in our case), Â_XY measures the probability that estimation model X achieves better results (with respect to the performance measure) than estimation model Y. We note that this falls into the correlated samples case of Vargha and Delaney [73], where the Absolute Error is derived by applying different estimation methods to the same data (i.e. the same issues). We thus use the following formula to calculate the stochastic superiority value between two estimation methods:

    Â_XY = [#(X < Y) + 0.5 × #(X = Y)] / n

where #(X < Y) is the number of issues for which the Absolute Error from X is less than that from Y, #(X = Y) is the number of issues for which the Absolute Error from X equals that from Y, and n is the number of issues.

We assess the story point estimates produced by the estimation models using MAE, MdAE and SA. To compare the performance of two estimation models, we tested the statistical significance of the absolute errors achieved with the two models using the Wilcoxon Signed Rank Test [71]. The Wilcoxon test is a safe test since it makes no assumptions about underlying data distributions. The null hypothesis here is: "the absolute errors provided by an estimation model are not different to those provided by another estimation model". We set the confidence limit at 0.05 and also applied Bonferroni correction [72] (0.05/K, where K is the number of statistical tests) when multiple tests were performed.

In addition, we also employed a non-parametric effect size measure, the correlated samples case of Vargha and Delaney's \hat{A}_{XY} statistic [73], to assess whether the effect size is interesting. The \hat{A}_{XY} measure is chosen since it is agnostic to the underlying distribution of the data, and is suitable for assessing randomized algorithms in software engineering generally [74] and effort estimation in particular [19]. Specifically, given a performance measure (e.g. the Absolute Error from each estimation in our case), \hat{A}_{XY} measures the probability that estimation model X achieves better results (with respect to the performance measure) than estimation model Y. We note that this falls into the correlated samples case of Vargha and Delaney [73], where the Absolute Error is derived by applying different estimation methods on the same data (i.e. the same issues). We thus use the following formula to calculate the stochastic superiority value between two estimation methods:

\hat{A}_{XY} = \frac{\#(X < Y) + 0.5 \times \#(X = Y)}{n}

where #(X < Y) is the number of issues for which the Absolute Error from X is less than that from Y, #(X = Y) is the number of issues for which the Absolute Error from X equals that from Y, and n is the number of issues. We also compute the average of the stochastic superiority measures (A_{iu}) of our approach against each of the others using the following formula:

A_{iu} = \frac{\sum_{k \neq i} A_{ik}}{l - 1}

where A_{ik} is the pairwise stochastic superiority value (\hat{A}_{XY}) for all (i, k) pairs of estimation methods, k = 1, ..., l, and l is the number of estimation methods; e.g. variable i refers to Deep-SE and l = 4 when comparing Deep-SE against the Random, Mean and Median methods.
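As an illustration, both statistics can be computed from paired absolute errors. The sketch below (Python with NumPy; ae_x, ae_y and ae_others are our illustrative names for the arrays of Absolute Errors obtained on the same issues) is not part of the replication package:

import numpy as np

def a_xy(ae_x, ae_y):
    # Stochastic superiority of model X over model Y: the probability that
    # X yields a smaller Absolute Error than Y on the same issues
    # (ties counted with weight 0.5).
    ae_x, ae_y = np.asarray(ae_x), np.asarray(ae_y)
    return (np.sum(ae_x < ae_y) + 0.5 * np.sum(ae_x == ae_y)) / len(ae_x)

def a_iu(ae_i, ae_others):
    # Average stochastic superiority of model i against the l-1 other models.
    return np.mean([a_xy(ae_i, ae_k) for ae_k in ae_others])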
are in one cluster. This indicates that to some extent, the
learned vectors effectively capture the semantic relations be-
5.4 Hyper-parameter settings for training a Deep-SE
tween words, which is useful for the story-point estimation
model
task we do later.
We focused on tuning two important hyper-parameters: the
number of word embedding dimensions and the number of 15

hidden layers in the recurrent highway net component of


our model. To do so, we fixed one parameter and varied
10
the other to observe the MAE performance. We chose to test string, method, set,
with four different embedding sizes: 10, 50, 100, and 200, return, thread
and twelve variations of the number of hidden layers from 5
2 to 200. The embedding size is the number of dimensions of
the vector which represents a word. This word embedding client, service, queue,
0 session
host,
is a low dimensional vector representation of words in the
-15 -10 -5 0 5 10 15
vocabulary. This tuning was done using the validation set.
Figure 6 shows the results from experimenting with Apache -5 localhost, uri, xml,
Mesos. As can be seen, the setting where the number of http, schema
java, c, implementation,
embeddings is 50 and the number of hidden layers is 10 data, jvm
gives the lowest MAE, and thus was chosen. -10
soap, configuration,
tcp, load
2.4 -15
2.2 C1 C2 C3 C4 C5 C6 C7 C8 C9
2
MAE

1.8 Fig. 7. Top-500 word clusters used in the Apache’s issue reports
1.6
1.4 The pre-training step is known to effectively deal with
1.2 limited labelled data [76–78]. Here, pre-training does not
1 require story-point labels since it is trained by predicting
2 3 5 10 20 30 40 50 60 80 100 200 the next words. Hence the number of data points equals to
Number of hidden layers
DIM10 DIM50 DIM100 DIM200
the number of words. Since for each project repository we
used 50,000 issues for pre-training, we had approximately 5
Fig. 6. Story point estimation performance with different parameter. million data points per repository for pre-training.

For pre-training, we trained with 100 runs and the batch size was 50. The initial learning rate in pre-training was set to 0.02, the adaptation rate was 0.99, and the smoothing factor was 10^-7. For the main Deep-SE model we used 1,000 epochs and the batch size was set to 100. The initial learning rate in the main model was set to 0.01, the adaptation rate was 0.9, and the smoothing factor was 10^-6. Dropout rates for the RHWN and LSTM layers were set to 0.5 and 0.2 respectively. The maximum sequence length used by the LSTM is 100 words, which is the average length of an issue description.
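The tuning procedure described above can be summarised as a simple grid search over the two varied hyper-parameters with the remaining settings fixed. The sketch below is an illustration only, not the actual Deep-SE training script; train_and_validate is a hypothetical stand-in for training the model with a given configuration and returning its validation MAE:

def train_and_validate(**config):
    # Placeholder: train Deep-SE with `config` and return the MAE obtained
    # on the validation set (a dummy value is returned so the sketch runs).
    return 0.0

embedding_sizes = [10, 50, 100, 200]
hidden_layer_counts = [2, 3, 5, 10, 20, 30, 40, 50, 60, 80, 100, 200]
fixed_settings = {
    "epochs": 1000, "batch_size": 100,                  # main model training
    "learning_rate": 0.01, "adaptation_rate": 0.9, "smoothing": 1e-6,
    "dropout_rhwn": 0.5, "dropout_lstm": 0.2,
    "max_sequence_length": 100,                         # words per description
}

best = None
for dim in embedding_sizes:
    for layers in hidden_layer_counts:
        val_mae = train_and_validate(embedding_dim=dim, rhwn_layers=layers,
                                     **fixed_settings)
        if best is None or val_mae < best[0]:
            best = (val_mae, dim, layers)
# For Apache Mesos this search selects 50 embedding dimensions and 10 layers.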
5.5 Pre-training

In most repositories, we used around 50,000 issues without story points (i.e. without labels) for pre-training, except the Mulesoft repository, which has a much smaller number of issues (only 8,036 issues) available for pre-training. Figure 7 shows the top-500 frequent words used in Apache. They are divided into 9 clusters (using K-means clustering) based on their embedding, which was learned through the pre-training process. We used t-distributed stochastic neighbor embedding (t-SNE) [75] to display the high-dimensional vectors in two dimensions.

We show here some representative words from some clusters for a brief illustration. Words that are semantically related are grouped in the same cluster. For example, words related to networking like soap, configuration, tcp, and load are in one cluster. This indicates that, to some extent, the learned vectors effectively capture the semantic relations between words, which is useful for the story-point estimation task we do later.

[Figure 7 omitted: two-dimensional t-SNE projection of the top-500 words, grouped into clusters C1 to C9; example clusters include {string, method, set, return, thread}, {client, service, queue, host, session}, {localhost, uri, xml, http, schema}, {java, c, implementation, data, jvm}, and {soap, configuration, tcp, load}.]
Fig. 7. Top-500 word clusters used in the Apache's issue reports

The pre-training step is known to effectively deal with limited labelled data [76–78]. Here, pre-training does not require story-point labels since it is trained by predicting the next words. Hence the number of data points equals the number of words. Since for each project repository we used 50,000 issues for pre-training, we had approximately 5 million data points per repository for pre-training.
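The cluster analysis behind Figure 7 can be reproduced with standard tooling. The following sketch (Python with scikit-learn) is an illustration under the assumption that word_embeddings holds the learned vectors of the top-500 words and words holds the corresponding tokens; both are stand-in names, and random data is used here so the sketch runs:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

word_embeddings = np.random.rand(500, 50)       # stand-in for learned embeddings
words = ["word%d" % i for i in range(500)]      # stand-in for the top-500 tokens

clusters = KMeans(n_clusters=9, random_state=0).fit_predict(word_embeddings)
coords = TSNE(n_components=2, random_state=0).fit_transform(word_embeddings)

# Words falling into the same cluster (e.g. soap, configuration, tcp, load)
# can then be plotted at their 2-D t-SNE coordinates, as in Figure 7.
for c in range(9):
    members = [w for w, k in zip(words, clusters) if k == c]
    print("C%d:" % (c + 1), ", ".join(members[:5]))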

5.6 The correlation between the story points and the development time

Identifying the actual effort required for completing an issue is very challenging (especially in open source projects) since in most cases the actual effort was not tracked and recorded. We were however able to extract the development time, which is the duration between when the issue's status was set to "in-progress" and when it was set to "resolved". Thus, we have explicitly excluded the waiting time for being assigned to a developer or being put on hold. The development time is the closest to the actual effort of completing the issue that we were able to extract from the data. We then performed two widely-used statistical tests (Spearman's rank and Pearson rank correlation) [79] for all the issues in our dataset. Table 2 shows the Spearman's rank and Pearson rank correlation coefficients and p-values for all projects. We have found that there is a significantly (p < 0.05) positive correlation between the story points and the development time across all 16 projects we studied. In some projects (e.g. Moodle) there was a strong correlation, with coefficients of around 0.8. This positive correlation demonstrates that the higher the story points, the longer the development time, which suggests a correlation between an issue's story points and its actual effort.

TABLE 2
The coefficient and p-value of the Spearman's rank and Pearson rank correlation on the story points against the development time

Project | Spearman's rank coefficient | p-value | Pearson correlation coefficient | p-value
Appcelerator Studio | 0.330 | <0.001 | 0.311 | <0.001
Aptana Studio | 0.241 | <0.001 | 0.325 | <0.001
Bamboo | 0.505 | <0.001 | 0.476 | <0.001
Clover | 0.551 | <0.001 | 0.418 | <0.001
Data Management | 0.753 | <0.001 | 0.769 | <0.001
DuraCloud | 0.225 | <0.001 | 0.393 | <0.001
JIRA Software | 0.512 | <0.001 | 0.560 | <0.001
Mesos | 0.615 | <0.001 | 0.766 | <0.001
Moodle | 0.791 | <0.001 | 0.816 | <0.001
Mule | 0.711 | <0.001 | 0.722 | <0.001
Mule Studio | 0.630 | <0.001 | 0.565 | <0.001
Spring XD | 0.486 | <0.001 | 0.614 | <0.001
Talend Data Quality | 0.390 | <0.001 | 0.370 | <0.001
Talend ESB | 0.504 | <0.001 | 0.524 | <0.001
Titanium SDK/CLI | 0.322 | <0.001 | 0.305 | <0.001
Usergrid | 0.212 | 0.005 | 0.263 | 0.001
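The per-project coefficients in Table 2 follow directly from the two standard tests. For illustration (SciPy; story_points and dev_hours are our illustrative names for the paired arrays of one project, and the values shown are made up):

from scipy import stats
import numpy as np

story_points = np.array([1, 2, 3, 5, 8, 13, 5, 3])       # illustrative values only
dev_hours    = np.array([4, 7, 10, 20, 30, 55, 18, 12])  # illustrative values only

rho, rho_p = stats.spearmanr(story_points, dev_hours)
r, r_p = stats.pearsonr(story_points, dev_hours)
print("Spearman: %.3f (p=%.3g), Pearson: %.3f (p=%.3g)" % (rho, rho_p, r, r_p))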
5.7 Results

We report here the results in answering research questions RQs 1–6.

RQ1: Sanity check

Table 3 shows the results achieved from Deep-SE and two baseline methods: the Mean and Median methods (see Appendix A.1 for the distribution of the Absolute Error).

TABLE 3
Evaluation results of Deep-SE, the Mean and Median method (the best results are highlighted in bold). MAE and MdAE - the lower the better, SA - the higher the better.

Proj Method MAE MdAE SA | Proj Method MAE MdAE SA
ME Deep-SE 1.02 0.73 59.84 | JI Deep-SE 1.38 1.09 59.52
   mean 1.64 1.78 35.61 |    mean 2.48 2.15 27.06
   median 1.73 2.00 32.01 |    median 2.93 2.00 13.88
UG Deep-SE 1.03 0.80 52.66 | MD Deep-SE 5.97 4.93 50.29
   mean 1.48 1.23 32.13 |    mean 10.90 12.11 9.16
   median 1.60 1.00 26.29 |    median 7.18 6.00 40.16
AS Deep-SE 1.36 0.58 60.26 | DM Deep-SE 3.77 2.22 47.87
   mean 2.08 1.52 39.02 |    mean 5.29 4.55 26.85
   median 1.84 1.00 46.17 |    median 4.82 3.00 33.38
AP Deep-SE 2.71 2.52 42.58 | MU Deep-SE 2.18 1.96 40.09
   mean 3.15 3.46 33.30 |    mean 2.59 2.22 28.82
   median 3.71 4.00 21.54 |    median 2.69 2.00 26.07
TI Deep-SE 1.97 1.34 55.92 | MS Deep-SE 3.23 1.99 17.17
   mean 3.05 1.97 31.59 |    mean 3.34 2.68 14.21
   median 2.47 2.00 44.65 |    median 3.30 2.00 15.42
DC Deep-SE 0.68 0.53 69.92 | XD Deep-SE 1.63 1.31 46.82
   mean 1.30 1.14 42.88 |    mean 2.27 2.53 26.00
   median 0.73 1.00 68.08 |    median 2.07 2.00 32.55
BB Deep-SE 0.74 0.61 71.24 | TD Deep-SE 2.97 2.92 48.28
   mean 1.75 1.31 32.11 |    mean 4.81 5.08 16.18
   median 1.32 1.00 48.72 |    median 3.87 4.00 32.43
CV Deep-SE 2.11 0.80 50.45 | TE Deep-SE 0.64 0.59 69.67
   mean 3.49 3.06 17.84 |    mean 1.14 0.91 45.86
   median 2.84 2.00 33.33 |    median 1.16 1.00 44.44

The analysis of MAE, MdAE, and SA suggests that the estimations obtained with our approach, Deep-SE, are better than those achieved by using Mean, Median, and Random estimates. Deep-SE consistently outperforms all these three baselines in all sixteen projects.

Our approach improved between 3.29% (in project MS) and 57.71% (in project BB) in terms of MAE, 11.71% (in MU) to 73.86% (in CV) in terms of MdAE, and 20.83% (in MS) to 449.02% (in MD) in terms of SA over the Mean method. The improvements of our approach over the Median method are between 2.12% (in MS) and 52.90% (in JI) in MAE, 0.50% (in MS) to 63.50% (in ME) in MdAE, and 2.70% (in DC) to 328.82% (in JI) in SA. Overall, the improvement achieved by Deep-SE over the Mean and Median methods is 34.06% and 26.77% in terms of MAE, averaging across all projects.

We note that the results achieved by the estimation models vary between different projects. For example, our Deep-SE achieved 0.64 MAE in the Talend ESB project (TE), while it achieved 5.97 MAE in the Moodle (MD) project. The distribution of story points may be the cause of this variation: the standard deviation of story points in TE is only 1.50, while that in MD is 21.65 (see Table 1).

Table 4 shows the results of the Wilcoxon test (together with the corresponding ÂXY effect size in brackets) to measure the statistical significance and effect size of the improved accuracy achieved by Deep-SE over the baselines: Mean Effort, Median Effort, and Random Guessing. In 45/48 cases, our Deep-SE significantly outperforms the baselines after applying Bonferroni correction, with effect sizes greater than 0.5. Moreover, the average of the stochastic superiority (Aiu) of our approach against the baselines is greater than 0.7 in most cases. The highest Aiu, achieved in the Talend Data Quality project (TD), is 0.86, which can be considered a large effect size (ÂXY > 0.8).

TABLE 4
Comparison on the effort estimation benchmarks using Wilcoxon test and ÂXY effect size (in brackets)

Deep-SE vs | Mean | Median | Random | Aiu
ME | <0.001 [0.77] | <0.001 [0.81] | <0.001 [0.90] | 0.83
UG | <0.001 [0.79] | <0.001 [0.79] | <0.001 [0.81] | 0.80
AS | <0.001 [0.78] | <0.001 [0.78] | <0.001 [0.91] | 0.82
AP | 0.040 [0.69] | <0.001 [0.79] | <0.001 [0.84] | 0.77
TI | <0.001 [0.77] | <0.001 [0.72] | <0.001 [0.88] | 0.79
DC | <0.001 [0.80] | 0.415 [0.54] | <0.001 [0.81] | 0.72
BB | <0.001 [0.78] | <0.001 [0.78] | <0.001 [0.85] | 0.80
CV | <0.001 [0.75] | <0.001 [0.70] | <0.001 [0.91] | 0.79
JI | <0.001 [0.76] | <0.001 [0.79] | <0.001 [0.79] | 0.78
MD | <0.001 [0.81] | <0.001 [0.75] | <0.001 [0.80] | 0.79
DM | <0.001 [0.69] | <0.001 [0.59] | <0.001 [0.75] | 0.68
MU | 0.003 [0.73] | <0.001 [0.73] | <0.001 [0.82] | 0.76
MS | 0.799 [0.56] | 0.842 [0.56] | <0.001 [0.69] | 0.60
XD | <0.001 [0.70] | <0.001 [0.70] | <0.001 [0.78] | 0.73
TD | <0.001 [0.86] | <0.001 [0.85] | <0.001 [0.87] | 0.86
TE | <0.001 [0.73] | <0.001 [0.73] | <0.001 [0.92] | 0.79

We note that the improvement brought by our approach over the baselines was not significant for project MS. One possible reason is that the size of the training and pre-training data for MS is small, and deep learning techniques tend to perform well with large training samples.

Our approach outperforms the baselines, thus passing the sanity check required by RQ1.
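The significance tests reported throughout Section 5.7 pair the absolute errors of two models on the same issues. For illustration only (SciPy; the error values and the number of tests K are made up for the sketch):

from scipy.stats import wilcoxon
import numpy as np

ae_deep_se  = np.array([0.5, 1.0, 0.2, 2.0, 1.1, 0.4])   # illustrative absolute errors
ae_baseline = np.array([1.0, 1.7, 0.5, 3.2, 1.5, 0.6])   # illustrative absolute errors

stat, p_value = wilcoxon(ae_deep_se, ae_baseline)   # Wilcoxon Signed Rank Test
K = 48                      # number of statistical tests performed (e.g. Table 4)
alpha = 0.05 / K            # Bonferroni-corrected confidence limit
print("significant" if p_value < alpha else "not significant", "p =", p_value)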

RQ2: Benefits of deep representation

Table 5 shows the MAE, MdAE, and SA achieved from Deep-SE, which uses Recurrent Highway Networks (RHWN) for deep representation of issue reports, against using Random Forests, Support Vector Machine, Automatically Transformed Linear Model, and Linear Regression coupled with LSTM (i.e. LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR). The distribution of the Absolute Error is reported in Appendix A.2. When we use MAE, MdAE, and SA as evaluation criteria, Deep-SE is still the best approach, consistently outperforming LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR across all sixteen projects. Using RHWN improved over RF between 0.91% (in MU) and 39.45% (in MD) in MAE, 5.88% (in UG) to 71.12% (in CV) in MdAE, and 0.58% (in DC) to 181.58% (in MD) in SA. The improvements of RHWN over SVM are between 1.50% (in TI) and 32.35% (in JI) in MAE, 9.38% (in MD) to 65.52% (in CV) in MdAE, and 1.30% (in TI) to 48.61% (in JI) in SA. In terms of ATLM, RHWN improved over it between 5.56% (in MS) and 62.44% (in BB) in MAE, 8.70% (in AP) to 67.87% (in CV) in MdAE, and 3.89% (in ME) to 200.59% (in BB) in SA. Overall, RHWN improved, in terms of MAE, 9.63% over SVM, 13.96% over RF, 21.84% over ATLM, and 23.24% over LR, averaging across all projects.

TABLE 5
Evaluation results of Deep-SE, LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR (the best results are highlighted in bold). MAE and MdAE - the lower the better, SA - the higher the better.

Proj Method MAE MdAE SA | Proj Method MAE MdAE SA
ME Deep-SE 1.02 0.73 59.84 | JI Deep-SE 1.38 1.09 59.52
   lstm+rf 1.08 0.90 57.57 |    lstm+rf 1.71 1.27 49.71
   lstm+svm 1.07 0.90 58.02 |    lstm+svm 2.04 1.89 40.05
   lstm+atlm 1.08 0.95 57.60 |    lstm+atlm 2.10 1.95 38.26
   lstm+lr 1.10 0.96 56.94 |    lstm+lr 2.10 1.95 38.26
UG Deep-SE 1.03 0.80 52.66 | MD Deep-SE 5.97 4.93 50.29
   lstm+rf 1.07 0.85 50.70 |    lstm+rf 9.86 9.69 17.86
   lstm+svm 1.06 1.04 51.23 |    lstm+svm 6.70 5.44 44.19
   lstm+atlm 1.40 1.20 35.55 |    lstm+atlm 9.97 9.61 16.92
   lstm+lr 1.40 1.20 35.55 |    lstm+lr 9.97 9.61 16.92
AS Deep-SE 1.36 0.58 60.26 | DM Deep-SE 3.77 2.22 47.87
   lstm+rf 1.62 1.40 52.38 |    lstm+rf 4.51 3.69 37.71
   lstm+svm 1.46 1.42 57.20 |    lstm+svm 4.20 2.87 41.93
   lstm+atlm 1.59 1.30 53.29 |    lstm+atlm 4.70 3.74 35.01
   lstm+lr 1.68 1.46 50.78 |    lstm+lr 5.30 3.66 26.68
AP Deep-SE 2.71 2.52 42.58 | MU Deep-SE 2.18 1.96 40.09
   lstm+rf 2.96 2.80 37.34 |    lstm+rf 2.20 2.21 38.73
   lstm+svm 3.06 2.90 35.26 |    lstm+svm 2.28 2.89 37.44
   lstm+atlm 3.06 2.76 35.21 |    lstm+atlm 2.46 2.39 32.51
   lstm+lr 3.75 3.66 20.63 |    lstm+lr 2.46 2.39 32.51
TI Deep-SE 1.97 1.34 55.92 | MS Deep-SE 3.23 1.99 17.17
   lstm+rf 2.32 1.97 48.02 |    lstm+rf 3.30 2.77 15.30
   lstm+svm 2.00 2.10 55.20 |    lstm+svm 3.31 3.09 15.10
   lstm+atlm 2.51 2.03 43.87 |    lstm+atlm 3.42 2.75 12.21
   lstm+lr 2.71 2.31 39.32 |    lstm+lr 3.42 2.75 12.21
DC Deep-SE 0.68 0.53 69.92 | XD Deep-SE 1.63 1.31 46.82
   lstm+rf 0.69 0.62 69.52 |    lstm+rf 1.81 1.63 40.99
   lstm+svm 0.75 0.90 67.02 |    lstm+svm 1.80 1.77 41.33
   lstm+atlm 0.87 0.59 61.57 |    lstm+atlm 1.83 1.65 40.45
   lstm+lr 0.80 0.67 64.96 |    lstm+lr 1.85 1.72 39.63
BB Deep-SE 0.74 0.61 71.24 | TD Deep-SE 2.97 2.92 48.28
   lstm+rf 1.01 1.00 60.95 |    lstm+rf 3.89 4.37 32.14
   lstm+svm 0.81 1.00 68.55 |    lstm+svm 3.49 3.37 39.13
   lstm+atlm 1.97 1.78 23.70 |    lstm+atlm 3.86 4.11 32.71
   lstm+lr 1.26 1.16 51.24 |    lstm+lr 3.79 3.67 33.88
CV Deep-SE 2.11 0.80 50.45 | TE Deep-SE 0.64 0.59 69.67
   lstm+rf 3.08 2.77 27.58 |    lstm+rf 0.66 0.65 68.51
   lstm+svm 2.50 2.32 41.22 |    lstm+svm 0.70 0.90 66.61
   lstm+atlm 3.11 2.49 26.90 |    lstm+atlm 0.70 0.72 66.51
   lstm+lr 3.36 2.76 21.07 |    lstm+lr 0.77 0.71 63.20

In addition, the results of the Wilcoxon test comparing our approach (Deep-SE) against LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR are shown in Table 6. The improvement of our approach over LSTM+RF, LSTM+SVM, and LSTM+ATLM is still significant after applying p-value correction, with the effect size greater than 0.5 in 59/64 cases. In most cases, when comparing the proposed model against LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR, the effect sizes are small (between 0.5 and 0.6). A major part of those improvements was brought by our use of the deep learning LSTM architecture to model the textual description of an issue. The use of highway recurrent networks (on top of LSTM) has also improved the predictive performance, but not with as large effects as the LSTM itself (especially for those projects which have a very small number of issues). However, our approach, Deep-SE, achieved Aiu greater than 0.6 in most cases.

TABLE 6
Comparison between the Recurrent Highway Net against Random Forests, Support Vector Machine, Automatically Transformed Linear Model, and Linear Regression using Wilcoxon test and ÂXY effect size (in brackets)

Deep-SE vs | LSTM+RF | LSTM+SVM | LSTM+ATLM | LSTM+LR | Aiu
ME | <0.001 [0.57] | <0.001 [0.54] | <0.001 [0.59] | <0.001 [0.59] | 0.57
UG | 0.004 [0.59] | 0.010 [0.55] | <0.001 [1.00] | <0.001 [0.73] | 0.72
AS | <0.001 [0.69] | <0.001 [0.51] | <0.001 [0.71] | <0.001 [0.75] | 0.67
AP | <0.001 [0.60] | <0.001 [0.52] | <0.001 [0.62] | <0.001 [0.64] | 0.60
TI | <0.001 [0.65] | 0.007 [0.51] | <0.001 [0.69] | <0.001 [0.71] | 0.64
DC | 0.406 [0.55] | 0.015 [0.60] | <0.001 [0.97] | 0.024 [0.58] | 0.68
BB | <0.001 [0.73] | 0.007 [0.60] | <0.001 [0.84] | <0.001 [0.75] | 0.73
CV | <0.001 [0.70] | 0.140 [0.63] | <0.001 [0.82] | 0.001 [0.70] | 0.71
JI | 0.006 [0.71] | 0.001 [0.67] | 0.002 [0.89] | <0.001 [0.79] | 0.77
MD | <0.001 [0.76] | <0.001 [0.57] | <0.001 [0.74] | <0.001 [0.69] | 0.69
DM | <0.001 [0.62] | <0.001 [0.56] | <0.001 [0.61] | <0.001 [0.62] | 0.60
MU | 0.846 [0.53] | 0.005 [0.62] | 0.009 [0.67] | 0.003 [0.64] | 0.62
MS | 0.502 [0.53] | 0.054 [0.50] | <0.001 [0.82] | 0.195 [0.56] | 0.60
XD | <0.001 [0.63] | <0.001 [0.57] | <0.001 [0.65] | <0.001 [0.60] | 0.61
TD | <0.001 [0.78] | <0.001 [0.68] | <0.001 [0.70] | <0.001 [0.70] | 0.72
TE | 0.020 [0.53] | 0.002 [0.59] | <0.001 [0.66] | 0.006 [0.65] | 0.61

The proposed approach of using Recurrent Highway Networks is effective in building a deep representation of issue reports and consequently improving story point estimation.

RQ3: Benefits of LSTM document representation

To study the benefits of using LSTM in representing issue reports, we compared the improved accuracy achieved by Random Forests using the features derived from LSTM against that using the features derived from BoW and Doc2vec. For a fair comparison we used Random Forests as the regressor in all settings, and the result is reported in Table 7 (see the distribution of the Absolute Error in Appendix A.3). LSTM performs better than BoW and Doc2vec with respect to the MAE, MdAE, and SA measures in twelve of the sixteen projects (e.g. ME, UG, and AS). LSTM improved 4.16% and 11.05% in MAE over Doc2vec and BoW, respectively, averaging across all projects.

Among those twelve projects, LSTM improved over BoW between 0.30% (in MS) and 28.13% (in DC) in terms of MAE, 1.06% (in AP) to 45.96% (in JI) in terms of MdAE, and 0.67% (in AP) to 47.77% (in TD) in terms of SA. It also improved over Doc2vec between 0.45% (in MU) and 18.57% (in JI) in terms of MAE, 0.71% (in AS) to 40.65% (in JI) in terms of MdAE, and 2.85% (in TE) to 31.29% (in TD) in terms of SA.
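The "feature + regressor" baselines compared here all share the same structure: a fixed-length document vector is fed to a Random Forests regressor. The sketch below (Python with scikit-learn) shows the bag-of-words variant only; the LSTM- or Doc2vec-derived document vectors would be substituted in the same place. The texts and story points are made up for illustration and this is not the actual benchmark implementation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestRegressor

train_texts = ["add oauth support to login service",
               "fix npe in scheduler",
               "migrate build scripts to gradle",
               "update error messages in cli"]
train_points = [5, 2, 8, 1]              # illustrative story points only
test_texts = ["migrate configuration to yaml"]

vectorizer = CountVectorizer()            # bag-of-words document vectors
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, train_points)
print(rf.predict(X_test))                 # estimated story points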

We acknowledge that BoW and Doc2vec perform better than LSTM in some cases. For example, in the Moodle project (MD), D2V+RF performed better than LSTM+RF in MAE and SA: it achieved 8.02 MAE and 33.19 SA. This could reflect that the combination of LSTM and RHWN significantly improves the accuracy of the estimations.

The improvement of LSTM over BoW and Doc2vec is significant after applying Bonferroni correction, with effect size greater than 0.5 in 24/32 cases and Aiu being greater than 0.5 in all projects (see Table 8).

TABLE 7
Evaluation results of LSTM+RF, BoW+RF, and Doc2vec+RF (the best results are highlighted in bold). MAE and MdAE - the lower the better, SA - the higher the better.

Proj Method MAE MdAE SA | Proj Method MAE MdAE SA
ME lstm+rf 1.08 0.90 57.57 | JI lstm+rf 1.71 1.27 49.71
   bow+rf 1.31 1.34 48.66 |    bow+rf 2.10 2.35 38.34
   d2v+rf 1.14 0.98 55.28 |    d2v+rf 2.10 2.14 38.29
UG lstm+rf 1.07 0.85 50.70 | MD lstm+rf 9.86 9.69 17.86
   bow+rf 1.19 1.28 45.24 |    bow+rf 10.20 10.22 15.07
   d2v+rf 1.12 0.92 48.47 |    d2v+rf 8.02 9.87 33.19
AS lstm+rf 1.62 1.40 52.38 | DM lstm+rf 4.51 3.69 37.71
   bow+rf 1.83 1.53 46.34 |    bow+rf 4.78 3.98 33.84
   d2v+rf 1.62 1.41 52.38 |    d2v+rf 4.71 3.99 34.87
AP lstm+rf 2.96 2.80 37.34 | MU lstm+rf 2.20 2.21 38.73
   bow+rf 2.97 2.83 37.09 |    bow+rf 2.31 2.54 36.64
   d2v+rf 3.20 2.91 32.29 |    d2v+rf 2.21 2.69 39.36
TI lstm+rf 2.32 1.97 48.02 | MS lstm+rf 3.30 2.77 15.30
   bow+rf 2.58 2.30 42.15 |    bow+rf 3.31 2.57 15.58
   d2v+rf 2.41 2.16 46.02 |    d2v+rf 3.40 2.93 12.79
DC lstm+rf 0.69 0.62 69.52 | XD lstm+rf 1.81 1.63 40.99
   bow+rf 0.96 1.11 57.78 |    bow+rf 1.98 1.72 35.56
   d2v+rf 0.77 0.77 66.14 |    d2v+rf 1.88 1.73 38.72
BB lstm+rf 1.01 1.00 60.95 | TD lstm+rf 3.89 4.37 32.14
   bow+rf 1.34 1.26 48.06 |    bow+rf 4.49 5.05 21.75
   d2v+rf 1.12 1.16 56.51 |    d2v+rf 4.33 4.80 24.48
CV lstm+rf 3.08 2.77 27.58 | TE lstm+rf 0.66 0.65 68.51
   bow+rf 2.98 2.93 29.91 |    bow+rf 0.86 0.69 58.89
   d2v+rf 3.16 2.79 25.70 |    d2v+rf 0.70 0.89 66.61

TABLE 8
Comparison of Random Forests with LSTM, Random Forests with BoW, and Random Forests with Doc2vec using Wilcoxon test and ÂXY effect size (in brackets)

LSTM vs | BoW | Doc2Vec | Aiu
ME | <0.001 [0.70] | 0.142 [0.53] | 0.62
UG | <0.001 [0.71] | 0.135 [0.60] | 0.66
AS | <0.001 [0.66] | <0.001 [0.51] | 0.59
AP | 0.093 [0.51] | 0.144 [0.52] | 0.52
TI | <0.001 [0.67] | <0.001 [0.55] | 0.61
DC | <0.001 [0.73] | 0.008 [0.59] | 0.66
BB | <0.001 [0.77] | 0.002 [0.66] | 0.72
CV | 0.109 [0.61] | 0.581 [0.57] | 0.59
JI | 0.009 [0.67] | 0.011 [0.62] | 0.65
MD | 0.022 [0.63] | 0.301 [0.51] | 0.57
DM | <0.001 [0.60] | <0.001 [0.55] | 0.58
MU | 0.006 [0.59] | 0.011 [0.57] | 0.58
MS | 0.780 [0.54] | 0.006 [0.57] | 0.56
XD | <0.001 [0.60] | 0.005 [0.55] | 0.58
TD | <0.001 [0.73] | <0.001 [0.67] | 0.70
TE | <0.001 [0.69] | 0.005 [0.61] | 0.65

The proposed LSTM-based approach is effective in automatically learning semantic features representing issue descriptions, which improves story-point estimation.

RQ4: Cross-project estimation

We performed sixteen sets of cross-project estimation experiments to test two settings: (i) within-repository: both the source and target projects (e.g. Apache Mesos and Apache Usergrid) were from the same repository, and pre-training was done using only the source projects, not the target projects; and (ii) cross-repository: the source project (e.g. Appcelerator Studio) was in a different repository from the target project (e.g. Apache Usergrid), and pre-training was done using only the source project.

Table 9 shows the performance of our Deep-SE model and ABE0 for cross-project estimation (see the distribution of the Absolute Error in Appendix A.4). We also used a benchmark of within-project estimation where older issues of the target project were used for training (see Table 3). In all cases, the proposed approach when used for cross-project estimation performed worse than when used for within-project estimation (e.g. on average a 20.75% reduction in performance for within-repository and 97.92% for cross-repository). However, our approach outperformed the cross-project baseline (i.e. ABE0) in all cases – it achieved 2.33 and 3.82 MAE in the within- and cross-repository settings, while ABE0 achieved 2.60 and 4.55 MAE. The improvement of our approach over ABE0 is still significant after applying p-value correction, with the effect size greater than 0.5 in 14/16 cases.

TABLE 9
Mean Absolute Error (MAE) on cross-project estimation and comparison of Deep-SE and ABE0 using Wilcoxon test and ÂXY effect size (in brackets)

Source | Target | Deep-SE | ABE0 | Deep-SE vs ABE0
(i) within-repository
ME | UG | 1.07 | 1.23 | <0.001 [0.78]
UG | ME | 1.14 | 1.22 | 0.012 [0.52]
AS | AP | 2.75 | 3.08 | <0.001 [0.67]
AS | TI | 1.99 | 2.56 | <0.001 [0.70]
AP | AS | 2.85 | 3.00 | 0.051 [0.55]
AP | TI | 3.41 | 3.53 | 0.003 [0.56]
MU | MS | 3.14 | 3.55 | 0.041 [0.55]
MS | MU | 2.31 | 2.64 | 0.030 [0.56]
Avg | | 2.33 | 2.60 |
(ii) cross-repository
AS | UG | 1.57 | 2.04 | 0.004 [0.61]
AS | ME | 2.08 | 2.14 | 0.022 [0.51]
MD | AP | 5.37 | 6.95 | <0.001 [0.58]
MD | TI | 6.36 | 7.10 | 0.097 [0.54]
MD | AS | 5.55 | 6.77 | <0.001 [0.61]
DM | TI | 2.67 | 3.94 | <0.001 [0.64]
UG | MS | 4.24 | 4.45 | 0.005 [0.54]
ME | MU | 2.70 | 2.97 | 0.015 [0.53]
Avg | | 3.82 | 4.55 |

These results confirm a universal understanding [25] in agile development that story point estimation is specific to teams and projects. Since story points are measured relatively, it is not uncommon that two different same-sized teams could give different estimates for the same user story. For example, team A may estimate 5 story points for user story UC1 while team B gives 10 story points.
that team’B baselines are twice bigger than team A’s, i.e. for
“baseline” user story which requires 5 times less the effort
than U C1 takes, team A would give it 1 story point while
team B gives 2 story points. Hence, historical estimates
are more valuable for within-project estimation, which is
demonstrated by this result. TABLE 10
Evaluation results on the adjusted story points (the best results are
Given the specificity of story points to teams and projects, highlighted in bold). MAE and MdAE - the lower the better, SA - the
higher the better.
our proposed approach is more effective for within-project
estimation.
Proj Method MAE MdAE SA Proj Method MAE MdAE SA
ME Deep-SE 0.27 0.03 76.58 JI Deep-SE 0.60 0.51 63.20
RQ5: Adjusted/normalized story points lstm+rf 0.34 0.15 70.43 lstm+rf 0.74 0.79 54.42
bow+rf 0.36 0.16 68.82 bow+rf 0.66 0.53 58.99
d2v+rf 0.35 0.15 69.87 d2v+rf 0.70 0.53 56.99
Table 10 shows the results of our Deep-SE and the lstm+svm 0.33 0.10 71.20 lstm+svm 0.94 0.89 41.97
other baseline methods in predicting the normalized story lstm+atlm 0.33 0.14 70.97 lstm+atlm 0.89 0.89 45.18
lstm+lr 0.37 0.21 67.68 lstm+lr 0.89 0.89 45.18
points. Deep-SE performs well across all projects. Deep-SE mean 1.12 1.07 3.06 mean 1.31 1.71 18.95
median 1.05 1.00 8.87 median 1.60 2.00 1.29
improved MAE between 2.13% to 93.40% over the Mean
UG Deep-SE 0.07 0.01 93.50 MD Deep-SE 2.56 2.29 31.83
method, 9.45% to 93.27% over the Median method, 7.02% to lstm+rf 0.08 0.00 92.59 lstm+rf 3.45 3.55 8.24
53.33% over LSTM+LR, 1.20% to 61.96% over LSTM+ATLM, bow+rf 0.11 0.01 90.31 bow+rf 3.32 3.27 11.54
d2v+rf 0.10 0.01 91.22 d2v+rf 3.39 3.48 9.70
1.20% to 53.33% over LSTM+SVM, 4.00% to 30.00% over lstm+svm 0.15 0.10 86.38 lstm+svm 3.12 3.07 16.94
lstm+atlm 0.15 0.08 86.25 lstm+atlm 3.48 3.49 7.41
Doc2vec+RF, 2.04% to 36.36% over BoW+RF, and 0.86% to lstm+lr 0.15 0.08 86.25 lstm+lr 3.57 3.28 4.98
25.80% over LSTM+RF. The best result is obtained in the mean 1.04 0.98 4.79 mean 3.60 3.67 4.18
median 1.06 1.00 2.64 median 2.95 3.00 21.48
Usergrid project (UG), it is 0.07 MAE, 0.01 MdAE, and 93.50 AS Deep-SE 0.53 0.20 69.16 DM Deep-SE 2.30 1.43 31.99
SA. We however note that the adjusted story points benefits lstm+rf 0.56 0.45 67.49 lstm+rf 2.83 2.59 16.23
bow+rf 0.56 0.49 67.39 bow+rf 2.83 2.63 16.33
all methods since it narrows the gap between minimum and d2v+rf 0.56 0.46 67.37 d2v+rf 2.92 2.80 13.80
maximum value and the distribution of the story points. lstm+svm 0.55 0.32 68.34 lstm+svm 2.45 1.78 27.56
lstm+atlm 0.57 0.46 66.87 lstm+atlm 2.83 2.57 16.28
lstm+lr 0.57 0.49 67.12 lstm+lr 2.83 2.57 16.28
Our proposed approach still outperformed other tech- mean 1.18 0.79 31.89 mean 3.27 3.41 3.25
median 1.35 1.00 21.54 median 2.61 2.00 22.94
niques in estimating the new adjusted story points.
AP Deep-SE 0.92 0.86 21.95 MU Deep-SE 0.68 0.59 63.83
lstm+rf 0.99 0.87 16.23 lstm+rf 0.70 0.55 63.01
bow+rf 1.00 0.87 15.33 bow+rf 0.70 0.57 62.79
RQ6: Compare Deep-SE against the existing approach d2v+rf 0.99 0.86 15.94 d2v+rf 0.71 0.57 62.17
lstm+svm 1.12 0.92 5.26 lstm+svm 0.70 0.62 62.62
lstm+atlm 1.03 0.84 12.63 lstm+atlm 0.93 0.74 50.77
We applied our approach, Deep-SE, and the Porru et. lstm+lr 1.17 1.05 1.14 lstm+lr 0.79 0.61 58.00
al.’s approach on their dataset consisted of eight projects. mean 1.15 0.64 2.49 mean 1.21 1.51 35.86
median 0.94 1.00 20.29 median 1.64 2.00 12.80
Table 11 shows the evaluation results in MAE and the
TI Deep-SE 0.59 0.17 56.53 MS Deep-SE 0.86 0.65 56.82
comparison of Deep-SE and the Porru et. al.’s approach. The lstm+rf 0.72 0.56 46.22 lstm+rf 0.91 0.76 54.37
bow+rf 0.73 0.58 46.10 bow+rf 0.89 0.93 55.48
distribution of the Absolute Error is reported in Appendix d2v+rf 0.72 0.56 46.17 d2v+rf 0.90 0.69 54.66
A.5. Deep-SE outperforms the existing approach in all cases. lstm+svm 0.73 0.62 45.74 lstm+svm 0.94 0.78 52.91
lstm+atlm 0.73 0.57 45.86 lstm+atlm 0.99 0.87 50.45
Deep-SE improved between 18.18% (in TIMOB) to 56.48% lstm+lr 0.73 0.56 45.77 lstm+lr 0.99 0.87 50.45
mean 1.32 1.56 1.57 mean 1.23 0.62 38.49
(in DNN) in terms of MAE. In addition, the improvement median 0.86 1.00 36.04 median 1.44 1.00 27.83
of our approach over the Porru et. al.’s approach is still DC Deep-SE 0.48 0.48 55.77 XD Deep-SE 0.35 0.08 80.66
significant after applying p-value correction with the effect lstm+rf 0.49 0.49 55.02 lstm+rf 0.44 0.37 75.78
bow+rf 0.49 0.48 54.76 bow+rf 0.45 0.38 75.33
size greater than 0.5 in all cases. Especially, the large effect d2v+rf 0.50 0.50 53.59 d2v+rf 0.45 0.32 75.31
lstm+svm 0.49 0.43 55.24 lstm+svm 0.38 0.20 79.16
size (ÂXY > 0.7) of the improvement is obtained in the lstm+atlm 0.53 0.47 51.02 lstm+atlm 0.92 0.76 49.05
lstm+lr 0.53 0.47 51.02 lstm+lr 0.45 0.40 75.33
DNN project. mean 1.07 1.49 1.29 mean 1.03 1.28 43.06
median 0.58 1.00 46.76 median 0.75 1.00 58.74
Our proposed approach outperformed the existing tech- BB Deep-SE 0.41 0.12 72.00 TD Deep-SE 0.82 0.64 53.36
nique using TF-IDF in estimating the story points. lstm+rf 0.43 0.38 70.37 lstm+rf 0.84 0.68 52.65
bow+rf 0.45 0.40 69.33 bow+rf 0.88 0.65 50.30
5.8 Training/testing time

Deep learning models are known for taking a long time to train. This is an important factor in considering adopting our approach, especially in an agile development setting. If training takes longer than the duration of a sprint (e.g. one or two weeks), the prediction system would not be useful in practice. We have found that the training time of our model was very small, ranging from 13 minutes to 40 minutes with an average of 22 minutes across the 16 projects (see Table 12). Pre-training took much longer, but it was done only once per repository and took just below 7 hours at the maximum.

Once the model was trained, getting an estimation from it was very fast. As can be seen from Table 12, the time it took for testing all issues in the test sets was in the order of seconds. Hence, for a given new issue, it would take less than a second for the machinery to come back with a story point estimation. All the experiments were run on a MacOS laptop with a 2.4 GHz Intel Core i5 and 8 GB of RAM, using 50 embedding dimensions. This result suggests that using our proposed approach to estimate story points is applicable in practice.

TABLE 12
The pre-training, training, and testing time at 50 embedding dimensions of our Deep-SE model

Repository | Pre-training time | Proj. | Training time | Testing time
Apache | 6 h 28 min | ME | 23 min | 1.732 s
 | | UG | 15 min | 0.395 s
Appcelerator | 5 h 11 min | AS | 27 min | 2.209 s
 | | AP | 18 min | 0.428 s
 | | TI | 32 min | 2.528 s
Duraspace | 3 h 34 min | DC | 18 min | 1.475 s
Jira | 6 h 42 min | BB | 15 min | 0.267 s
 | | CV | 14 min | 0.219 s
 | | JI | 13 min | 0.252 s
Moodle | 6 h 29 min | MD | 15 min | 1.789 s
Lsstcorp | 3 h 26 min | DM | 40 min | 5.293 s
Mulesoft | 2 h 39 min | MU | 21 min | 0.535 s
 | | MS | 17 min | 0.718 s
Spring | 5 h 20 min | XD | 40 min | 2.774 s
Talendforge | 6 h 56 min | TD | 19 min | 1.168 s
 | | TE | 16 min | 0.591 s

5.9 Verifiability

We have created a replication package and made it available at http://www.dsl.uow.edu.au/sasite/index.php/storypoint/. The package contains the full dataset and the source code of our Deep-SE model and the benchmark models (i.e. the baselines, LSTM+RF, Doc2vec+RF, BoW+RF, LSTM+SVM, and LSTM+ATLM). On this website, we also provide detailed instructions on how to run the code and replicate all the experiments we reported in this paper so that our results can be independently verified.

5.10 Threats to validity

We tried to mitigate threats to construct validity by using real world data from issues recorded in large open source projects. We collected the title and description provided with these issue reports and the actual story points that were assigned to them. We are aware that those story points were estimated by human teams, and thus may contain biases and in some cases may not be accurate. We have mitigated this threat by performing two sets of experiments: one on the original story points and the other on the adjusted, normalized story points. We further note that for story points, the raw values are not as important as the relative values [80]. A user story that is assigned 6 story points should require three times as much effort as a user story that is assigned 2 story points. Hence, when engineers determine an estimate for a new issue, they need to compare the issue to other issues in the past in order to make the estimation consistently. The problem is thus suitable for a machine learner. The trained prediction system works in a similar manner to human engineers: it uses past estimates as baselines for new estimation and tries to reproduce an estimate that human engineers would arrive at. However, since we aim to mimic the team's capability in effort estimation, the current set of ground-truths sufficiently serves this purpose. When other sets of ground-truths become available, our model can be easily retrained.

To minimize threats to conclusion validity, we carefully selected unbiased error measures, applied a number of statistical tests, and applied multiple-testing correction to verify our assumptions [81]. Our study was performed on datasets of different sizes. In addition, we carefully followed recent best practices in evaluating effort estimation models [55, 57, 74] to decrease conclusion instability [82].

The original implementation of Porru et al.'s method [64] was not released, thus we have re-implemented our own version of their approach. We strictly followed the description provided in their work; however, we acknowledge that our implementation may not reflect all the implementation details of their approach. To mitigate this threat, we have tested our implementation using the dataset provided in their work. We have found that our results were consistent with the results reported in their work.

To mitigate threats to external validity, we have considered 23,313 issues from sixteen open source projects, which differ significantly in size, complexity, team of developers, and community. We however acknowledge that our dataset would not be representative of all kinds of software projects, especially in commercial settings (although open source projects and commercial projects are similar in many aspects). One of the key differences between open source and commercial projects that may affect the estimation of story points is the nature of contributors, developers, and project stakeholders. Further investigation for commercial agile projects is needed.

5.11 Implications

In this section, we discuss a number of implications of our results.

What do the results mean for the research on effort estimation? Existing work on effort estimation mainly focuses on estimating a whole project with a small number of data points (see the datasets in the PROMISE repository [83] for example).

The fast emergence of agile development demands more research on estimation at the issue or user story level. Our work opens a new research area for the use of software analytics in estimating story points. The assertion demonstrated by our results is that our current method works, and no other method has been demonstrated to work at this scale of above 23,000 data points. Existing work in software effort estimation has dealt with a much smaller number of observations (i.e. data points) than our work did. For example, the China dataset has only 499 data points, Desharnais has 77, and Finnish has 38 (see the datasets for effort estimation in the PROMISE repository); these are commonly used in existing effort estimation work (e.g. [19, 84]). By contrast, in this work we deal with the scale of thousands of data points. Since we make our dataset publicly available, further research (e.g. modeling the codebase and adding team-specific features to the estimation model) can be advanced in this topic, and our current results can serve as the baseline.

Should we adopt deep learning? To the best of our knowledge, our work is the first major research effort in using deep learning for effort estimation. The use of deep learning has allowed us to automatically learn a good representation of an issue report and use this for estimating the effort of resolving the issue. The evaluation results demonstrate the significant improvement that our deep learning approach has brought in terms of predictive performance. This is a powerful result since it helps software practitioners move away from the manual feature engineering process. Feature engineering usually relies on domain experts who use their specific knowledge of the data to create features for machine learners to work with. In our approach, features are automatically learned from the textual description of an issue, thus obviating the need to design them manually. We of course need to collect the labels (i.e. story points assigned to issues) as the ground truths used for learning and testing. Hence, we believe that the wide adoption of software analytics in industry crucially depends on the ability to automatically derive (learn) features from raw software engineering data.

In our context of story point estimation, if the number of new words is large, transfer learning is needed, e.g. by using the existing model as a strong prior for the new model. However, this can be mitigated by pre-training on a large corpus so that most of the terms are covered. After pre-training, our model is able to automatically learn semantic relations between words. For example, words related to networking like "soap", "configuration", "tcp", and "load" are in one cluster (see Figure 7). Hence, even when a user story has several unique terms (that are already pre-trained), retraining the main model is not necessary. Pre-training may however take time and effort. One potential research direction is therefore building up a community for sharing pre-trained networks, which can be used for initialization, thus reducing training times (similar to Model Zoo [85]). As a first step towards this direction, we make our pre-trained models publicly available for the research community.

We acknowledge that the explainability of a model is important for the full adoption of machine learning techniques. This is not a problem unique to recurrent networks (RNN); it also applies to many powerful modern machine learning techniques (e.g. Random Forests and SVM). However, an RNN is not entirely the black box it may seem (e.g. see [86]). For example, word importance can be credited using various techniques (e.g., using the gradient with respect to word values). Alternatively, there are model-agnostic techniques to explain any prediction [87]. Even with a partly interpretable RNN, if the prediction is accurate, then we can still expect a high level of adoption.

What do the results mean for project managers and developers? Our positive results indicate that it is possible to build a prediction system to support project managers and developers in estimating story points. Our proposal enables teams to be consistent in their estimation of story points. Achieving this consistency is central to effectively leveraging story points for project planning. The machine learner learns from past estimates made by the specific team which it is deployed to assist. The insights that the learner acquires are therefore team-specific. The intent is not to have the machine learner supplant existing agile estimation practices. The intent, instead, is to deploy the machine learner to complement these practices by playing the role of a decision support system. Teams would still meet, discuss user stories and generate estimates as per current practice, but would have the added benefit of access to the insights acquired by the machine learner. Teams would be free to reject the suggestions of the machine learner, as is the case with any decision support system. In every such estimation exercise, the actual estimates generated are recorded as data to be fed to the machine learner, independent of whether these estimates are based on the recommendations of the machine learner or not. This estimation process helps the team not only understand sufficient details about what it will take to resolve those issues, but also align with their previous estimations.

6 RELATED WORK

Existing estimation methods can generally be classified into three major groups: expert-based, model-based, and hybrid approaches. Expert-based methods rely on human expertise to make estimations, and are the most popular technique in practice [88, 89]. Expert-based estimation however tends to require large overheads and the availability of experts each time an estimation needs to be made. Model-based approaches use data from past projects, but they also vary in terms of building customized models or using fixed models. The well-known construction cost (COCOMO) model [11] is an example of a fixed model where the factors and their relationships are already defined. Such estimation models were built based on data from a range of past projects. Hence, they tend to be suitable only for the kinds of project that were used to build the model. The customized model building approach requires context-specific data and uses various methods such as regression (e.g. [12, 13]), Neural Networks (e.g. [14, 90]), Fuzzy Logic (e.g. [15]), Bayesian Belief Networks (e.g. [16]), analogy-based methods (e.g. [17, 18]), and multi-objective evolutionary approaches (e.g. [19]).

It is however likely that no single method will be the best performer for all project types [10, 20, 91]. Hence, some recent work (e.g. [20]) proposes to combine the estimates from multiple estimators.

Hybrid approaches (e.g. [21, 22]) combine expert judgements with the available data, similarly to the notions of our proposal.

While most existing work focuses on estimating a whole project, little work has been done in building models specifically for agile projects. Today's agile, dynamic and change-driven projects require different approaches to planning and estimating [24]. Some recent approaches leverage machine learning techniques to support effort estimation for agile projects. Recently, the work in [64] proposed an approach which extracts TF-IDF features from issue descriptions to develop a story-point estimation model. A univariate feature selection technique is then applied to the extracted features, which are fed into classifiers (e.g. SVM). In addition, the work in [92] applied Cosmic Function Points (CFP) [93] to estimate the effort of completing an agile project. The work in [94] developed an effort prediction model for the iterative software development setting using regression models and neural networks. Differing from traditional effort estimation models, this model is built after each iteration (rather than at the end of a project) to estimate the effort for the next iteration. The work in [95] built a Bayesian network model for effort prediction in software projects which adhere to the agile Extreme Programming method. Their model however relies on several parameters (e.g. process effectiveness and process improvement) that require learning and extensive fine tuning. Bayesian networks are also used in [96] to model dependencies between different factors (e.g. sprint progress and sprint planning quality influence product quality) in Scrum-based software development projects in order to detect problems in the project. Our work specifically focuses on estimating issues with story points, using deep learning techniques to automatically learn semantic features representing the actual meaning of issue descriptions, which is the key difference from previous work. Previous research (e.g. [97–100]) has also been done on predicting the elapsed time for fixing a bug or the delay risk of resolving an issue. However, effort estimation using story points is the more preferable practice in agile development.

LSTM has shown successes in many applications such as language models [35], speech recognition [36] and video analysis [37]. Our Deep-SE is generic in that it maps text to a numerical score or a class, and can be used for other tasks, e.g. mapping a movie review to a score, assigning scores to essays, or sentiment analysis. Deep learning has recently attracted increasing interest in software engineering. Our previous work [101] proposed a generic deep learning framework based on LSTM for modeling software and its development process. White et al. [102] employed recurrent neural networks (RNN) to build a language model for source code. Their later work [103] extended these RNN models for detecting code clones. The work in [104] also used RNNs to build a statistical model for code completion. Our recent work [105] used LSTM to build a language model for code and demonstrated the improvement of this model compared to the one using RNNs. Gu et al. [106] used a special RNN Encoder–Decoder, which consists of an encoder RNN to process the input sequence and a decoder RNN with attention to generate the output sequence. This model takes as input a given API-related natural language query and returns API usage sequences. The work in [107] also uses an RNN Encoder–Decoder, but for fixing common errors in C programs. The Deep Belief Network [108] is another common deep learning model which has been used in software engineering, e.g. for building defect prediction models [109, 110].

7 CONCLUSION

In this paper, we have contributed to the research community the dataset for story point estimation, sourced from 16 large and diverse software projects. We have also proposed a deep learning-based, fully end-to-end prediction system for estimating story points, removing the need for users to manually design features from the textual description of issues. A key novelty of our approach is the combination of two powerful deep learning architectures: Long Short-Term Memory (to learn a vector representation for issue reports) and Recurrent Highway Network (for building a deep representation).

The proposed approach has consistently outperformed three common baselines and four alternatives according to our evaluation results. Compared against the Mean and Median techniques, the proposed approach has improved 34.06% and 26.77% respectively in MAE, averaging across the 16 projects we studied. Compared against the BoW and Doc2Vec techniques, our approach has improved 23.68% and 17.90% in MAE. These are significant results in the literature of effort estimation. A major part of those improvements was brought by our use of the deep learning LSTM architecture to model the textual description of an issue. The use of highway recurrent networks (on top of LSTM) has also improved the predictive performance, but not as significantly as the LSTM itself (especially for those projects which have a very small number of issues).

Our future work would involve expanding our study to commercial software projects and other large open source projects to further evaluate our proposed method. We also consider performing team analytics (e.g. features characterizing a team) to model team changes over time and feed them into our prediction model. We also plan to investigate how to learn a semantic representation of the codebase and use it as another input to our model. Furthermore, we will look into experimenting with a sliding window setting to explore incremental learning. In addition, we will also investigate how to best use the issue's metadata (e.g. priority and type) while still maintaining the end-to-end nature of our entire model. Our future work also involves comparing our use of the LSTM model against other state-of-the-art models of natural language such as paragraph2vec [59] or Convolutional Neural Networks [111]. We have discussed (informally) our work with several software developers who have been practising agile and estimating story points. They all agreed that our prediction system could be useful in practice. However, to make such a claim, we need to implement it into a tool and perform a user study. Hence, we would like to evaluate empirically the impact of our prediction system for story point estimation in practice by project managers and/or software developers. This would involve developing the model into a tool (e.g. a JIRA plugin) and then organising trial use in practice. This is an important part of our future work to confirm the ultimate benefits of our approach in general.

REFERENCES

[1] B. Michael, S. Blumberg, and J. Laartz, "Delivering large-scale IT projects on time, on budget, and on value," 2012. [Online]. Available: http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/delivering-large-scale-it-projects-on-time-on-budget-and-on-value
[2] B. Flyvbjerg and A. Budzier, "Why Your IT Project May Be Riskier Than You Think," Harvard Business Review, vol. 89, no. 9, pp. 601–603, 2011.
[3] A. Trendowicz and R. Jeffery, Software project effort estimation: Foundations and best practice guidelines for success. Springer, 2014.
[4] L. C. Briand and I. Wieczorek, "Resource estimation in software engineering," Encyclopedia of Software Engineering, 2002.
[5] E. Kocaguneli, A. T. Misirli, B. Caglayan, and A. Bener, "Experiences on developer participation and effort estimation," in Proceedings of the 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2011, pp. 419–422.
[6] M. Jorgensen, "What we do and don't know about software development effort estimation," IEEE Software, vol. 31, no. 2, pp. 37–40, 2014.
[7] S. McConnell, Software estimation: demystifying the black art. Microsoft Press, 2006.
[8] T. Menzies, Z. Chen, J. Hihn, and K. Lum, "Selecting best practices for effort estimation," IEEE Transactions on Software Engineering, vol. 32, no. 11, pp. 883–895, 2006.
[9] I. Sommerville, Software Engineering, 9th ed. Pearson, 2010.
[10] M. Jørgensen and M. Shepperd, "A Systematic Review of Software Development Cost Estimation Studies," IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 33–53, 2007.
[11] B. W. Boehm, R. Madachy, and B. Steece, Software cost estimation with Cocomo II. Prentice Hall PTR, 2000.
[12] P. Sentas, L. Angelis, and I. Stamelos, "Multinomial Logistic Regression Applied on Software Productivity Prediction," in Proceedings of the 9th Panhellenic Conference in Informatics, 2003, pp. 1–12.
[13] P. Sentas, L. Angelis, I. Stamelos, and G. Bleris, "Software productivity and effort prediction with ordinal regression," Information and Software Technology, vol. 47, no. 1, pp. 17–29, 2005.
[14] S. Kanmani, J. Kathiravan, S. S. Kumar, M. Shanmugam, and P. E. College, "Neural Network Based Effort Estimation using Class Points for OO Systems," Evaluation, 2007.
[15] S. Kanmani, J. Kathiravan, S. S. Kumar, and M. Shanmugam, "Class point based effort estimation of OO systems using Fuzzy Subtractive Clustering and Artificial Neural Networks," in Proceedings of the 1st India Software Engineering Conference (ISEC), 2008, pp. 141–142.
[16] S. Bibi, I. Stamelos, and L. Angelis, "Software cost prediction with predefined interval estimates," in Proceedings of the First Software Measurement European Forum, Rome, Italy, 2004, pp. 237–246.
[17] M. Shepperd and C. Schofield, "Estimating Software Project Effort Using Analogies," IEEE Transactions on Software Engineering, vol. 23, no. 12, pp. 736–743, 1997.
[18] L. Angelis and I. Stamelos, "A Simulation Tool for Efficient Analogy Based Cost Estimation," Empirical Software Engineering, vol. 5, no. 1, pp. 35–68, 2000.
[19] F. Sarro, A. Petrozziello, and M. Harman, "Multi-objective Software Effort Estimation," in Proceedings of the 38th International Conference on Software Engineering (ICSE), 2016, pp. 619–630.
[20] E. Kocaguneli, T. Menzies, and J. W. Keung, "On the value of ensemble effort estimation," IEEE Transactions on Software Engineering, vol. 38, no. 6, pp. 1403–1416, 2012.
[21] R. Valerdi, "Convergence of expert opinion via the wideband delphi method: An application in cost estimation models," 2011.
[22] S. Chulani, B. Boehm, and B. Steece, "Bayesian analysis of empirical software engineering cost models," IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 573–583, 1999.
[23] H. F. Cervone, "Understanding agile project management methods using Scrum," OCLC Systems & Services: International Digital Library Perspectives, vol. 27, no. 1, pp. 18–22, 2011.
[24] M. Cohn, Agile estimating and planning. Pearson Education, 2005.
[25] M. Usman, E. Mendes, F. Weidt, and R. Britto, "Effort Estimation in Agile Software Development: A Systematic Literature Review," in Proceedings of the 10th International Conference on Predictive Models in Software Engineering (PROMISE), 2014, pp. 82–91.
[26] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, "Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities," IEEE Transactions on Software Engineering, vol. 37, no. 6, pp. 772–787, 2011.
[27] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: a large scale experiment on data vs. domain vs. process," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. ACM, 2009, pp. 91–100.
[28] Atlassian, "Atlassian JIRA Agile software," 2016. [Online]. Available: https://www.atlassian.com/software/jira
[29] Spring, "Spring XD issue XD-2970," 2016. [Online]. Available: https://jira.spring.io/browse/XD-2970
[30] J. Grenning, Planning poker or how to avoid analysis paralysis while release planning, 2002, vol. 3.
[31] M. Choetkiertikul, H. K. Dam, T. Tran, A. Ghose, and J. Grundy, "Predicting Delivery Capability in Iterative Software Development," IEEE Transactions on Software Engineering, pp. 1–1, 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7898472/
[32] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[33] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[34] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.
[35] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM Neural Networks for Language Modeling," in INTERSPEECH, 2012, pp. 194–197.
[36] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
[37] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.
[38] T. Pham, T. Tran, D. Phung, and S. Venkatesh, "Faster training of very deep networks via p-norm gates," in Proceedings of the 23rd International Conference on Pattern Recognition, 2016, to appear.
[39] T. Pham, T. Tran, D. Phung, and S. Venkatesh, "Predicting healthcare trajectories from medical records: A deep learning approach," Journal of Biomedical Informatics, vol. 69, pp. 218–229, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1532046417300710
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[41] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[42] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[43] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher, "Ask me anything: Dynamic memory networks for natural language processing," arXiv preprint arXiv:1506.07285, 2015.
[44] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[45] S. Liang and R. Srikant, "Why Deep Neural Networks for Function Approximation?" 2016.
[46] H. Mhaskar, Q. Liao, and T. A. Poggio, "When and Why Are Deep Networks Better Than Shallow Ones?" in AAAI, 2017, pp. 2343–2349.
[47] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, "On the expressive power of deep neural networks," in ICML, 2017.
[48] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, "On the number of linear regions of deep neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 2924–2932.
[49] M. Bianchini and F. Scarselli, "On the complexity of neural network classifiers: A comparison between shallow and deep architectures," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 8, pp. 1553–1565, 2014.
[50] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," arXiv preprint arXiv:1507.06228, 2015.
[51] J. Schmidhuber, "Deep Learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.neunet.2014.09.003
[52] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 307–361, 2012.
[53] The Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.0, 2016. [Online]. Available: http://deeplearning.net/software/theano
[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[55] M. Shepperd and S. MacDonell, "Evaluating prediction systems in software project estimation," Information and Software Technology, vol. 54, no. 8, pp. 820–827, 2012. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.12.008
[56] R. Moraes, J. F. Valiati, and W. P. Gavião Neto, "Document-level sentiment classification: An empirical comparison between SVM and ANN," Expert Systems with Applications, vol. 40, no. 2, pp. 621–633, 2013.
[57] P. A. Whigham, C. A. Owen, and S. G. Macdonell, "A Baseline Model for Software Effort Estimation," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 24, no. 3, p. 20, 2015.
[58] P. Tirilly, V. Claveau, and P. Gros, "Language modeling for bag-of-visual words image categorization," in Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, 2008, pp. 249–258.
[59] Q. Le and T. Mikolov, "Distributed Representations of Sentences and Documents," in Proceedings of the 31st International Conference on Machine Learning (ICML), vol. 32, 2014, pp. 1188–1196.
[60] E. Kocaguneli and T. Menzies, "Exploiting the Essential Assumptions of Analogy-Based Effort Estimation," IEEE Transactions on Software Engineering, vol. 38, no. 2, pp. 425–438, 2012.
[61] E. Kocaguneli, T. Menzies, and E. Mendes, "Transfer learning in effort estimation," Empirical Software Engineering, vol. 20, no. 3, pp. 813–843, 2015.
[62] E. Mendes, I. Watson, and C. Triggs, "A Comparative Study of Cost Estimation Models for Web Hypermedia Applications," Empirical Software Engineering, vol. 8, pp. 163–196, 2003.
[63] Y. F. Li, M. Xie, and T. N. Goh, "A Study of Project Selection and Feature Weighting for Analogy Based Software Cost Estimation," Journal of Systems and Software, vol. 82, no. 2, pp. 241–252, Feb 2009. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2008.06.001
[64] S. Porru, A. Murgia, S. Demeyer, and M. Marchesi, "Estimating Story Points from Issue Reports," 2016.
[65] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc., 1986.
[66] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, "A simulation study of the model evaluation criterion MMRE," IEEE Transactions on Software Engineering, vol. 29, no. 11, pp. 985–995, 2003.
[67] B. Kitchenham, L. Pickard, S. MacDonell, and M. Shepperd, "What accuracy statistics really measure," IEE Proceedings - Software, vol. 148, no. 3, p. 81, 2001.
[68] M. Korte and D. Port, "Confidence in software cost estimation results based on MMRE and PRED," in Proceedings of the 4th International Workshop on Predictor Models in Software Engineering (PROMISE), 2008, pp. 63–70.
[69] D. Port and M. Korte, "Comparative studies of the model evaluation criterions MMRE and PRED in software cost estimation research," in Proceedings of the 2nd ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 2008, pp. 51–60.
[70] T. Menzies, E. Kocaguneli, B. Turhan, L. Minku, and F. Peters, Sharing data and models in software engineering. Morgan Kaufmann, 2014.
[71] K. Muller, "Statistical power analysis for the behavioral sciences," Technometrics, vol. 31, no. 4, pp. 499–500, 1989.
[72] H. Abdi, "The Bonferonni and Sidak Corrections for Multiple Comparisons," Encyclopedia of Measurement and Statistics, vol. 1, pp. 1–9, 2007. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.8747&rep=rep1&type=pdf
[73] A. Vargha and H. D. Delaney, "A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong," Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000. [Online]. Available: http://jeb.sagepub.com/cgi/doi/10.3102/10769986025002101
[74] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in Proceedings of the 33rd International Conference on Software Engineering (ICSE), 2011, pp. 1–10.
[75] L. van der Maaten and G. Hinton, "Visualizing high-dimensional data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, Nov 2008.
[76] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research, vol. 11, pp. 625–660, Mar. 2010.
[77] J. Weston, F. Ratle, and R. Collobert, "Deep learning via semi-supervised embedding," in Proceedings of the 25th International Conference on Machine Learning (ICML '08). New York, NY, USA: ACM, 2008, pp. 1168–1175. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390303
[78] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, pp. 2493–2537, Nov. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2078186
[79] D. Zwillinger and S. Kokoska, CRC standard probability and statistics tables and formulae. CRC Press, 1999.
[80] J. McCarthy, "From here to human-level AI," Artificial Intelligence, vol. 171, no. 18, pp. 1174–1182, 2007.
[81] A. Arcuri and L. Briand, "A Hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering," Software Testing, Verification and Reliability, vol. 24, no. 3, pp. 219–250, 2014.
[82] T. Menzies and M. Shepperd, "Special issue on repeatable results in software engineering prediction," Empirical Software Engineering, vol. 17, no. 1-2, pp. 1–17, 2012.
[83] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE Repository of empirical software engineering data," 2012.
[84] P. L. Braga, A. L. I. Oliveira, and S. R. L. Meira, "Software Effort Estimation using Machine Learning Techniques with Robust Confidence Intervals," in Proceedings of the 7th International Conference on Hybrid Intelligent Systems (HIS), 2007, pp. 352–357.
[85] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," arXiv preprint arXiv:1408.5093, 2014.
[86] A. Karpathy, J. Johnson, and L. Fei-Fei, "Visualizing and Understanding Recurrent Networks," arXiv preprint arXiv:1506.02078, 2015, pp. 1–12. [Online]. Available: http://arxiv.org/abs/1506.02078
[87] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why Should I Trust You?": Explaining the Predictions of Any Classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144. [Online]. Available: http://arxiv.org/abs/1602.04938
[88] M. Jorgensen, "A review of studies on expert estimation of software development effort," Journal of Systems and Software, vol. 70, no. 1-2, pp. 37–60, 2004.
[89] M. Jorgensen and T. M. Gruschke, "The impact of lessons-learned sessions on effort estimation and uncertainty assessments," IEEE Transactions on Software Engineering, vol. 35, no. 3, pp. 368–383, 2009.
[90] A. Panda, S. M. Satapathy, and S. K. Rath, "Empirical Validation of Neural Network Models for Agile Software Effort Estimation based on Story Points," Procedia Computer Science, vol. 57, pp. 772–781, 2015.
[91] F. Collopy, "Difficulty and complexity as factors in software effort estimation," International Journal of Forecasting, vol. 23, no. 3, pp. 469–471, 2007.
[92] C. Commeyne, A. Abran, and R. Djouab, "Effort Estimation with Story Points and COSMIC Function Points - An Industry Case Study," pp. 25–36, 2008. [Online]. Available: http://cosmic-sizing.org/wp-content/uploads/2016/03/Estimation-model-v-Print-Format-adapter.pdf
[93] COSMIC Group, International Standard ISO/IEC, Software engineering - COSMIC: a functional size measurement method, 2011.
[94] P. Abrahamsson, R. Moser, W. Pedrycz, A. Sillitti, and G. Succi, "Effort prediction in iterative software development processes – incremental versus global prediction models," in 1st International Symposium on Empirical Software Engineering and Measurement (ESEM), 2007, pp. 344–353.
[95] P. Hearty, N. Fenton, D. Marquez, and M. Neil, "Predicting Project Velocity in XP Using a Learning Dynamic Bayesian Network Model," IEEE Transactions on Software Engineering, vol. 35, no. 1, pp. 124–137, 2009.
[96] M. Perkusich, H. De Almeida, and A. Perkusich, "A model to detect problems on scrum-based software development projects," in The ACM Symposium on Applied Computing, 2013, pp. 1037–1042.
[97] E. Giger, M. Pinzger, and H. Gall, "Predicting the fix time of bugs," in Proceedings of the 2nd International Workshop on Recommendation Systems for Software Engineering (RSSE). ACM, 2010, pp. 52–56.
[98] L. D. Panjer, "Predicting Eclipse Bug Lifetimes," in Proceedings of the 4th International Workshop on Mining Software Repositories (MSR), 2007, pp. 29–32.
[99] P. Bhattacharya and I. Neamtiu, "Bug-fix time prediction models: can we do better?" in Proceedings of the 8th Working Conference on Mining Software Repositories (MSR). ACM, 2011, pp. 207–210.
[100] P. Hooimeijer and W. Weimer, "Modeling bug report quality," in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM Press, Nov 2007, pp. 34–44.
[101] H. K. Dam, T. Tran, J. Grundy, and A. Ghose, "DeepSoft: A vision for a deep model of software," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '16). ACM, 2016.
[102] M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk, "Toward Deep Learning Software Repositories," in Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), 2015, pp. 334–345.
[103] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, "Deep learning code fragments for code clone detection," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). New York, NY, USA: ACM, 2016, pp. 87–98. [Online]. Available: http://doi.acm.org/10.1145/2970276.2970326
[104] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14), 2014, pp. 419–428. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2594291.2594321
[105] H. Dam, T. Tran, and T. Pham, "A deep language model for software code," arXiv preprint arXiv:1608.02715, pp. 1–4, 2016. [Online]. Available: http://arxiv.org/abs/1608.02715
[106] X. Gu, H. Zhang, D. Zhang, and S. Kim, "Deep API Learning," in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). New York, NY, USA: ACM, 2016, pp. 631–642. [Online]. Available: http://doi.acm.org/10.1145/2950290.2950334
[107] R. Gupta, S. Pal, A. Kanade, and S. Shevade, "DeepFix: Fixing common C language errors by deep learning," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA. AAAI Press, 2017, pp. 1345–1351. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14603
[108] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[109] S. Wang, T. Liu, and L. Tan, "Automatically learning semantic features for defect prediction," in Proceedings of the International Conference on Software Engineering (ICSE), 2016, pp. 297–308. [Online]. Available: http://dx.doi.org/10.1145/2884781.2884804
[110] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep Learning for Just-in-Time Defect Prediction," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS), 2015, pp. 17–26.
[111] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A Convolutional Neural Network for Modelling Sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pp. 655–665.

Morakot Choetkiertikul Biography text here.

Hoa Khanh Dam Biography text here.

Truyen Tran Biography text here.

Trang Pham Biography text here.

Aditya Ghose Biography text here.
Tim Menzies Biography text here.