0% found this document useful (0 votes)

167 views20 pages

Story Point PDF

This document summarizes a research paper that proposes a deep learning model called Deep-SE for estimating story points, which are a common unit used to estimate the effort required to complete user stories in agile software development. Deep-SE uses a combination of long short-term memory and recurrent highway networks to predict story points for new user stories based on a dataset of over 23,000 user stories from 16 open source projects, without requiring manual feature engineering. An evaluation found that Deep-SE outperformed other baseline and alternative methods in estimating story points.

Uploaded by

nirvana not

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

167 views20 pages

Story Point PDF

Uploaded by

nirvana not

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO.

7, JULY 2019 637

A Deep Learning Model for Estimating

Story Points
Morakot Choetkiertikul , Hoa Khanh Dam , Truyen Tran , Trang Pham ,
Aditya Ghose, and Tim Menzies

Abstract—Although there has been substantial research in software analytics for effort estimation in traditional software projects,
little work has been done for estimation in agile projects, especially estimating the effort required for completing user stories or issues.
Story points are the most common unit of measure used for estimating the effort involved in completing a user story or resolving an
issue. In this paper, we propose a prediction model for estimating story points based on a novel combination of two powerful deep
learning architectures: long short-term memory and recurrent highway network. Our prediction system is end-to-end trainable
from raw input data to prediction outcomes without any manual feature engineering. We offer a comprehensive dataset for story
points-based estimation that contains 23,313 issues from 16 open source projects. An empirical evaluation demonstrates that our
approach consistently outperforms three common baselines (Random Guessing, Mean, and Median methods) and six alternatives
(e.g., using Doc2Vec and Random Forests) in Mean Absolute Error, Median Absolute Error, and the Standardized Accuracy.

Index Terms—Software analytics, effort estimation, story point estimation, deep learning

Ç
1 INTRODUCTION

E FFORT estimation is an important part of software proj-

ect management, particularly for planning and monitor-
ing a software project. Cost and schedule overruns have
incorrect estimates may have adverse impact on the project
outcomes [3], [7], [8], [9]. Research in software effort estima-
tion dates back several decades and they can generally be
been a common risk in software projects. Mckinsey and divided into model-based methods, expert-based methods,
the University of Oxford has conducted a study on 5,400 and hybrid methods which combine model-based and
large scale IT projects, and found that on average large soft- expert-based methods [10]. Model-based approaches lever-
ware projects run 66 percent over budget and 33 percent ages data from old projects to make predictions about new
overtime [1]. A different study on 1,471 software projects [2] projects. Expert-based methods rely on human expertise to
revealed similar findings: one in six software projects has a make such judgements. Most of the existing work (e.g., [11],
budget overrun of 200 percent and a schedule overrun of [12], [13], [14], [15], [16], [17], [18], [19], [20], [20], [21], [22])
almost 70 percent. Activities involving effort estimation focus on the effort required for completing a whole project
forms a critical part in planning and managing a software (as opposed to user stories or issues). These approaches esti-
project to ensure it complete in time and within budget [3], mate the effort required for developing a complete software
[4], [5]. Effort estimates may be used by different stake- system, relying on a set of features manually designed for
holders as input for developing project plans, scheduling characterizing a software project.
iteration or release plans, budgeting, and costing [6]. Hence, In modern agile development settings, software is devel-
oped through repeated cycles (iterative) and in smaller
parts at a time (incremental), allowing for adaptation to
M. Choetkiertikul is with the School of Computing and Information Tech- changing requirements at any point during a project’s life.
nology, Faculty of Engineering and Information Sciences, University of A project has a number of iterations (e.g., sprints in Scrum
Wollongong, Wollongong, NSW 2522, Australia and with the Faculty
of Information and Communication Technology, Mahidol University,
[23]). An iteration is usually a short (usually 2–4 weeks)
Nakhonpathom 73170, Thailand. E-mail: [email protected]. period in which the development team designs, imple-
H.K. Dam and A. Ghose are with the School of Computing and ments, tests and delivers a distinct product increment, e.g.,
Information Technology, Faculty of Engineering and Information Sciences, a working milestone version or a working release. Each iter-
University of Wollongong, Wollongong, NSW 2522, Australia.
E-mail: {hoa, aditya}@uow.edu.au. ation requires the completion of a number of user stories,
T. Tran and T. Pham are with the School of Information Technology, which are a common way for agile teams to express user
Deakin University, Waurn Ponds, VIC 3216, Australia. requirements. This is a shift from a model where all func-
E-mail: {truyen.tran, phtra}@deakin.edu.au.
tionalities are delivered together (in a single delivery) to a
T. Menzies is with North Carolina State University, Raleigh, NC 27695.
E-mail: [email protected]. model involving a series of incremental deliveries.
Manuscript received 22 Dec. 2016; revised 19 Dec. 2017; accepted 28 Dec. There is thus a need to focus on estimating the effort of
2017. Date of publication 11 Jan. 2018; date of current version 24 July 2019. completing a single user story at a time rather than the entire
(Corresponding author: Morakot Choetkiertikul.) project. In fact, it has now become a common practice for agile
Recommended for acceptance by A. Zeller. teams to go through each user story and estimate the effort
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference the Digital Object Identifier below. required for completing it. Story points are commonly used as
Digital Object Identifier no. 10.1109/TSE.2018.2792473 a unit of effort measure for a user story [24]. Currently, most
0098-5589 ß 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.
638 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

agile teams heavily rely on experts’ subjective assessment

(e.g., planning poker, analogy, and expert judgment) to arrive
at an estimate. This may lead to inaccuracy and more impor-
tantly inconsistencies between estimates [25].
To facilitate research in effort estimation for agile develop-
ment, we have developed a new dataset for story point effort
estimation. This dataset contains 23,313 user stories or issues
with ground truth story points. Note that ground-truth story
points refer to the actual story points that were assigned to an
issue by the team. We collected these issues from 16 large
open source projects in 9 repositories namely Apache, App-
celerator, DuraSpace, Atlassian, Moodle, Lsstcorp, Mulesoft,
Spring, and Talendforge. To the best of our knowledge, this is
the largest dataset (in terms of number of data points) for
story point estimation where the focus is at the issue/user
story level rather than at the project level as in traditional Fig. 1. An example of an issue with estimated story points.
effort estimation datasets.
We also propose a prediction model which supports a recognize the semantic relations between “she” and “he”, and
team by recommending a story-point estimate for a given it ignore the sequential nature of text. In our approach,
user story. Our model learns from the team’s previous story features are automatically learned, thus obviating the need
point estimates to predict the size of new issues. This predic- for designing them manually.
tion system will be used in conjunction with (instead of a Although our Deep-SE is a deep model of multiple layers,
replacement for) existing estimation techniques practiced by it is recurrent and thus model parameters are shared across
the team. It can be used in an completely automated manner, layers. Hence, the number of parameters does not grow with
i.e., the team will use the story points given by the prediction the depth and consequently avoid overfitting. We also employ
system. Alternatively, it could be used as a decision support a number of common techniques such as dropout and early
system and takes part in the estimation process. This is similar stopping combat overfitting. Our approach consistently out-
to the notions of combination-based effort estimation in which performs three common baseline estimators: Random Guess-
estimates come from different sources, e.g., a combination of ing, Mean, and Median methods and six alternatives (e.g.,
expert and formal model-based estimates [10]. The key nov- using Doc2Vec and Random Forests) in Mean Absolute Error,
elty of our approach resides in the combination of two power- Median Absolute Error, and the Standardized Accuracy.
ful deep learning architectures: long short-term memory These claims have also been tested using a non-parametric
(LSTM) and recurrent highway network (RHN). LSTM allows Wilcoxon test and Vargha and Delaney’s statistic to demon-
us to model the long-term context in the textual description of strate the statistical significance and the effect size.
an issue, while RHN provides us with a deep representation The remainder of this paper is organized as follows.
of that model. We named this approach as Deep learning Section 2 provides a background of the story point estima-
model for Story point Estimation (Deep-SE). tion and deep neural networks. We then present the Deep-SE
Our Deep-SE model is a fully end-to-end system where raw model and explain how it can be trained in Sections 3 and 4
data signals (i.e., words) are passed from input nodes up to respectively. Section 5 reports on the experimental evalua-
the final output node for estimating story points, and the pre- tion of our approach. Related work is discussed in Section 6
diction errors are propagated from the output node all the before we conclude and outline future work in Section 7.
way back to the word layer. Deep-SE automatically learns
semantic features which represent the meaning of user stories
or issue reports, thus liberating the users from manually
2 BACKGROUND
designing and extracting features. Feature engineering usu- 2.1 Story Point Estimation
ally relies on domain experts who use their specific knowl- When a team estimates with story points, it assigns a point
edge of the data to create features for machine learners to value (i.e., story points) to each user story. A story point
work. For example, the performance of most of existing defect estimate reflects the relative amount of effort involved in
prediction models heavily relies on the careful designing of resolving or completing the user story: a user story that is
good features (e.g., size of code, number of dependencies, assigned two story points should take twice as much effort
cyclomatic complexity, and code churn metrics) which can as a user story assigned one story point. Many projects have
discriminate between defective and non-defective code [26]. now adopted this story point estimation approach [25]. Proj-
Coming up with good features is difficult, time-consuming, ects that use issue tracking systems (e.g., JIRA [28]) record
and requires domain-specific knowledge, and hence poses a their user stories as issues. Fig. 1 shows an example of issue
major challenge. In many situations, manually designed fea- XD-2970 in the Spring XD project [29] which is recorded in
tures normally do not generalize well: features that work well JIRA. An issue typically has a title (e.g., “Standardize XD
in a certain software project may not perform well in other logging to align with Spring Boot”) and description. Projects
projects [27]. Bag-of-Words (BoW) is a traditional technique that use JIRA Agile also record story points. For example,
to “engineer” features representing textual data like issue the issue in Fig. 1 has 8 story points.
description. However, the BoW approach has two major Story points are usually estimated by the whole team
weaknesses: it ignores the semantics of words, e.g., fails to within a project. For example, the widely-used Planning
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 639

Poker [30] method suggests that each team member pro-

vides an estimate and a consensus estimate is reached after
a few rounds of discussion and (re-)estimation. This practice
is different from traditional approaches (e.g., function
points) in several aspects. Both story points and function
points reflect an effort for resolving an issue. However,
function points can be determined by an external estimator
based on a standard set of rules (e.g., counting inputs, out- Fig. 2. An LSTM network.
puts, and inquiries) that can be applied consistently by
any trained practitioner. On the other hand, story points are information to persist and it can map a sequence into a
developed by a specific team based on the team’s cumula- sequence (see Fig. 2). Let w1 ; . . . ; wn be the input sequence
tive knowledge and biases, and thus may not be useful out- (e.g., words in a sentence) and y1 ; . . . ; yn be the sequence of
side the team (e.g., in comparing performance across corresponding labels (e.g., the next words). At time step t,
teams). Since story points represent the effort required for an LSTM unit reads the input wt , the previous hidden state
completing a user story, an estimate should cover different ht1 , and the previous memory ct1 in order to compute the
factors which can affect the effort. These factors include hidden state ht . The hidden state is used to produce an out-
how much work needed to be done, the complexity of the put at each step t. For example, the output of predicting the
work, and any uncertainty involving in the work [24]. next word k in a sentence would be a vector of probabilities
In agile development, user stories or issues are commonly across our vocabulary, i.e., softmaxðVk ht Þ where Vk is a row
viewed as the first-class entity of a project since they describe in the output parameter matrix Wout .
what has to be built in the software project, forming the basis The most important element of LSTM is a short-term
for design, implementation and testing. Story point sizes are memory cell—a vector that stores accumulated information
used for measuring a team’s progress rate, prioritizing user over time (see Fig. 3). The information stored in the memory
stories, planning and scheduling for future iterations and is refreshed at each time step through partially forgetting
releases, and even costing and allocating resources. Story old, irrelevant information and accepting fresh new input.
points are also the basis for other effort-related estimation. An LSTM unit uses the forget gate f t to control how much
For example, in our recent work [31], they are used for pre- information from the memory of previous context (i.e., ct1 )
dicting delivery capability for an ongoing iteration. Specifi- should be removed from the memory cell. The forget gate
cally, we predict the amount of work delivered at the end of looks at the the previous output state ht1 and the current
an iteration, relative to the amount of work which the team word wt , and outputs a number between 0 and 1. A value of
has originally committed to. The amount of work done in an 1 indicates that all the past memory is preserved, while a
iteration is then quantified in terms of story points from the value of 0 means “completely forget everything”. The next
issues completed within that iteration. To enable such a pre- step is updating the memory with new information
diction, we have taken into account both the information of an obtained from the current word wt . The input gate i t is used
iteration and user stories or issues involving in the iteration. to control which new information will be stored in the mem-
Interaction between user stories and between user stories and ory. Information stored in the memory cell will be used to
resources are captured through extracting information related produce an output ht . The output gate o t looks at the current
to the dependencies between user stories and the assignment code token wt and the previous hidden state ht1 , and deter-
of user stories to developers. mines which parts of the memory should be output.
Velocity is the sum of the story-point estimates of the This mechanism allows LSTM to effectively learn long-
issues that the team resolved during an iteration. For exam- term dependencies in text. Consider trying to predict the
ple, if the team resolves four stories each estimated at three last word in the following text extracted from the descrip-
story points, their velocity is twelve. Velocity is used for tion of issue XD-2970 in Fig. 1: “Boot uses slf4j APIs backed
planning and predicting when a software (or a release) by logback. This causes some build incompatibilities .... An addi-
should be completed. For example, if the team estimates the tional step is to replace log4j with .”. Recent information sug-
next release to include 100 story points and the team’s cur- gests that the next word is probably the name of a logging
rent velocity is 20 points per 2-week iteration, then it would library, but if we want to narrow down to a specific library,
take 5 iterations (or 10 weeks) to complete the project. we need to remember that “logback” and “log4j” are log-
Hence, it is important that the team is consistent in their ging libraries from the earlier text. There could be a big gap
story point estimates to avoid reducing the predictability in between relevant information and the point where it is
planning and managing their project. A machine learner needed, but LSTM is capable to learn to connect the infor-
can help the team maintain this consistency, especially in mation. In fact, LSTM has demonstrated ground-breaking
coping with increasingly large numbers of issues. It does results in many applications such as language models [35],
so by learning insight from past issues and estimations to speech recognition [36] and video analysis [37].
make future estimations. The reading of the new input, writing of the output, and
the forgetting (i.e., all those gates) are all learnable. As an
2.2 Long Short Term Memory recurrent network, LSTM network shares the same parame-
Long Short-Term Memory [32], [33] is a special variant of ters across all steps since the same task is performed at each
recurrent neural networks [34]. While a feedforward neural step, just with different inputs. Thus, comparing to tradi-
network maps an input vector into an output vector, tional feedforward networks, using an LSTM network sig-
an LSTM network uses a loop in a network that allows nificantly reduces the total number of parameters which we
640 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

Fig. 3. The internal structure of an LSTM unit.

need to learn. An LSTM model is trained using many input

sequences with known actual output sequences. Learning is
done by minimizing the error between the actual output
and the predicted output by adjusting the model parame-
ters. Learning involves computing the gradient of LðuuÞ
during the backpropagation phase, and parameters are Fig. 4. Deep learning model for Story point Estimation (Deep-SE). The
input layer (bottom) is a sequence of words (represented as filled
updated using a stochastic gradient descent. It means that circles). Words are first embedded into a continuous space, then fed into
parameters are updated after seeing only a small random the LSTM layer. The LSTM outputs a sequence of state vectors, which
subset of sequences. We refer the readers to the seminal are then pooled to form a document-level vector. This global vector is
paper [32] for more details about LSTM. then fed into a Recurrent Highway Net for multiple transformations (See
Eq. (1) for detail). Finally, a regressor predicts an outcome (story-point).

3 APPROACH These word vectors then serve as an input sequence to the

Our overall research goal is to build a prediction system that Long Short-Term Memory layer which computes a vector
takes as input the title and description of an issue and pro- representation for the whole document.
duces a story-point estimate for the issue. Title and descrip- After that, the document vector is fed into the Recurrent
tion are required information for any issue tracking system. Highway Network, which transforms the document vector
Hence, our prediction system is applicable to a wide range multiple times, before outputting a final vector which repre-
of issue tracking systems, and can be used at any time, even sents the text. The vector serves as input for the regressor which
when an issue is created. predicts the output story-point. While many existing regres-
We combine the title and description of an issue report sors can be employed, we are mainly interested in regressors
into a single text document where the title is followed by that are differentiable with respect to the training signals and
the description. Our approach computes vector representa- the input vector. In our implementation, we use the simple
tions for these documents. These representations are then linear regression that outputs the story-point estimate.
used as features to predict the story points of each issue. It Our entire system is trainable from end-to-end: (a) data
is important to note that these features are automatically signals are passed from the words in issue reports to the
learned from raw text, hence removing us from manually final output node; and (b) the prediction error is propagated
engineering the features. from the output node all the way back to the word layer.
Fig. 4 shows the Deep learning model for Story point
Estimation (Deep-SE) that we have designed for the story 3.1 Word Embedding
point prediction system. It is composed of four components We represent each word as a low dimensional, continuous
arranged sequentially: (i) word embedding, (ii) document and real-valued vector, also known as word embedding. Here
representation using Long Short-Term Memory [32], (iii) we maintain a look-up table, which is a word embedding
deep representation using Recurrent Highway Net (RHWN) matrix M 2 RdjV j where d is the dimension of word vector
[38]; and (iv) differentiable regression. Given a document and jV j is vocabulary size. These word vectors are pre-
which consists of a sequence of words s ¼ ðw1 ; w2 ; . . . ; wn Þ, trained from corpora of issue reports, which will be
e.g., the word sequence (Standardize, XD, logging, to, align, described in details in Section 4.1.
with, ....) in the title and description of issue XD-2970 in Fig. 1.
We model a document’s semantics based on the principle 3.2 Document Representation Using LSTM
of compositionality: the meaning of a document is deter- Since an issue document consists of a sequence of words,
mined by the meanings of its constituents (e.g., words) and we model the document by accumulating information
the rules used to combine them (e.g., one word followed by from the start to the end of the sequence. A powerful accu-
another). Hence, our approach models document represen- mulator is a Recurrent Neural Network (RNN) [34], which
tation in two stages. It first converts each word in a docu- can be seen as multiple copies of the same single-hidden-
ment into a fixed-length vector (i.e., word embedding). layer network, each passing information to a successor.
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 641

many non-linear layers [50]. A Highway Net is a special

type of feedforward neural networks with a modification to
the transformation taking place at a hidden unit to let infor-
mation from lower layers pass linearly through. Specifically,
the hidden state at layer l is defined as

h lþ1 ¼ al hl þ ð1 a l Þ s l ðh l Þ; (1)

where s l is a non-linear transform (e.g., a logistic or a tanh)

and al ¼ logitðh hl Þ is a linear logistic transform of h l . Here a l
plays the role of a highway gatethat lets information passing
Fig. 5. An example of how a vector representation is obtained for issue from layer l to layer l þ 1 without loss of information. For
reports.
example, al ! 1 enables simple copying.
We need to learn a mapping from the raw words in an
Thus, recurrent networks allow information to be accumu- issue description to the story points. A deep feedforward neu-
lated. While RNNs are theoretically powerful, they are dif- ral network like Highway Net effectively breaks the mapping
ficult to train for long sequences [34], which are often seen into a series of nested simple mappings, each described by a
in issue reports (e.g., see the description of issue XD-2970 different layer of the network. The first layer provides a
in Fig. 1). Hence, our approach employs Long Short-Term (rough) estimate, and subsequent layers iteratively refine that
Memory, a special variant of RNN (see Section 2 for more estimate. As the number of layers increase, further refinement
details of how LSTM work). can be achieved. Comparing to traditional feedforward net-
After the vector output state has been computed for works, the special gating scheme in Highway Net is highly
every word in the input sequence, the next step is aggregat- effective in letting the information and the gradients to pass
ing those vectors into a single vector representing the whole through while stacking many non-linear functions. In fact,
document (see Fig. 5). The aggregation operation is known earlier work has demonstrated that Highway Net can have
as pooling. There are multiple ways to perform pooling, but up to a thousand layers [50], while traditional deep neural
the main requirement is that pooling must be length invari- nets cannot go beyond several layers [51].
ant. In other words, pooling is not sensitive to variable We have also modified the standard Highway Network
length of the document. For example, the simplest statistical by sharing parameters between layers, i.e., all the hidden
pooling method is mean-pooling where we take the sum of layers having the same hidden units. In other words, all the
the state vectors and divide it by the number of vectors. hidden layers to have the same hidden units. This is similar
Other pooling methods are such as max pooling (e.g., to the notion of a recurrent network, and thus we called it a
choose the maximum value in each dimension), min pool- Recurrent Highway Network. Our previous work [38] has
ing and sum pooling. From our experience in other settings, demonstrated the effectiveness of this approach in pattern
a simple but often effective pooling method is averaging, recognition. This key novelty allows us to create a very com-
which we also employed here [39]. pact version of Recurrent Highway Network with only one
set of parameters in a l and s l . This clearly produces a great
3.3 Deep Representation Using Recurrent advantage of avoiding overfitting. We note that the number
Highway Network of layers here refers to the number of hidden layers of a
Given that vector representation of an issue report has been Recurrent Highway Network, not the number of LSTM
extracted by the LSTM layer, we can use a differentiable layers. The number of LSTM layers is the same as the num-
regressor for immediate prediction. However, this may be ber of words in an issue’s description.
sub-optimal since the network is rather shallow. Deep neu-
ral networks have become a popular method with many 3.4 Regression
ground-breaking successes in vision [40], speech recogni- At the top-layer of Deep-SE, we employ linear activation
tion [41] and NLP [42], [43]. Deep nets represent complex function in a feedforward neural network as the final
data more efficiently than shallow ones [44]. Deep models regressor (see Fig. 4) to produce a story-point estimate. This
can be expressive while staying compact, as theoretically function can be defined as follows:
analysed by recent work [45], [46], [47], [48], [49]. This have X
n
been empirically validated in recent record-breaking results y ¼ b0 þ b i xi ; (2)
in vision, speech recognition and machine translation. i¼1
However, learning standard feedforward networks with
where y is the output story point, xi is an input signal from
many hidden layers is notoriously difficult due to two main
RHWN layer, bi is trained coefficient (weight), and n is the
problems: (i) the number of parameters grows with the size of embedding dimension.
number of layers, leading to overfitting; and (ii) stacking
many non-linear functions makes it difficult for the informa-
tion and the gradients to pass through.
4 MODEL TRAINING
To address these problems, we designed a deep repre- 4.1 Pre-Training
sentation that performs multiple non-linear transformations Pre-training is a way to come up with a good parameter ini-
using the idea from Highway Networks. Highway Nets are tialization without using the labels (i.e., ground-truth story
the latest idea that enables efficient learning through those points). We pre-train the lower layers of Deep-SE (i.e.,
642 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

embedding and LSTM), which operate at the word level. Since every component in the model is differentiable, we
Pre-training is effective when the labels are not abundant. use the popular stochastic gradient descent to perform opti-
During pre-training, we do not use the ground-truth story mization: through backpropagation, the model parameters u
points, but instead leverage two sources of information: the are updated in the opposite direction of the gradient of the
strong predictiveness of natural language, and availability loss function LðuÞ. In this search, a learning rate h is used to
of free texts without labels (e.g., issue reports without story control how large of a step we take to reach a (local) mini-
points). The first source comes from the property of lan- mum. We use RMSprop, an adaptive stochastic gradient
guages that the next word can be predicted using previous method (unpublished note by Geoffrey Hinton), which is
words, thanks to grammars and common expressions. known to work best for recurrent models. We tuned RMSprop
Thus, at each time step t, we can predict the next word wtþ1 by partitioning the data into mutually exclusive training, vali-
using the state h t , using the softmax function dation, and test sets and running the training multiple times.
expðUk h t Þ Specifically, the training set is used to learn a useful model.
P ðwtþ1 ¼ k j w1:t Þ ¼ P ; (3) After each training epoch, the learned model was evaluated
k0 expðUk0 h t Þ
on the validation set and its performance was used to assess
where Uk is a free parameter. Essentially we are building a against hyperparameters (e.g., learning rate in gradient
language model, i.e., P ðsÞ ¼ P ðw1:n QÞ, which can be factor-
searches). Note that the validation set was not used to learn
ized using the chain-rule as: P ðw1 Þ nt¼2 P ðwtþ1 j w 1:t Þ. any of the model’s parameters. The best performing model in
We note that the probability of the first word P ðw1 Þ in a the validation set was chosen to be evaluated on the test set.
sequence is the number of sequences in the corpus which We also employed the early stopping strategy (see Section 5.4),
has that word w1 starting first. At step t, ht is computed by i.e., monitoring the model’s performance during the valida-
feeding ht1 and wt to the LSTM unit (see Fig. 2). Since wt is tion phase and stopping when the performance got worse. If
a word embedding vector, Eq. (3) indirectly refers to the the log-loss does not improve for ten consecutive runs, we
embedding matrix . than terminate the training.
The language model can be learned by optimizing the log- To prevent overfitting in our neural network, we have
loss log P ðsÞ. However, the main bottleneck is computa- implemented an effective solution called dropout in our
tional: Eq. (3) costs jV j time to evaluate where jV j is the vocab- model [54], where the elements of input and output states
ulary size, which can be hundreds of thousands for a big are randomly set to zeros during training. During testing,
corpus. For that reason, we implemented an approximate but parameter averaging is used. In effect, dropout implicitly
very fast alternative based on Noise-Contrastive Estimation trains many models in parallel, and all of them share the
[52], which reduces the time to M jV j, where M can be as same parameter set. The final model parameters represent
small as 100. We also run the pre-training multiple times the average of the parameters across these models. Typi-
against a validation set to choose the best model. We use cally, the dropout rate is set at 0.5.
perplexity, a common intrinsic evaluation metric based on the An important step prior to optimization is parameter ini-
log-loss, as a criterion for choosing the best model and early tialization. Typically the parameters are initialized ran-
stopping. A smaller perplexity implies a better language domly, but our experience shows that a good initialization
model. The word embedding matrix M 2 RdjV j (which is (through pre-training of embedding and LSTM layers) helps
first randomly initialized) and the initialization for LSTM learning converge faster to good solutions.
parameters are learned through this pre-training process.
5 EVALUATION
4.2 Training Deep-SE
The empirical evaluation we carried out aimed to answer
We have implemented the Deep-SE model in Python using the following research questions:
Theano [53]. To simplify our model, we set the size of the
memory cell in an LSTM unit and the size of a recurrent RQ1. Sanity check: Is the proposed approach suitable for
layer in RHWN to be the same as the embedding size. We estimating story points?
tuned some important hyper-parameters (e.g., embedding This sanity check requires us to compare our
size and the number of hidden layers) by conducting experi- Deep-SE prediction model with the three common
ments with different values, while for some other hyper- baseline benchmarks used in the context of effort
parameters, we used the default values. This will be dis- estimation: Random Guessing, Mean Effort, and
cussed in more details in the evaluation section. Median Effort. Random guessing is a naive bench-
Recall that the entire network can be reduced to a param- mark used to assess if an estimation model is useful
eterized function defined, which maps sequences of raw [55]. Random guessing performs random sampling
words (in issue reports) to story points. Let u be the set of all (with equal probability) over the set of issues with
parameters in the model. We define a loss function LðuÞ that known story points, chooses randomly one issue
measures the quality of a particular set of parameters based from the sample, and uses the story point value of
on the difference between the predicted story points and that issue as the estimate of the target issue. Random
the ground truth story points in the training data. A setting guessing does not use any information associated
of the parameters u that produces a prediction for an issue with the target issue. Thus any useful estimation
in the training data consistent with its ground truth story model should outperform random guessing. Mean
points would have a very low loss L. Hence, learning is and Median Effort estimations are commonly used
achieved through the optimization process of finding the as baseline benchmarks for effort estimation [19].
set of parameters u that minimizes the loss function. They use the mean or median story points of the past
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 643

issues to estimate the story points of the target issue. “Java” than “logging”. To address this issue, Doc2-
Note that the samples used for all the naive baselines vec [59] (i.e., alternatively known as paragraph2vec)
(i.e., Random Guessing, Mean Effort, and Median is an unsupervised algorithm that learns fixed-length
Effort) were from the training set. feature representations from texts (e.g., title and
RQ2. Benefits of deep representation: Does the description of issues). Each document is represented
use of Recurrent Highway Nets provide more accurate in a dense vector which is trained to predict next
story point estimates than using a traditional regression words in the document.
technique? Both BoW and Doc2vec representations however
To answer this question, we replaced the Recurrent effectively destroys the sequential nature of text.
Highway Net component with a regressor for imm- This question aims to explore whether LSTM with
ediate prediction. Here, we compare our approach its capability of modeling this sequential structure
against four common regressors: Random Forests would improve the story point estimation. To answer
(RF), Support Vector Machine (SVM), Automatically this question, we feed three different feature vectors:
Transformed Linear Model (ATLM), and Linear one learned by LSTM and the other two derived from
Regression (LR). We choose RF over other baselines BoW technique and Doc2vec to the same Random
since ensemble methods like RF, which combine the Forrests regressor, and compare the predictive per-
estimates from multiple estimators, are an effective formance of the former (i.e., LSTM+RF) against
method for effort estimation [20]. RF achieves a sig- that of the latter (i.e., BoW+RF and Doc2vec+RF).
nificant improvement over the decision tree approach We used Gensim,1 a well-known implementation for
by generating many classification and regression Doc2vec in our experiments.
trees, each of which is built on a random resampling RQ4. Cross-project estimation. Is the proposed appr-
of the data, with a random subset of variables at each oach suitable for cross-project estimation?
node split. Tree predictions are then aggregated Story point estimation in new projects is often dif-
through averaging. We used the issues in the valida- ficult due to lack of training data. One common tech-
tion set to fine-tune parameters (i.e., the number of nique to address this issue is training a model using
tress, the maximum depth of the tree, and The mini- data from a (source) project and applying it to the
mum number of samples). For SVM, it has been new (target) project. Since our approach requires
widely use in software analytics (e.g., defect predic- only the title and description of issues in the source
tion) and document classification (e.g., sentiment and target projects, it is readily applicable to both
analysis) [56]. SVM is known as Support Vector within-project estimation and cross-project estimation.
Regression (SVR) for regression problems. We also In practice, story point estimation is however known
used the issues in the validation set to find the kernel to be specific to teams and projects. Hence, this
type (e.g., linear, polynomial) for testing. We used question aims to investigate whether our approach is
the Automatically Transformed Linear Model [57] suitable for cross-project estimation. We have imple-
recently proposed as the baseline model for software mented Analogy-based estimation called ABE0, which
effort estimation. Although ATLM is simple and were proposed in previous work [60], [61], [62], [63]
requires no parameter tuning, it performs well over a for cross-project estimation, and used it as a bench-
range of various project types in the traditional effort mark. The ABE0 estimation bases on the distances
estimation [57]. Since LR is the top layer of our between individual issues. Specifically, the story point
approach, we also used LR as the immediate regres- of issues in the target project is the mean of story
sor after LSTM layers to assess whether RHWN points of k-nearest issues from the source project. We
improves the predictive performance. We then com- used the euclidean distance as a distance measure,
pare the performance of these alternatives, namely Bag-of-Words of the title and the description as the
LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM features of an issue, and k = 3.
+LR against our Deep-SE model. RQ5. Normalizing/adjusting story points. Does our
RQ3. Benefits of LSTM document representation: approach still perform well with normalized/adjusted
Does the use of LSTM for modeling issue reports provide story points?
more accurate results than the traditional Doc2Vec and We have ran our experiments again using the new
Bag-of-Words approach? labels (i.e., the normalized story points) for address-
The most popular text representation is Bag-of- ing the concern that whether our approach still per-
Words [58], where a text is represented as a vector of forms well on those adjusted ground-truths. We
word counts. For example, the title and description adjusted the story points of each issue using a range
of issue XD-2970 in Fig. 1 would be converted into a of information, including the number of days from
sparse binary vector of vocabulary size, whose ele- creation to resolved time, the development time, the
ments are mostly zeros, except for those at the posi- number of comments, the number of users who com-
tions designated to “standardize”, “XD”, “logging” mented on the issue, the number of times that an
and so on. However, BoW has two major weak- issue had their attributes changed, the number of
nesses: they lose the sequence of the words and they users who changed the issue’s attributes, the number
also ignore semantics of the words. For example, of issue links, the number of affect versions, and the
“Python”, “Java”, and “logging ” are equally distant,
while semantically “Python” should be closer to 1. https://radimrehurek.com/gensim/models/doc2vec.html
644 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

TABLE 1
Descriptive Statistics of Our Story Point Dataset

Repo. Project Abb. # issues min SP max SP mean SP median SP mode SP var SP std SP mean TD length LOC
Apache Mesos ME 1,680 1 40 3.09 3 3 5.87 2.42 181.12 247,542+
Usergrid UG 482 1 8 2.85 3 3 1.97 1.40 108.60 639,110+
Appcelerator Appcelerator Studio AS 2,919 1 40 5.64 5 5 11.07 3.33 124.61 2,941,856#
Aptana Studio AP 829 1 40 8.02 8 8 35.46 5.95 124.61 6,536,521+
Titanium SDK/CLI TI 2,251 1 34 6.32 5 5 25.97 5.10 205.90 882,986+
DuraSpace DuraCloud DC 666 1 16 2.13 1 1 4.12 2.03 70.91 88,978+
Atlassian Bamboo BB 521 1 20 2.42 2 1 4.60 2.14 133.28 6,230,465#
Clover CV 384 1 40 4.59 2 1 42.95 6.55 124.48 890,020#
JIRA Software JI 352 1 20 4.43 3 5 12.35 3.51 114.57 7,070,022#
Moodle Moodle MD 1,166 1 100 15.54 8 5 468.53 21.65 88.86 2,976,645+
Lsstcorp Data Management DM 4,667 1 100 9.57 4 1 275.71 16.61 69.41 125,651*
Mulesoft Mule MU 889 1 21 5.08 5 5 12.24 3.50 81.16 589,212+
Mule Studio MS 732 1 34 6.40 5 5 29.01 5.39 70.99 16,140,452#
Spring Spring XD XD 3,526 1 40 3.70 3 1 10.42 3.23 78.47 107,916+
Talendforge Talend Data Quality TD 1,381 1 40 5.92 5 8 26.96 5.19 104.86 1,753,463#
Talend ESB TE 868 1 13 2.16 2 1 2.24 1.50 128.97 18,571,052#
Total 23,313

SP: story points, TD length: the number of words in the title and description of an issue, LOC: line of code
(+: LOC obtained from www.openhub.net, *: LOC from GitHub, and #: LOC from the reverse engineering)

number of fix versions. These information reflect the Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsst-
actual effort and we thus refer to them as effort indi- corp, MuleSoft, Spring, and Talendforge. We then used the
cators. The values of these indicators were extracted Representational State Transfer (REST) API provided by JIRA
after the issue was completed. The normalized story to query and collected those issue reports. We collected all the
point (SPnormalized ) is then computed as the following: issues which were assigned a story point measure from
the nine open source repositories up until August 8, 2016.
SPnormalized ¼ ð0:5ÞSPoriginal þ ð0:5ÞSPnearest ; We then extracted the story point, title and description from
where SPorginal is the original story point, and the collected issue reports. Each repository contains a number
SPnearest is the mean of story points from 10 nearest of projects, and we chose to include in our dataset only proj-
issues based on their actual effort indicators. Note ects that had more than 300 issues with story points. Issues
that we use K-Nearest Neighbour (KNN) to find the that were assigned a story point of zero (e.g., a non-reproduc-
nearest issues and the euclidean metric to measure ible bug), as well as issues with a negative, or unrealistically
the distance. We ran the experiment on the new large story point (e.g., greater than 100) were filtered out. Ulti-
labels (i.e., SPnormalized ) using our proposed approach mately, about 2.66 percent of the collected issues were filtered
against all other baseline benchmark methods. out in this fashion. In total, our dataset has 23,313 issues with
RQ6. Compare against the existing approach. How story points from 16 different projects: Apache Mesos (ME),
does our approach perform against existing approaches in Apache Usergrid (UG), Appcelerator Studio (AS), Aptana
story point estimation? Studio (AP), Titanum SDK/CLI (TI), DuraCloud (DC), Bam-
Recently, Porru et al. [64] also proposed an esti- boo (BB), Clover (CV), JIRA Software (JI), Moodle (MD), Data
mation model for story points. Their approach uses Management (DM), Mule (MU), Mule Studio (MS), Spring
the type of an issue, the component(s) assigned to it, XD (XD), Talend Data Quality (TD), and Talend ESB (TE).
and the TF-IDF derived from its summary and Table 1 summarizes the descriptive statistics of all the projects
description as features representing the issue. They in terms of the minimum, maximum, mean, median, mode,
also performed univariate feature selection to variance, and standard deviations of story points assigned
choose a subset of features for building a classifier. used and the average length of the title and description of
By contrast, our approach automatically learns issues in each project. These sixteen projects bring diversity to
semantic features which represent the actual mean- our dataset in terms of both application domains and project’s
ing of the issue’s report, thus potentially providing characteristics. Specifically, they are different in the following
more accurate estimates. To answer this research aspects: number of observation (from 352 to 4,667 issues),
question, we ran Deep-SE on the dataset used in technical characteristics (different programming languages
Porru et. al, re-implemented their approach, and and different application domains), sizes (from 88 KLOC to
performed a comparison on the results produced by 18 millions LOC), and team characteristics (different team
the two approaches. structures and participants from different regions).
Since story points rate the relative effort of work between
5.1 Story Point Datasets user stories, they are usually measured on a certain scale
To collect data for our dataset, we looked for issues that were (e.g., 1, 2, 4, 8, etc.) to facilitate comparison (e.g., a user story
estimated with story points. JIRA is one of the few widely- is double the effort of the other) [25]. The story points used
used issue tracking systems that support agile development in planning poker typically follow a Fibonacci scale, i.e., 1,
(and thus story point estimation) with its JIRA Agile plugin. 2, 3, 5, 8, 13, 21, and so on [24]. Among the projects we stud-
Hence, we selected a diverse collection of nine major open ied, only seven of them (i.e., Usergrid, Talend ESB, Talend
source repositories that use the JIRA issue tracking system: Data Quality, Mule Studio, Mule, Appcelerator Studio, and
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 645

Aptana Studio followed the Fibonacci scale, while the other SA is based on MAE and it is defined as
nine projects did not use any scale. When our prediction
system gives an estimate, we did not round it to the nearest MAE
SA ¼ 1 100;
story point value on the Fibonacci scale. An alternative MAErguess
approach (for those project which follow a Fibonacci scale)
is treating this as a classification problem: each value on the where MAErguess is the MAE of a large number (e.g.,
Fibonacci scale represents a class. The limitations of this 1,000 runs) of random guesses. SA measures the compari-
approach is that the number of classes must be pre-deter- son against random guessing. Predictive performance can
mined and that it is not applicable to projects that do not fol- be improved by decreasing MAE or increasing SA.
low this scale. We however note that the Fibonacci scale is We assess the story point estimates produced by the esti-
only a guidance for estimating story points. In practice, mation models using MAE, MdAE and SA. To compare the
teams may follow other common scales, define their own performance of two estimation models, we tested the statisti-
scales or may not follow any scale at all. Our approach does cal significance of the absolute errors achieved with the two
not rely on these specific scales, thus making it applicable to models using the Wilcoxon Signed Rank Test [71]. The Wil-
a wider range of projects. It predicts a scalar value (regres- coxon test is a safe test since it makes no assumptions about
sion) rather than a class (classification). underlying data distributions. The null hypothesis here is:
“the absolute errors provided by an estimation model are not
5.2 Experimental Setting different to those provided by another estimation model”. We
We performed experiments on the sixteen projects in our set the confidence limit at 0.05 and also applied Bonferroni
dataset—see Table 1 for their details. To mimic a real correction [72] (0.05/K, where K is the number of statistical
deployment scenario that prediction on a current issue is tests) when multiple testing were performed.
made by using knowledge from estimations of the past In addition, we also employed a non-parametric effect
issues, the issues in each project were split into training set size measure, the correlated samples case of the Vargha and
Delaney’s A ^XY statistic [73] to assess whether the effect size
(60 percent of the issues), development/validation set (i.e.,
is interesting. The A^XY measure is chosen since it is agnostic
20 percent), and test set (i.e., 20 percent) based on their crea-
tion time. The issues in the training set and the validation to the underlying distribution of the data, and is suitable for
set were created before the issues in the test set, and the assessing randomized algorithms in software engineering
issues in the training set were also created before the issues generally [74] and effort estimation in particular [19]. Specif-
in the validation set. ically, given a performance measure (e.g., the Absolute
Error from each estimation in our case), the A ^XY measures
5.3 Performance Measures the probability that estimation model X achieves better
There are a range of measures used in evaluating the accu- results (with respect to the performance measure) than esti-
racy of an effort estimation model. Most of them are based mation model Y . We note that this falls into the correlated
on the Absolute Error, (i.e., jActualSP EstimatedSP j). samples case of the Vargha and Delaney [73] where the
where AcutalSP is the real story points assigned to an issue Absolute Error is derived by applying different estimation
and EstimatedSP is the outcome given by an estimation methods on the same data (i.e., same issues). We thus use
model. Mean of Magnitude of Relative Error (MRE) or the following formula to calculate the stochastic superiority
Mean Percentage Error and Prediction at level l [65], i.e., value between two estimation methods
Pred(l), have also been used in effort estimation. However,
a number of studies [66], [67], [68], [69] have found that ^XY ¼ ½#ðX < Y Þ þ ð0:5 #ðX ¼ Y ÞÞ ;
A
those measures bias towards underestimation and are not n
stable when comparing effort estimation models. Thus, the where #ðX < Y Þ is the number of issues that the Absolute
Mean Absolute Error (MAE), Median Absolute Error (MdAE), Error from X less than Y , #ðX ¼ Y Þ is the number of issues
and the Standardized Accuracy (SA) have recently been rec- that the Absolute Error from X equal to Y , and n is the num-
ommended to compare the performance of effort estimation ber of issues. We also compute the average of the stochastic
models [19], [70]. MAE is defined as superiority measures (Aiu ) of our approach against each of
the others using the following formular:
1X N
P
MAE ¼ jActualSPi EstimatedSPi j; k6¼i Aik
N i¼1 Aiu ¼ ;
l1
where N is the number of issues used for evaluating the where Aik is the pairwise stochastic superiority values
performance (i.e., test set), ActualSPi is the actual story ^XY ) for all ði; kÞ pairs of estimation methods, k ¼ 1; . . . ; l,
(A
point, and EstimatedSPi is the estimated story point, for the and l is a number of estimation methods, e.g., variable i
issue i. refers to Deep-SE and l ¼ 4 when comparing Deep-SE
We also report the Median Absolute Error since it is more against Random, Mean and Median methods.
robust to large outliers. MdAE is defined as
5.4 Hyper-Parameter Settings for Training a
MdAE ¼ MedianfjActualSPi EstimatedSPi jg; Deep-SE Model
We focused on tuning two important hyper-parameters: the
where 1 i N. number of word embedding dimensions and the number of
646 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

Fig. 6. Story point estimation performance with different parameter.

hidden layers in the recurrent highway net component

of our model. To do so, we fixed one parameter and varied
the other to observe the MAE performance. We chose to test
with four different embedding sizes: 10, 50, 100, and 200,
and twelve variations of the number of hidden layers from
2 to 200. The embedding size is the number of dimensions
of the vector which represents a word. This word embed-
ding is a low dimensional vector representation of words in
the vocabulary. This tuning was done using the validation Fig. 7. Top-500 word clusters used in the Apache’s issue reports.
set. Fig. 6 shows the results from experimenting with
Apache Mesos. As can be seen, the setting where the num- we used 50,000 issues for pre-training, we had approxi-
ber of embeddings is 50 and the number of hidden layers is mately 5 million data points per repository for pre-training.
10 gives the lowest MAE, and thus was chosen.
For both pre-training we trained with 100 runs and the 5.6 The Correlation Between the Story Points and
batch size is 50. The initial learning rate in pre-training was the Development Time
set to 0.02, adaptation rate was 0.99, and smoothing factor Identifying the actual effort required for completing an issue
was 107 . For the main Deep-SE model we used 1,000 is very challenging (especially in open source projects) since
epoches and the batch size wass set to 100. The initial learn- in most cases the actual effort was not tracked and recorded.
ing rate in the main model was set to 0.01, adaptation rate We were however able to extract the development time
was 0.9, and smoothing factor was 106 . Dropout rates for which was the duration between when the issue’s status
the RHWN and LSTM layers were set to 0.5 and 0.2 respec- was set to “in-progress” and when it was set to “resolved”.
tively. The maximum sequence length used by the LSTM is Thus, we have explicitly excluded the waiting time for being
100 words, which is the average length of issue description. assigned to a developer or being put on hold. The develop-
ment time is the closest to the actual effort of completing the
5.5 Pre-Training issue that we were able to extract from the data. We then
In most repositories, we used around 50,000 issues without performed two widely-used statistical tests (Spearman’s
story points (i.e., without labels) for pre-training, except rank and Pearson rank correlation) [79] for all the issues in
the Mulesoft repository which has much smaller number our dataset. Table 2 shows the Spearman’s rank and Pearson
of issues (only 8,036 issues) available for pre-training. Fig. 7 rank correlation coefficient and p-value for all projects. We
show the top-500 frequent words used in Apache. They are have found that there is a significantly (p < 0:05) positive
divided into 9 clusters (using K-means clustering) based on correlation between the story points and the development
their embedding which was learned through the pre-train- time across all 16 project we studied. In some projects (e.g.,
ing process. We used t-distributed stochastic neighbor Moodle) there was a strong correlation with the coefficients
embedding (t-SNE) [75] to display high-dimensional vectors was around 0.8. This positive correlation demonstrates that
in two dimensions. the higher story point, the longer development time, which
We show here some representative words from some suggests that a correlation between an issue’s story points
clusters for a brief illustration. Words that are semantically and its actual effort.
related are grouped in the same cluster. For example, words
related to networking like soap, configuration, tcp, and load 5.7 Results
are in one cluster. This indicates that to some extent, the We report here the results in answering research questions
learned vectors effectively capture the semantic relations RQs 1–6.
between words, which is useful for the story-point estima- RQ1: Sanity Check
tion task we do later. Table 3 shows the results achieved from Deep-SE,
The pre-training step is known to effectively deal with and two baseline methods: Mean and Median method
limited labelled data [76], [77], [78]. Here, pre-training does (See Appendix A.1, which can be found on the Computer
not require story-point labels since it is trained by predict- Society Digital Library at http://doi.ieeecomputersociety.
ing the next words. Hence the number of data points equals org/10.1109/TSE.2018.2792473, for the distribution of the
to the number of words. Since for each project repository Absolute Error). The analysis of MAE, MdAE, and SA
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 647

TABLE 2 TABLE 4
The Coefficient and p-Value of the Spearman’s Rank and Comparison on the Effort Estimation Benchmarks
Pearson Rank Correlation on the Story Points ^XY Effect Size (in Brackets)
Using Wilcoxon Test and A
Against the Development Time
Deep-SE vs Mean Median Random Aiu
Spearman’s rank Pearson correlation ME <0.001 [0.77] <0.001 [0.81] <0.001 [0.90] 0.83
UG <0.001 [0.79] <0.001 [0.79] <0.001 [0.81] 0.80
Project coefficient p-value coefficient p-value AS <0.001 [0.78] <0.001 [0.78] <0.001 [0.91] 0.82
Appcelerator Studio 0.330 <0.001 0.311 <0.001 AP 0.040 [0.69] <0.001 [0.79] <0.001 [0.84] 0.77
Aptana Studio 0.241 <0.001 0.325 <0.001 TI <0.001 [0.77] <0.001 [0.72] <0.001 [0.88] 0.79
DC <0.001 [0.80] 0.415 [0.54] <0.001 [0.81] 0.72
Bamboo 0.505 <0.001 0.476 <0.001 BB <0.001 [0.78] <0.001 [0.78] <0.001 [0.85] 0.80
Clover 0.551 <0.001 0.418 <0.001 CV <0.001 [0.75] <0.001 [0.70] <0.001 [0.91] 0.79
Data Management 0.753 <0.001 0.769 <0.001 JI <0.001 [0.76] <0.001 [0.79] <0.001 [0.79] 0.78
DuraCloud 0.225 <0.001 0.393 <0.001 MD <0.001 [0.81] <0.001 [0.75] <0.001 [0.80] 0.79
JIRA Software 0.512 <0.001 0.560 <0.001 DM <0.001 [0.69] <0.001 [0.59] <0.001 [0.75] 0.68
MU 0.003 [0.73] <0.001 [0.73] <0.001 [0.82] 0.76
Mesos 0.615 <0.001 0.766 <0.001 MS 0.799 [0.56] 0.842 [0.56] <0.001 [0.69] 0.60
Moodle 0.791 <0.001 0.816 <0.001 XD <0.001 [0.70] <0.001 [0.70] <0.001 [0.78] 0.73
Mule 0.711 <0.001 0.722 <0.001 TD <0.001 [0.86] <0.001 [0.85] <0.001 [0.87] 0.86
Mule Studio 0.630 <0.001 0.565 <0.001 TE <0.001 [0.73] <0.001 [0.73] <0.001 [0.92] 0.79
Spring XD 0.486 <0.001 0.614 <0.001
Talend Data Quality 0.390 <0.001 0.370 <0.001
Talend ESB 0.504 <0.001 0.524 <0.001 by Deep-SE over the Mean and Median method is 34.06 and
Titanium SDK/CLI 0.322 <0.001 0.305 <0.001 26.77 percent in terms of MAE, averaging across all projects.
Usergrid 0.212 0.005 0.263 0.001 We note that the results achieved by the estimation
models vary between different projects. For example, our
Deep-SE achieved 0.64 MAE in the Talend ESB project (TE),
suggests that the estimations obtained with our approach, while it achieved 5.97 MAE in Moodle (MD) project. The
Deep-SE, are better than those achieved by using Mean, distribution of story points may be the cause of this varia-
Median, and Random estimates. Deep-SE consistently out- tion: the standard deviation of story points in TE is only
performs all these three baselines in all sixteen projects. 1.50, while that in MD is 21.65 (see Table 1).
Our approach improved between 3.29 percent (in project Table 4 shows the results of the Wilcoxon test (together
MS) to 57.71 percent (in project BB) in terms of MAE, 11.71 per- with the corresponding A ^XY effect size) to measure the sta-
cent (in MU) to 73.86 percent (in CV) in terms of MdAE, and tistical significance and effect size (in brackets) of the
20.83 percent (in MS) to 449.02 percent (in MD) in terms of SA improved accuracy achieved by Deep-SE over the baselines:
over the Mean method. The improvements of our approach Mean Effort, Median Effort, and Random Guessing. In
over the Median method are between 2.12 percent (in MS) 45/48 cases, our Deep-SE significantly outperforms the
to 52.90 percent (in JI) in MAE, 0.50 percent (in MS) to baselines after applying Bonferroni correction with effect
63.50 percent (in ME) in MdAE, and 2.70 percent (in DC) to sizes greater than 0.5. Moreover, the average of the stochas-
328.82 percent (in JI) in SA. Overall, the improvement achieved tic superiority (Aiu ) of our approach against the baselines is
greater than 0.7 in the most cases. The highest Aiu achieving
in the Talend Data Quality project (TD) is 0.86 which can be
TABLE 3
considered as large effect size (A^XY > 0:8).
Evaluation Results of Deep-SE, the Mean and
Median Method (the Best Results Are Highlighted in Bold) We note that the improvement brought by our approach
over the baselines was not significant for project MS.
Proj Method MAE MdAE SA Proj Method MAE MdAE SA One possible reason is that the size of the training and pre-
ME Deep-SE 1.02 0.73 59.84 JI Deep-SE 1.38 1.09 59.52 training data for MS is small, and deep learning techniques
mean 1.64 1.78 35.61 mean 2.48 2.15 27.06 tend to perform well with large training samples.
median 1.73 2.00 32.01 median 2.93 2.00 13.88
UG Deep-SE 1.03 0.80 52.66 MD Deep-SE 5.97 4.93 50.29
mean 1.48 1.23 32.13 mean 10.90 12.11 9.16 Our approach outperforms the baselines, thus passing
median 1.60 1.00 26.29 median 7.18 6.00 40.16 the sanity check required by RQ1.
AS Deep-SE 1.36 0.58 60.26 DM Deep-SE 3.77 2.22 47.87
mean 2.08 1.52 39.02 mean 5.29 4.55 26.85
median 1.84 1.00 46.17 median 4.82 3.00 33.38 RQ2: Benefits of Deep Representation
AP Deep-SE 2.71 2.52 42.58 MU Deep-SE 2.18 1.96 40.09
mean 3.15 3.46 33.30 mean 2.59 2.22 28.82 Table 5 shows MAE, MdAE, and SA achieved from Deep-
median 3.71 4.00 21.54 median 2.69 2.00 26.07 SE using Recurrent Highway Networks for deep representa-
TI Deep-SE 1.97 1.34 55.92 MS Deep-SE 3.23 1.99 17.17
mean 3.05 1.97 31.59 mean 3.34 2.68 14.21 tion of issue reports against using Random Forests, Support
median 2.47 2.00 44.65 median 3.30 2.00 15.42 Vector Machine, Automatically Transformed Linear Model,
DC Deep-SE 0.68 0.53 69.92 XD Deep-SE 1.63 1.31 46.82 and Linear Regression Model coupled with LSTM (i.e.,
mean 1.30 1.14 42.88 mean 2.27 2.53 26.00
median 0.73 1.00 68.08 median 2.07 2.00 32.55 LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR).
BB Deep-SE 0.74 0.61 71.24 TD Deep-SE 2.97 2.92 48.28 The distribution of the Absolute Error is reported in
mean 1.75 1.31 32.11 mean 4.81 5.08 16.18
median 1.32 1.00 48.72 median 3.87 4.00 32.43 Appendix A.2, available in the online supplemental mate-
CV Deep-SE 2.11 0.80 50.45 TE Deep-SE 0.64 0.59 69.67 rial. When we use MAE, MdAE, and SA as evaluation
mean 3.49 3.06 17.84 mean 1.14 0.91 45.86 criteria, Deep-SE is still the best approach, consistently
median 2.84 2.00 33.33 median 1.16 1.00 44.44
outperforming LSTM+RF, LSTM+SVM, LSTM+ATLM,
MAE and MdAE - the lower the better, SA - the higher the better. and LSTM+LR across all sixteen projects.
648 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

TABLE 5 TABLE 6
Evaluation Results of Deep-SE, LSTM+RF, LSTM+SVM, Comparison Between the Recurrent Highway Net Against
LSTM+ATLM, and LSTM+LR (the Best Results Are Random Forests, Support Vector Machine, Automatically
Highlighted in Bold) Transformed Linear Model, and Linear Regression Using
^12 Effect Size (in Brackets)
Wilcoxon Test and A
Proj Method MAE MdAE SA Proj Method MAE MdAE SA
Deep-SE vs LSTM+RF LSTM+SVM LSTM+ATLM LSTM+LR Aiu
ME Deep-SE 1.02 0.73 59.84 JI Deep-SE 1.38 1.09 59.52
lstm+rf 1.08 0.90 57.57 lstm+rf 1.71 1.27 49.71 ME <0.001 [0.57] <0.001 [0.54] <0.001 [0.59] <0.001 [0.59] 0.57
lstm+svm 1.07 0.90 58.02 lstm+svm 2.04 1.89 40.05 UG 0.004 [0.59] 0.010 [0.55] <0.001 [1.00] <0.001 [0.73] 0.72
lstm+atlm 1.08 0.95 57.60 lstm+atlm 2.10 1.95 38.26 AS <0.001 [0.69] <0.001 [0.51] <0.001 [0.71] <0.001 [0.75] 0.67
lstm+lr 1.10 0.96 56.94 lstm+lr 2.10 1.95 38.26 AP <0.001 [0.60] <0.001 [0.52] <0.001 [0.62] <0.001 [0.64] 0.60
TI <0.001 [0.65] 0.007 [0.51] <0.001 [0.69] <0.001 [0.71] 0.64
UG Deep-SE 1.03 0.80 52.66 MD Deep-SE 5.97 4.93 50.29 DC 0.406 [0.55] 0.015 [0.60] <0.001 [0.97] 0.024 [0.58] 0.68
lstm+rf 1.07 0.85 50.70 lstm+rf 9.86 9.69 17.86 BB <0.001 [0.73] 0.007 [0.60] <0.001 [0.84] <0.001 [0.75] 0.73
lstm+svm 1.06 1.04 51.23 lstm+svm 6.70 5.44 44.19 CV <0.001 [0.70] 0.140 [0.63] <0.001 [0.82] 0.001 [0.70] 0.71
lstm+atlm 1.40 1.20 35.55 lstm+atlm 9.97 9.61 16.92 JI 0.006 [0.71] 0.001 [0.67] 0.002 [0.89] <0.001 [0.79] 0.77
lstm+lr 1.40 1.20 35.55 lstm+lr 9.97 9.61 16.92 MD <0.001 [0.76] <0.001 [0.57] <0.001 [0.74] <0.001 [0.69] 0.69
DM <0.001 [0.62] <0.001 [0.56] <0.001 [0.61] <0.001 [0.62] 0.60
AS Deep-SE 1.36 0.58 60.26 DM Deep-SE 3.77 2.22 47.87
MU 0.846 [0.53] 0.005 [0.62] 0.009 [0.67] 0.003 [0.64] 0.62
lstm+rf 1.62 1.40 52.38 lstm+rf 4.51 3.69 37.71 MS 0.502 [0.53] 0.054 [0.50] <0.001 [0.82] 0.195 [0.56] 0.60
lstm+svm 1.46 1.42 57.20 lstm+svm 4.20 2.87 41.93 XD <0.001 [0.63] <0.001 [0.57] <0.001 [0.65] <0.001 [0.60] 0.61
lstm+atlm 1.59 1.30 53.29 lstm+atlm 4.70 3.74 35.01 TD <0.001 [0.78] <0.001 [0.68] <0.001 [0.70] <0.001 [0.70] 0.72
lstm+lr 1.68 1.46 50.78 lstm+lr 5.30 3.66 26.68 TE 0.020 [0.53] 0.002 [0.59] <0.001 [0.66] 0.006 [0.65] 0.61
AP Deep-SE 2.71 2.52 42.58 MU Deep-SE 2.18 1.96 40.09
lstm+rf 2.96 2.80 37.34 lstm+rf 2.20 2.21 38.73
lstm+svm 3.06 2.90 35.26 lstm+svm 2.28 2.89 37.44
improvement of our approach over LSTM+RF, LSTM
lstm+atlm 3.06 2.76 35.21 lstm+atlm 2.46 2.39 32.51
lstm+lr 3.75 3.66 20.63 lstm+lr 2.46 2.39 32.51 +SVM, and LSTM+ATLM is still significant after applying
p-value correction with the effect size greater than 0.5 in
TI Deep-SE 1.97 1.34 55.92 MS Deep-SE 3.23 1.99 17.17
lstm+rf 2.32 1.97 48.02 lstm+rf 3.30 2.77 15.30 59/64 cases. In most cases, when comparing the proposed
lstm+svm 2.00 2.10 55.20 lstm+svm 3.31 3.09 15.10 model against LSTM+RF, LSTM+SVM, LSTM+ATLM, and
lstm+atlm 2.51 2.03 43.87 lstm+atlm 3.42 2.75 12.21 LSTM+LR, the effect sizes are small (between 0.5 and 0.6).
lstm+lr 2.71 2.31 39.32 lstm+lr 3.42 2.75 12.21 A major part of those improvement were brought by our
DC Deep-SE 0.68 0.53 69.92 XD Deep-SE 1.63 1.31 46.82 use of the deep learning LSTM architecture to model the
lstm+rf 0.69 0.62 69.52 lstm+rf 1.81 1.63 40.99 textual description of an issue. The use of highway recur-
lstm+svm 0.75 0.90 67.02 lstm+svm 1.80 1.77 41.33 rent networks (on top of LSTM) has also improved the pre-
lstm+atlm 0.87 0.59 61.57 lstm+atlm 1.83 1.65 40.45
lstm+lr 0.80 0.67 64.96 lstm+lr 1.85 1.72 39.63 dictive performance, but not as large effects as the LSTM
itself (especially for those projects which have very small
BB Deep-SE 0.74 0.61 71.24 TD Deep-SE 2.97 2.92 48.28
lstm+rf 1.01 1.00 60.95 lstm+rf 3.89 4.37 32.14
number of issues). However, our approach, Deep-SE,
lstm+svm 0.81 1.00 68.55 lstm+svm 3.49 3.37 39.13 achieved Aiu greater than 0.6 in the most cases.
lstm+atlm 1.97 1.78 23.70 lstm+atlm 3.86 4.11 32.71
lstm+lr 1.26 1.16 51.24 lstm+lr 3.79 3.67 33.88 The proposed approach of using Recurrent Highway
CV Deep-SE 2.11 0.80 50.45 TE Deep-SE 0.64 0.59 69.67 Networks is effective in building a deep representation
lstm+rf 3.08 2.77 27.58 lstm+rf 0.66 0.65 68.51 of issue reports and consequently improving story point
lstm+svm 2.50 2.32 41.22 lstm+svm 0.70 0.90 66.61
lstm+atlm 3.11 2.49 26.90 lstm+atlm 0.70 0.72 66.51
estimation.
lstm+lr 3.36 2.76 21.07 lstm+lr 0.77 0.71 63.20

MAE and MdAE - the lower the better, SA - the higher the better. RQ3: Benefits of LSTM Document Representation
To study the benefits of using LSTM in representing issue
Using RHWN improved over RF between 0.91 percent reports, we compared the improved accuracy achieved by
(in MU) to 39.45 percent (in MD) in MAE, 5.88 percent Random Forest using the features derived from LSTM
(in UG) to 71.12 percent (in CV) in MdAE, and 0.58 percent against that using the features derived from BoW and Doc2-
(in DC) to 181.58 percent (in MD) in SA. The improvements vec. For a fair comparison we used Random Forests as the
of RHWN over SVM are between 1.50 percent (in TI) to regressor in all settings and the result is reported in Table 7
32.35 percent (in JI) in MAE, 9.38 percent (in MD) to 65.52 (see the distribution of the Absolute Error in Appendix A.3,
percent (in CV) in MdAE, and 1.30 percent (in TI) to 48.61 available in the online supplemental material). LSTM per-
percent (in JI). In terms of using ATLM, RHWN improved forms better than BoW and Doc2vec with respect to the
over it between 5.56 percent (in MS) to 62.44 percent (in BB) in MAE, MdAE, and SA measures in twelve projects (e.g., ME,
MAE, 8.70 percent (in AP) to 67.87 percent (in CV) in MdAE, UG, and AS) from sixteen projects. LSTM improved 4.16
and 3.89 percent (in ME) to 200.59 percent (in BB) in SA. Over- and 11.05 percent in MAE over Doc2vec and BoW, respec-
all, RHWN improved , in terms of MAE, 9.63 percent over tively, averaging across all projects.
SVM, 13.96 percent over RF, 21.84 percent over ATLM, and Among those twelve projects, LSTM improved over BoW
23.24 percent over LR, averaging across all projects. between 0.30 percent (in MS) to 28.13 percent (in DC) in
In addition, the results for the Wilcoxon test to compare terms of MAE, 1.06 percent (in AP) to 45.96 percent (in JI) in
our approach (Deep-SE) against LSTM+RF, LSTM+SVM, terms of MdAE, and 0.67 percent (in AP) to 47.77 percent
LSTM+ATLM, and LSTM+LR is shown in Table 6. The (in TD) in terms of SA. It also improved over Doc2vec
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 649

TABLE 7 TABLE 9
Evaluation Results of LSTM+RF, BoW+RF, and Doc2vec+RF Mean Absolute Error (MAE) on Cross-Project Estimation
(the Best Results Are Highlighted in Bold) and Comparison of Deep-SE and ABE0 Using Wilcoxon
Test and A^XY Effect Size (in Brackets)
Proj Method MAE MdAE SA Proj Method MAE MdAE SA
ME lstm+rf 1.08 0.90 57.57 JI lstm+rf 1.71 1.27 49.71 Source Target Deep-SE ABE0 Deep-SE vs ABE0
bow+rf 1.31 1.34 48.66 bow+rf 2.10 2.35 38.34 (i) within-repository
d2v+rf 1.14 0.98 55.28 d2v+rf 2.10 2.14 38.29 ME UG 1.07 1.23 <0.001 [0.78]
UG lstm+rf 1.07 0.85 50.70 MD lstm+rf 9.86 9.69 17.86 UG ME 1.14 1.22 0.012 [0.52]
bow+rf 1.19 1.28 45.24 bow+rf 10.20 10.22 15.07 AS AP 2.75 3.08 <0.001 [0.67]
d2v+rf 1.12 0.92 48.47 d2v+rf 8.02 9.87 33.19 AS TI 1.99 2.56 <0.001 [0.70]
AS lstm+rf 1.62 1.40 52.38 DM lstm+rf 4.51 3.69 37.71 AP AS 2.85 3.00 0.051 [0.55]
bow+rf 1.83 1.53 46.34 bow+rf 4.78 3.98 33.84 AP TI 3.41 3.53 0.003 [0.56]
d2v+rf 1.62 1.41 52.38 d2v+rf 4.71 3.99 34.87 MU MS 3.14 3.55 0.041 [0.55]
MS MU 2.31 2.64 0.030 [0.56]
AP lstm+rf 2.96 2.80 37.34 MU lstm+rf 2.20 2.21 38.73
bow+rf 2.97 2.83 37.09 bow+rf 2.31 2.54 36.64 Avg 2.33 2.60
d2v+rf 3.20 2.91 32.29 d2v+rf 2.21 2.69 39.36
(ii) cross-repository
TI lstm+rf 2.32 1.97 48.02 MS lstm+rf 3.30 2.77 15.30
AS UG 1.57 2.04 0.004 [0.61]
bow+rf 2.58 2.30 42.15 bow+rf 3.31 2.57 15.58
d2v+rf 2.41 2.16 46.02 d2v+rf 3.40 2.93 12.79
AS ME 2.08 2.14 0.022 [0.51]
MD AP 5.37 6.95 <0.001 [0.58]
DC lstm+rf 0.69 0.62 69.52 XD lstm+rf 1.81 1.63 40.99 MD TI 6.36 7.10 0.097 [0.54]
bow+rf 0.96 1.11 57.78 bow+rf 1.98 1.72 35.56 MD AS 5.55 6.77 <0.001 [0.61]
d2v+rf 0.77 0.77 66.14 d2v+rf 1.88 1.73 38.72 DM TI 2.67 3.94 <0.001 [0.64]
BB lstm+rf 1.01 1.00 60.95 TD lstm+rf 3.89 4.37 32.14 UG MS 4.24 4.45 0.005 [0.54]
bow+rf 1.34 1.26 48.06 bow+rf 4.49 5.05 21.75 ME MU 2.70 2.97 0.015 [0.53]
d2v+rf 1.12 1.16 56.51 d2v+rf 4.33 4.80 24.48
Avg 3.82 4.55
CV lstm+rf 3.08 2.77 27.58 TE lstm+rf 0.66 0.65 68.51
bow+rf 2.98 2.93 29.91 bow+rf 0.86 0.69 58.89
d2v+rf 3.16 2.79 25.70 d2v+rf 0.70 0.89 66.61
The improvement of LSTM over BoW and Doc2vec is sig-
MAE and MdAE - the lower the better, SA - the higher the better. nificant after applying Bonferroni correction with effect size
greater than 0.5 in 24/32 cases and Aiu being greater than
between 0.45 percent (in MU) to 18.57 percent (in JI) in 0.5 in all projects (see Table 8).
terms of MAE, 0.71 percent (in AS) to 40.65 percent (in JI) in
terms of MdAE, and 2.85 percent (in TE) to 31.29 percent The proposed LSTM-based approach is effective in auto-
(in TD) in terms of SA. matically learning semantic features representing issue
We acknowledge that BoW and Doc2vec perform bet- description, which improves story-point estimation.
ter than LSTM in some cases. For example, in the Moo-
dle project (MD), D2V+RF performed better than LSTM RQ4: Cross-Project Estimation
+RF in MAE and SA—it achieved 8.02 MAE and 33.19 We performed sixteen sets of cross-project estimation
SA. This could reflect that the combination between experiments to test two settings: (i) within-repository: both
LSTM and RHWN significantly improves the accuracy of the source and target projects (e.g., Apache Mesos and
the estimations. Apache Usergrid) were from the same repository, and pre-
training was done using only the source projects, not the
TABLE 8 target projects; and (ii) cross-repository: the source project
Comparison of Random Forest with LSTM, Random Forests
(e.g., Appcelerator Studio) was in a different repository
with BoW, and Random Forests with Doc2vec Using
^XY Effect Size (in Brackets)
Wilcoxon Test and A from the target project Apache Usergrid, and pre-training
was done using only the source project.
LSTM vs BoW Doc2Vec Aiu Table 9 shows the performance of our Deep-SE model and
ME <0.001 [0.70] 0.142 [0.53] 0.62 ABE0 for cross-project estimation (see the distribution of the
UG <0.001 [0.71] 0.135 [0.60] 0.66 Absolute Error in Appendix A.4, available in the online sup-
AS <0.001 [0.66] <0.001 [0.51] 0.59 plemental material). We also used a benchmark of within-
AP 0.093 [0.51] 0.144 [0.52] 0.52 project estimation where older issues of the target project
TI <0.001 [0.67] <0.001 [0.55] 0.61 were used for training (see Table 3). In all cases, the proposed
DC <0.001 [0.73] 0.008 [0.59] 0.66 approach when used for cross-project estimation performed
BB <0.001 [0.77] 0.002 [0.66] 0.72
worse than when used for within-project estimation (e.g., on
CV 0.109 [0.61] 0.581 [0.57] 0.59
JI 0.009 [0.67] 0.011 [0.62] 0.65 average 20.75 percent reduction in performance for within-
MD 0.022 [0.63] 0.301 [0.51] 0.57 repository and 97.92 percent for cross-repository). However,
DM <0.001 [0.60] <0.001 [0.55] 0.58 our approach outperformed the cross-project baseline (i.e.,
MU 0.006 [0.59] 0.011 [0.57] 0.58 ABE0) in all cases—it achieved 2.33 and 3.82 MAE in within
MS 0.780 [0.54] 0.006 [0.57] 0.56 and cross repository, while ABE0 achieved 2.60 and 4.55
XD <0.001 [0.60] 0.005 [0.55] 0.58 MAE. The improvement of our approach over ABE0 is still
TD <0.001 [0.73] <0.001 [0.67] 0.70
TE <0.001 [0.69] 0.005 [0.61] 0.65 significant after applying p-value correction with the effect
size greater than 0.5 in 14/16 cases.
650 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

TABLE 10 TABLE 11
Evaluation Results on the Adjusted Story Points Mean Absolute Error (MAE) and Comparison of Deep-SE
(the Best Results Are Highlighted in Bold) and the Porru’s Approach Using Wilcoxon Test and
A^XY Effect Size (in Brackets)
Proj Method MAE MdAE SA Proj Method MAE MdAE SA
ME Deep-SE 0.27 0.03 76.58 JI Deep-SE 0.60 0.51 63.20 Proj Deep-SE Porru Deep-SE vs Porru
lstm+rf 0.34 0.15 70.43 lstm+rf 0.74 0.79 54.42 APSTUD 2.67 5.69 <0.001 [0.63]
bow+rf 0.36 0.16 68.82 bow+rf 0.66 0.53 58.99
DNN 0.47 1.08 <0.001 [0.74]
d2v+rf 0.35 0.15 69.87 d2v+rf 0.70 0.53 56.99
lstm+svm 0.33 0.10 71.20 lstm+svm 0.94 0.89 41.97
MESOS 0.76 1.23 0.003 [0.70]
lstm+atlm 0.33 0.14 70.97 lstm+atlm 0.89 0.89 45.18 MULE 2.32 3.37 <0.001 [0.61]
lstm+lr 0.37 0.21 67.68 lstm+lr 0.89 0.89 45.18 NEXUS 0.21 0.39 0.005 [0.67]
mean 1.12 1.07 3.06 mean 1.31 1.71 18.95 TIMOB 1.44 1.76 0.047 [0.57]
median 1.05 1.00 8.87 median 1.60 2.00 1.29 TISTUD 1.04 1.28 <0.001 [0.58]
XD 1.00 1.86 <0.001 [0.69]
UG Deep-SE 0.07 0.01 93.50 MD Deep-SE 2.56 2.29 31.83
lstm+rf 0.08 0.00 92.59 lstm+rf 3.45 3.55 8.24 avg 1.24 2.08
bow+rf 0.11 0.01 90.31 bow+rf 3.32 3.27 11.54
d2v+rf 0.10 0.01 91.22 d2v+rf 3.39 3.48 9.70
lstm+svm 0.15 0.10 86.38 lstm+svm 3.12 3.07 16.94
lstm+atlm 0.15 0.08 86.25 lstm+atlm 3.48 3.49 7.41 These results confirm a universal understanding [25] in
lstm+lr 0.15 0.08 86.25 lstm+lr 3.57 3.28 4.98 agile development that story point estimation is specific to
mean 1.04 0.98 4.79 mean 3.60 3.67 4.18
median 1.06 1.00 2.64 median 2.95 3.00 21.48 teams and projects. Since story points are relatively mea-
AS Deep-SE 0.53 0.20 69.16 DM Deep-SE 2.30 1.43 31.99 sured, it is not uncommon that two different same-sized
lstm+rf 0.56 0.45 67.49 lstm+rf 2.83 2.59 16.23 teams could give different estimates for the same user story.
bow+rf 0.56 0.49 67.39 bow+rf 2.83 2.63 16.33 For example, team A may estimate 5 story points for user
d2v+rf 0.56 0.46 67.37 d2v+rf 2.92 2.80 13.80
lstm+svm 0.55 0.32 68.34 lstm+svm 2.45 1.78 27.56 story UC1 while team B gives 10 story points. However, it
lstm+atlm 0.57 0.46 66.87 lstm+atlm 2.83 2.57 16.28 does not necessarily mean that team B would do more work
lstm+lr 0.57 0.49 67.12 lstm+lr 2.83 2.57 16.28
mean 1.18 0.79 31.89 mean 3.27 3.41 3.25
for completing UC1 than team A. It more likely means that
median 1.35 1.00 21.54 median 2.61 2.00 22.94 team’B baselines are twice bigger than team A’s, i.e., for
AP Deep-SE 0.92 0.86 21.95 MU Deep-SE 0.68 0.59 63.83 “baseline” user story which requires 5 times less the effort
lstm+rf 0.99 0.87 16.23 lstm+rf 0.70 0.55 63.01 than UC1 takes, team A would give it 1 story point while
bow+rf 1.00 0.87 15.33 bow+rf 0.70 0.57 62.79
d2v+rf 0.99 0.86 15.94 d2v+rf 0.71 0.57 62.17
team B gives 2 story points. Hence, historical estimates are
lstm+svm 1.12 0.92 5.26 lstm+svm 0.70 0.62 62.62 more valuable for within-project estimation, which is dem-
lstm+atlm 1.03 0.84 12.63 lstm+atlm 0.93 0.74 50.77 onstrated by this result.
lstm+lr 1.17 1.05 1.14 lstm+lr 0.79 0.61 58.00
mean 1.15 0.64 2.49 mean 1.21 1.51 35.86
median 0.94 1.00 20.29 median 1.64 2.00 12.80 Given the specificity of story points to teams and proj-
TI Deep-SE 0.59 0.17 56.53 MS Deep-SE 0.86 0.65 56.82 ects, our proposed approach is more effective for within-
lstm+rf 0.72 0.56 46.22 lstm+rf 0.91 0.76 54.37 project estimation.
bow+rf 0.73 0.58 46.10 bow+rf 0.89 0.93 55.48
d2v+rf 0.72 0.56 46.17 d2v+rf 0.90 0.69 54.66
lstm+svm 0.73 0.62 45.74 lstm+svm 0.94 0.78 52.91 RQ5: Adjusted/Normalized Story Points
lstm+atlm 0.73 0.57 45.86 lstm+atlm 0.99 0.87 50.45
lstm+lr 0.73 0.56 45.77 lstm+lr 0.99 0.87 50.45
Table 10 shows the results of our Deep-SE and the other
mean 1.32 1.56 1.57 mean 1.23 0.62 38.49 baseline methods in predicting the normalized story points.
median 0.86 1.00 36.04 median 1.44 1.00 27.83 Deep-SE performs well across all projects. Deep-SE impro-
DC Deep-SE 0.48 0.48 55.77 XD Deep-SE 0.35 0.08 80.66 ved MAE between 2.13 to 93.40 percent over the Mean
lstm+rf 0.49 0.49 55.02 lstm+rf 0.44 0.37 75.78
bow+rf 0.49 0.48 54.76 bow+rf 0.45 0.38 75.33
method, 9.45 to 93.27 percent over the Median method, 7.02
d2v+rf 0.50 0.50 53.59 d2v+rf 0.45 0.32 75.31 to 53.33 percent over LSTM+LR, 1.20 to 61.96 percent over
lstm+svm 0.49 0.43 55.24 lstm+svm 0.38 0.20 79.16 LSTM+ATLM, 1.20 to 53.33 percent over LSTM+SVM, 4.00
lstm+atlm 0.53 0.47 51.02 lstm+atlm 0.92 0.76 49.05
lstm+lr 0.53 0.47 51.02 lstm+lr 0.45 0.40 75.33 to 30.00 percent over Doc2vec+RF, 2.04 to 36.36 percent
mean 1.07 1.49 1.29 mean 1.03 1.28 43.06 over BoW+RF, and 0.86 to 25.80 percent over LSTM+RF.
median 0.58 1.00 46.76 median 0.75 1.00 58.74
The best result is obtained in the Usergrid project (UG), it is
BB Deep-SE 0.41 0.12 72.00 TD Deep-SE 0.82 0.64 53.36 0.07 MAE, 0.01 MdAE, and 93.50 SA. We however note that
lstm+rf 0.43 0.38 70.37 lstm+rf 0.84 0.68 52.65
bow+rf 0.45 0.40 69.33 bow+rf 0.88 0.65 50.30 the adjusted story points benefits all methods since it nar-
d2v+rf 0.49 0.45 66.34 d2v+rf 0.86 0.70 51.46 rows the gap between minimum and maximum value and
lstm+svm 0.42 0.21 71.21 lstm+svm 0.83 0.62 53.24
lstm+atlm 0.47 0.41 67.53 lstm+atlm 0.83 0.58 52.82
the distribution of the story points.
lstm+lr 0.47 0.41 67.53 lstm+lr 0.90 0.74 48.88
mean 1.15 0.76 20.92 mean 1.29 1.42 27.20 Our proposed approach still outperformed other techni-
median 1.39 1.00 4.50 median 0.99 1.00 44.17
ques in estimating the new adjusted story points.
CV Deep-SE 1.15 0.79 23.29 TE Deep-SE 0.40 0.05 74.58
lstm+rf 1.16 1.05 22.55 lstm+rf 0.47 0.46 70.39
bow+rf 1.22 1.10 18.95 bow+rf 0.48 0.48 69.52 RQ6: Compare Deep-SE Against the Existing Approach
d2v+rf 1.20 1.09 20.30 d2v+rf 0.48 0.48 69.41 We applied our approach, Deep-SE, and the Porru et al.’s
lstm+svm 1.22 1.15 18.77 lstm+svm 0.45 0.41 71.77
lstm+atlm 1.47 1.28 2.22 lstm+atlm 0.49 0.48 69.14 approach on their dataset consisted of eight projects.
lstm+lr 1.47 1.28 2.22 lstm+lr 0.49 0.48 69.14 Table 11 shows the evaluation results in MAE and the com-
mean 1.27 1.11 15.18 mean 0.99 0.60 37.28
median 1.29 1.00 13.92 median 1.39 1.00 12.09
parison of Deep-SE and the Porru et al.’s approach. The dis-
tribution of the Absolute Error is reported in Appendix A.5,
MAE and MdAE - the lower the better, SA - the higher the better. available in the online supplemental material. Deep-SE
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 651

TABLE 12 Doc2vec+RF, BoW+RF, LSTM+SVM, LSTM+ATLM, and

The Pre-Training, Training, and Testing Time at LSTM+LR). On this website, we also provide detailed
50 Embedding Dimensions of Our Deep-SE Model instructions on how to run the code and replicate all the
Repository Pre-training Proj. Training Testing experiments we reported in this paper so that our results
time time time can be independently verified.
Apache 6 h 28 min ME 23 min 1.732 s
UG 15 min 0.395 s 5.10 Threats to Validity
Appcelerator 5 h 11 min AS 27 min 2.209 s We tried to mitigate threats to construct validity by using real
AP 18 min 0.428 s world data from issues recorded in large open source proj-
TI 32 min 2.528 s ects. We collected the title and description provided with
Duraspace 3 h 34 min DC 18 min 1.475 s these issue reports and the actual story points that were
Jira 6 h 42 min BB 15 min 0.267 s
CV 14 min 0.219 s assigned to them. We are aware that those story points were
JI 13 min 0.252 s estimated by human teams, and thus may contain biases and
Moodle 6 h 29 min MD 15 min 1.789 s in some cases may not be accurate. We have mitigated this
Lsstcorp 3 h 26 min DM 40 min 5.293 s threats by performing two set of experiments: one on the
Mulesoft 2 h 39 min MU 21 min 0.535 s orgianal story points and the other on the adjusted normal-
MS 17 min 0.718 s ized story points. We further note that for story points, the
Spring 5 h 20 min XD 40 min 2.774 s
raw values are not as important as the relative values [80]. A
Talendforge 6 h 56 min TD 19 min 1.168 s
TE 16 min 0.591 s user story that is assigned 6 story points should be three
times as much as a user story that is assigned 2 story points.
Hence, when engineers determine an estimate for a new
outperforms the existing approach in all cases. Deep-SE issue, they need to compare the issue to other issues in the
improved between 18.18 percent (in TIMOB) to 56.48 per- past in order to make the estimation consistently. The prob-
cent (in DNN) in terms of MAE. In addition, the improve- lem is thus suitable for a machine learner. The trained predic-
ment of our approach over the Porru et al.’s approach is still tion system works in a similar manner as human engineers:
significant after applying p-value correction with the effect using past estimates as baselines for new estimation. The
size greater than 0.5 in all cases. Especially, the large effect prediction system tries to reproduce an estimate that human
size (A ^XY > 0:7) of the improvement is obtained in engineers would arrive at.
the DNN project. However, since we aim to mimic the team’s capability
in effort estimation, the current set of ground-truths suffi-
Our proposed approach outperformed the existing tech- ciently serves this purpose. When other sets of ground-
nique using TF-IDF in estimating the story points. truths become available, our model can be easily retrained.
To minimize threats to conclusion validity, we carefully
selected unbiased error measures, applied a number of sta-
5.8 Training/Testing Time tistical tests, and applied multiple statistical testing corr-
Deep learning models are known for taking a long time for ection to verify our assumptions [81]. Our study was
training. This is an important factor in considering adopting performed on datasets of different sizes. In addition, we
our approach, especially in an agile development setting. If carefully followed recent best practices in evaluating effort
training time takes longer than the duration of a sprint (e.g., estimation models [55], [57], [74] to decrease conclusion
one or two weeks), the prediction system would not be use- instability [82].
ful in practice. We have found that the training time of our The original implementation of Porru et al.’s method [64]
model was very small, ranging from 13 to 40 minutes with was not released, thus we have re-implemented our own
an average of 22 minutes across the 16 projects (see Table 12). version of their approach. We strictly followed the described
Pre-training time took much longer time, but it was done provided in their work, however we acknowledge that our
only once across a repository and took just below 7 hours at implementation may not reflect all the implementation
the maximum. Once the model was trained, getting an esti- details in their approach. To mitigate this threat, we have
mation from the model was very fast. As can be seen from tested our implementation using the dataset provided in
Table 12, the time it took for testing all issues in the test sets their work. We have found that our results were consistent
was in the order of seconds. Hence, for a given new issue, it with the results reported in their work.
would take less than a second for the machinery to come To mitigate threats to external validity, we have consid-
back with an story point estimation. All the experiments ered 23,313 issues from sixteen open source projects, which
were run on a MacOS laptop with 2.4 GHz Intel Core i5 and differ significantly in size, complexity, team of developers,
8 GB of RAM and the embedding dimensions of 50. Hence, and community. We however acknowledge that our dataset
this result suggests that using our proposed approach to esti- would not be representative of all kinds of software proj-
mate story points is applicable in practice. ects, especially in commercial settings (although open
source projects and commercial projects are similar in many
5.9 Verifiability aspects). One of the key differences between open source
We have created a replication package and made it available and commercial projects that may affect the estimation of
at https://github.com/SEAnalytics/. The package contains story points is the nature of contributors, developers, and
the full dataset and the source code of our Deep-SE model project’s stakeholders. Further investigation for commercial
and the benchmark models (i.e., the baselines, LSTM+RF, agile projects is needed.
652 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

5.11 Implications this direction, we make our pre-trained models publicly

In this section, we discuss a number of implications of our available for the research community.
results. We acknowledged that the explainability of a model is
What do the Results Mean for the Research on Effort Estima- important for full adoption of machine learning techniques.
tion? Existing work on effort estimation mainly focus on This is not a unique problem only for recurrent networks
estimating the whole project with a small number of data (RNN), but also for many powerful modern machine learn-
points (see the datasets in the PROMISE repository [83] for ing techniques (e.g., Random Forests and SVM). However,
example). The fast emergence of agile development demand RNN is not entirely a black-box as it seems (e.g., see [86]).
more research on estimation at the issue or user story level. For example, word importance can be credited using vari-
Our work opens a new research area for the use of software ous techniques (e.g., using gradient with respect to word
analytics in estimating story points. The assertion demon- value). Alternatively, there are model agnostic technique to
strated by our results is that our current method works and explain any prediction [87]. Even with partly interpretable
no other methods has been demonstrated to work at this RNN, if the prediction is accurate, then we can still expect a
scale of above 23,000 data points. Existing work in software high level of adoption.
effort estimation have dealt with a much smaller number of What do the Results Mean for Project Managers and Dev-
observations (i.e., data points) than our work did. For exam- elopers? Our positive results indicate that it is possible to
ple, the China dataset has only 499 data points, Desharnais build a prediction system to support project managers and
has 77, and Finish has 38 (see the datasets for effort estima- developers in estimating story points. Our proposal enables
tion on the PROMISE repository)—these are commonly used teams to be consistent in their estimation of story points.
in existing effort estimation work (e.g., [19], [84]). By con- Achieving this consistency is central to effectively leveraging
trast, in this work we deal with the scale of thousands of story points for project planning. The machine learner learns
data points. Since we make our dataset publicly available, from past estimates made by the specific team which it is
further research (e.g., modeling the codebase and adding deployed to assist. The insights that the learner acquires are
team-specific features into the estimation model) can be therefore team-specific. The intent is not to have the machine
advanced in this topic, and our current results can serve as learner supplant existing agile estimation practices. The
the baseline. intent, instead, is to deploy the machine learner to comple-
Should we Adopt Deep Learning? To the best of our knowl- ment these practices by playing the role of a decision support
edge, our work is the first major research in using deep learn- system. Teams would still meet, discuss user stories and gen-
ing for effort estimation. The use of deep learning has erate estimates as per current practice, but would have the
allowed us to automatically learn a good representation of added benefit of access to the insights acquired by the
an issue report and use this for estimating the effort of resolv- machine learner. Teams would be free to reject the sugges-
ing the issue. The evaluation results demonstrates the signifi- tions of the machine learner, as is the case with any decision
cant improvement that our deep learning approach has support system. In every such estimation exercise, the actual
brought in terms of predictive performance. This is a power- estimates generated are recorded as data to be fed to the
ful result since it helps software practitioners move away machine learner, independent of whether these estimates are
from the manual feature engineering process. Feature engi- based on the recommendations of the machine learner or
neering usually relies on domain experts who use their spe- not. This estimation process helps the team not only under-
cific knowledge of the data to create features for machine stand sufficient details about what it will take to to resolve
learners to work. In our approach, features are automatically those issues, but also align with their previous estimations.
learned from a textual description of an issue, thus obviating
the need for designing them manually. We of course need to
6 RELATED WORK
collect the labels (i.e., story points assigned to issues) as the
ground truths used for learning and testing. Hence, we Existing estimation methods can generally be classified into
believe that the wide adoption of software analytics in indus- three major groups: expert-based, model-based, and hybrid
try crucially depends on the ability to automatically derive approaches. Expert-based methods rely on human expertise
(learn) features from raw software engineering data. to make estimations, and are the most popular technique in
In our context of story point estimation, if the number of practice [88], [89]. Expert-based estimation however tends
new words is large, transfer learning is needed, e.g., by using to require large overheads and the availability of experts
the existing model as a strong prior for the new model. How- each time the estimation needs to be made. Model-based
ever, this can be mitigated by pre-training on a large corpus approaches use data from past projects but they are also var-
so that most of the terms are covered. After pre-training, our ied in terms of building customized models or using fixed
model is able to automatically learn semantic relations models. The well-known construction cost (COCOMO)
between words. For example, words related to networking model [11] is an example of a fixed model where factors and
like “soap”, “configuration”, “tcp”, and “load” are in one their relationships are already defined. Such estimation
cluster (see Fig. 7). Hence, even when a user story has several models were built based on data from a range of past proj-
unique terms (but already pre-trained), retraining the main ects. Hence, they tend to be suitable only for a certain kinds
model is not necessary. Pre-training may however take time of project that were used to build the model. The customized
and effort. One potential research direction is therefore model building approach requires context-specific data and
building up a community for sharing pre-trained networks, uses various methods such as regression (e.g., [12], [13]),
which can be used for initialization, thus reducing training Neural Network (e.g., [14], [90]), Fuzzy Logic (e.g., [15]),
times (similar to Model Zoo [85]). As the first step towards Bayesian Belief Networks (e.g., [16]), analogy-based (e.g.,
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 653

[17], [18]), and multi-objective evolutionary approaches (e.g., process the input sequence and a decoder RNN with attention
[19]). It is however likely that no single method will be the to generate the output sequence. This model takes as input a
best performer for all project types [10], [20], [91]. Hence, given API-related natural language query and returns API
some recent work (e.g., [20]) proposes to combine the esti- usage sequences. The work in [107] also uses RNN Encoder—
mates from multiple estimators. Hybrid approaches (e.g., Decoder but for fixing common errors in C programs. Deep
[21], [22]) combine expert judgements with the available Belief Network [108] is another common deep learning model,
data—similarly to the notions of our proposal. which has been used in software engineering, e.g., for build-
While most existing work focuses on estimating a whole ing defection prediction models [109], [110].
project, little work has been done in building models specifi-
cally for agile projects. Today’s agile, dynamic and change- 7 CONCLUSION
driven projects require different approaches to planning and In this paper, we have contributed to the research community
estimating [24]. Some recent approaches leverage machine the dataset for story point estimations, sourcing from 16 large
learning techniques to support effort estimation for agile and diverse software projects. We have also proposed a
projects. Recently, the work in [64] proposed an approach deep learning-based, fully end-to-end prediction system for
which extracts TF-IDF features from issue description to estimating story points, removing the users from manually
develop an story-point estimation model. The univariate fea- designing features from the textual description of issues.
ture selection technique are then applied on the extracted A key novelty of our approach is the combination of two pow-
features and fed into classifiers (e.g., SVM). In addition, the erful deep learning architectures: Long Short-Term Memory
work in [92] applied Cosmic Function Points (CFP) [93] to (to learn a vector representation for issue reports) and Recur-
estimate the effort for completing an agile project. The work rent Highway Network (for building a deep representation).
in [94] developed an effort prediction model for iterative The proposed approach has consistently outperformed
software development setting using regression models and three common baselines and four alternatives according to
neural networks. Differing from traditional effort estimation our evaluation results. Compared against the Mean and
models, this model is built after each iteration (rather than at Median techniques, the proposed approach has improved
the end of a project) to estimate effort for the next iteration. 34.06 and 26.77 percent respectively in MAE averaging
The work in [95] built a Bayesian network model for effort across 16 projects we studied. Compared against the BoW
prediction in software projects which adhere to the agile and Doc2Vec techniques, our approach has improved 23.68
Extreme Programming method. Their model however relies and 17.90 percent in MAE. These are significant results in
on several parameters (e.g., process effectiveness and pro- the literature of effort estimation. A major part of those
cess improvement) that require learning and extensive fine improvement were brought by our use of the deep learning
tuning. Bayesian networks are also used in [96] to model LSTM architecture to model the textual description of an
dependencies between different factors (e.g., sprint progress issue. The use of highway recurrent networks (on top of
LSTM) has also improved the predictive performance, but
and sprint planning quality influence product quality) in
not as significantly as the LSTM itself (especially for those
Scrum-based software development project in order to detect
project which have very small number of issues).
problems in the project. Our work specifically focuses on
Our future work would involve expanding our study to
estimating issues with story points using deep learning tech-
commercial software projects and other large open source
niques to automatically learn semantic features representing
projects to further evaluate our proposed method. We also
the actual meaning of issue descriptions, which is the key dif- consider performing team analytics (e.g., features character-
ference from previous work. Previous research (e.g., [97], izing a team) to model team changes over time and feed it
[98], [99], [100]) has also been done in predicting the elapsed into our prediction model. We also plan to investigate how
time for fixing a bug or the delay risk of resolving an issue. to learn a semantic representation of the codebase and use it
However, effort estimation using story points is a more pref- as another input to our model. Furthermore, we will look
erable practice in agile development. into experimenting with a sliding window setting to explore
LSTM has shown successes in many applications such as incremental learning. In addition, we will also investigate
language models [35], speech recognition [36] and video anal- how to best use the issue’s metadata (e.g., priority and type)
ysis [37]. Our Deep-SE is a generic in which it maps text to a and still maintain the end-to-end nature of our entire model.
numerical score or a class, and can be used for other tasks, Our future work also involve comparing our use of the
e.g., mapping a movie review to a score, or assigning scores to LSTM model against other state-of-the-art models of natural
essays, or sentiment analysis. Deep learning has recently language such as paragraph2vec [59] or Convolutional Neu-
attracted increasing interests in software engineering. Our ral Network [111]. We have discussed (informally) our work
previous work [101] proposed a generic deep learning frame- with several software developers who has been practising
work based on LSTM for modeling software and its develop- agile and estimating story points. They all agreed that our
ment process. White et al. [102] has employed recurrent prediction system could be useful in practice. However, to
neural networks to build a language model for source code. make such a claim, we need to implement it into a tool and
Their later work [103] extended these RNN models for detect- perform a user study. Hence, we would like to evaluate
ing code clones. The work in [104] also used RNNs to build a empirically the impact of our prediction system for story
statistical model for code completion. Our recent work [105] point estimation in practice by project managers and/or soft-
used LSTM to build a language model for code and demon- ware developers. This would involve developing the model
strated the improvement of this model compared to the into a tool (e.g., a JIRA plugin) and then organising trial use
one using RNNs. Gu et al. [106] used a special RNN in practice. This is an important part of our future work to
Encoder—Decoder, which consists of an encoder RNN to confirm the ultimate benefits of our approach in general.
654 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

ACKNOWLEDGMENTS [23] H. F. Cervone, “Understanding agile project management meth-

ods using Scrum,” OCLC Syst. Serv.: Int. Digit. Library Perspec-
This research project was partially supported by Faculty tives, vol. 27, no. 1, pp. 18–22, 2011.
[24] M. Cohn, Agile Estimating and Planning. London, U.K.: Pearson
of Information and Communication Technology, Mahidol Education, 2005.
University. We also gratefully acknowledge the support [25] M. Usman, E. Mendes, F. Weidt, and R. Britto, “Effort estimation
of NVIDIA Corporation with the donation of the Titan in agile software development: A systematic literature review,” in
X Pascal GPU used for this research. Proc. 10th Int. Conf. Predictive Models Softw. Eng., 2014, pp. 82–91.
[26] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating
complexity, code churn, and developer activity metrics as indica-
REFERENCES tors of software vulnerabilities,” IEEE Trans. Softw. Eng., vol. 37,
[1] B. Michael, S. Blumberg, and J. Laartz, “Delivering large-scale IT no. 6, pp. 772–787, Nov./Dec. 2011.
projects on time, on budget, and on value,” 2012. [Online]. Avail- [27] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy,
able: http://www.mckinsey.com/business-functions/digital- “Cross-project defect prediction: A large scale experiment on
mckinsey/our-insights/delivering-large-scale-it-projects-on- data versus domain versus process,” in Proc. 7th Joint Meet. Eur.
time-on-budget-and-on-value Softw. Eng. Conf. ACM SIGSOFT Symp. Found. Softw. Eng., 2009,
[2] B. Flyvbjerg and A. Budzier, “Why your IT project may be riskier pp. 91–100.
than you think,” Harvard Bus. Rev., vol. 89, no. 9, pp. 601–603, 2011. [28] Atlassian, “Atlassian JIRA agile software,” 2016. [Online]. Avail-
[3] A. Trendowicz and R. Jeffery, Software Project Effort Estimation: able: https://www.atlassian.com/software/jira
Foundations and Best Practice Guidelines for Success. Berlin, [29] Spring, “Spring XD issue XD-2970,” 2016. [Online]. Available:
Germany: Springer, 2014. https://jira.spring.io/browse/XD-2970
[4] L. C. Briand and I. Wieczorek, “Resource estimation in [30] J. Grenning, “Planning poker or how to avoid analysis paralysis
softwareengineering,” in Encyclopedia of Software Engineering. while release planning,” Hawthorn Woods: Renaissance Soft. Con-
Hoboken, NJ, USA: Wiley, 2002. sulting, vol. 3, pp. 1–3, 2002,
[5] E. Kocaguneli, A. T. Misirli, B. Caglayan, and A. Bener, “Expe- [31] M. Choetkiertikul, H. K. Dam, T. Tran, A. Ghose, and J. Grundy,
riences on developer participation and effort estimation,” in Proc. “Predicting delivery capability in iterative software development,”
37th EUROMICRO Conf. Softw. Eng. Adv. Appl., 2011, pp. 419–422. IEEE Trans. Softw. Eng., doi: 10.1109/TSE.2017.2693989
[6] M. Jorgensen, “What we do and don’t know about software [32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
development effort estimation,” IEEE Softw., vol. 31, no. 2, Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
pp. 37–40, Mar./Apr. 2014. [33] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget:
[7] S. McConnell, Software Estimation: Demystifying the Black Art. Continual prediction with LSTM,” Neural Comput., vol. 12, no. 10,
Redmond, WA, USA: Microsoft Press, 2006. pp. 2451–2471, 2000.
[8] T. Menzies, Z. Chen, J. Hihn, and K. Lum, “Selecting best practi- [34] J. F. Kolen, and S. C. Kremer, “Gradient flow in recurrent nets:
ces for effort estimation,” IEEE Trans. Softw. Eng., vol. 32, no. 11, The difficulty of learning long-term dependencies,” A Field Guide
pp. 883–895, Nov. 2006. to Dynamical Recurrent Networks, Wiley-IEEE Press, pp. 237–243,
[9] I. Sommerville, Software Engineering, 9th ed. London, U.K.: 2009.
Pearson, 2010. [35] M. Sundermeyer, R. Schl€ uter, and H. Ney, “LSTM neural net-
[10] M. Jørgensen and M. Shepperd, “A systematic review of soft- works for language modeling,” in Proc. Annu. Conf. Int. Speech
ware development cost estimation studies,” IEEE Trans. Softw. Commun. Assoc., 2012, pp. 194–197.
Eng., vol. 33, no. 1, pp. 33–53, Jan. 2007. [36] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition
[11] B. W. Boehm, R. Madachy, and B. Steece, Software Cost Estimation with deep recurrent neural networks,” in Proc. IEEE Int. Conf.
with Cocomo II. Englewood Cliffs, NJ, USA: Prentice Hall, 2000. Acoust. Speech Signal Process., 2013, pp. 6645–6649.
[12] P. Sentas, L. Angelis, and I. Stamelos, “Multinomial logistic [37] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals,
regression applied on software productivity prediction,” in Proc. R. Monga, and G. Toderici, “Beyond short snippets: Deep net-
9th Panhellenic Conf. Informat., 2003, pp. 1–12. works for video classification,” in Proc. IEEE Conf. Comput. Vis.
[13] P. Sentas, L. Angelis, I. Stamelos, and G. Bleris, “Software pro- Pattern Recognit., 2015, pp. 4694–4702.
ductivity and effort prediction with ordinal regression,” Inf. [38] T. Pham, T. Tran, D. Phung, and S. Venkatesh, “Faster training of
Softw. Technol., vol. 47, no. 1, pp. 17–29, 2005. very deep networks via p-norm gates,” in Proc. 23rd Int. Conf.
[14] S. Kanmani, J. Kathiravan, S. S. Kumar, M. Shanmugam, and Pattern Recognit., 2016, pp. 3542–3547.
P. E. College, “Neural network based effort estimation using [39] T. Pham, T. Tran, D. Phung, and S. Venkatesh, “Predicting health-
class points for OO systems,” in Proc. Int. Conf. Comput.: Theory care trajectories from medical records: A deep learning approach,”
Appl. (ICCTA), 2007, pp. 261–266. J. Biomed. Informat., vol. 69, pp. 218–229, 2017. [Online]. Available:
[15] S. Kanmani, J. Kathiravan, S. S. Kumar, and M. Shanmugam, http://www.sciencedirect.com/science/article/pii/
“Class point based effort estimation of OO systems using fuzzy S1532046417300710
subtractive clustering and artificial neural networks,” in Proc. 1st [40] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
India Softw. Eng. Conf., 2008, pp. 141–142. Surpassing human-level performance on ImageNet classifica-
[16] S. Bibi, I. Stamelos, and L. Angelis, “Software cost prediction tion,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1026–1034.
with predefined interval estimates,” in Proc. 1st Softw. Meas. Eur. [41] G. Hinton, et al., “Deep neural networks for acoustic modeling in
Forum, 2004, pp. 237–246. speech recognition: The shared views of four research groups,”
[17] M. Shepperd and C. Schofield, “Estimating software project IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
effort using analogies,” IEEE Trans. Softw. Eng., vol. 23, no. 11, [42] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence
pp. 736–743, Nov. 1997. learning with neural networks,” in Proc. Int. Conf. Neural Inf.
[18] L. Angelis and I. Stamelos, “A simulation tool for efficient anal- Process. Syst., 2014, pp. 3104–3112.
ogy based cost estimation,” Empirical Softw. Eng., vol. 5, no. 1, [43] A. Kumar, et al., “Ask me anything: Dynamic memory networks
pp. 35–68, 2000. for natural language processing,” in Proc. The 33rd Int. Conf.
[19] F. Sarro, A. Petrozziello, and M. Harman, “Multi-objective soft- Mach. Learn. (ICML), vol. 48, 2016, pp. 1378–1387.
ware effort estimation,” in Proc. 38th Int. Conf. Softw. Eng., 2016, [44] Y. Bengio, “Learning deep architectures for AI,” Found. Trends
pp. 619–630. Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[20] E. Kocaguneli, T. Menzies, and J. W. Keung, “On the value of [45] S. Liang and R. Srikant, “Why deep neural networks for fun-
ensemble effort estimation,” IEEE Trans. Softw. Eng., vol. 38, ction approximation?” arXiv:1610.04161, pp. 1–13, 2016. [Online].
no. 6, pp. 1403–1416, Nov./Dec. 2012. Available: https://arxiv.org/abs/1610.04161
[21] R. Valerdi, “Convergence of expert opinion via the wideband [46] H. Mhaskar, Q. Liao, and T. A. Poggio, “When and why are deep
delphi method: An application in cost estimation models,” networks better than shallow ones?” in Proc. AAAI Conf. Artif.
INCOSE Int. Symp., vol. 21, no. 1, pp. 1246–1259, 2011. Intell., 2017, pp. 2343–2349.
[22] S. Chulani, B. Boehm, and B. Steece, “Bayesian analysis of empir- [47] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein,
ical software engineering cost models,” IEEE Trans. Softw. Eng., “On the expressive power of deep neural networks,” in Proc. Int.
vol. 25, no. 4, pp. 573–583, Jul./Aug. 1999. Conf. Mach. Learn., 2017, pp. 2847–2854.
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 655

[48] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the num- [71] K. Muller, “Statistical power analysis for the behavioral scien-
ber of linear regions of deep neural networks,” in Proc. Int. Conf. ces,” Technometrics, vol. 31, no. 4, pp. 499–500, 1989.
Neural Inf. Process. Syst., 2014, pp. 2924–2932. [72] H. H. Abdi, “The bonferonni and Sidak corrections for multiple
[49] M. Bianchini and F. Scarselli, “On the complexity of neural net- comparisons,” Encyclopedia Meas. Statist.., vol. 1, pp. 1–9, 2007.
work classifiers: A comparison between shallow and deep [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/
architectures,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, download?doi=10.1.1.78.8747&rep=rep1&type=pdf
no. 8, pp. 1553–1565, Aug. 2014. [73] A. Vargha and H. D. Delaney, “A critique and improvement of
[50] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very the CL common language effect size statistics of McGraw and
deep networks,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst. Wong,” J. Educational Behavioral Statist., vol. 25, no. 2, pp. 101–
(NIPS), 2015, pp. 2377–2385. 132, 2000. [Online]. Available: http://jeb.sagepub.com/cgi/doi/
[51] J. Schmidhuber, “Deep learning in neural networks: An over- 10.3102/10769986025002101
view,” Neural Netw., vol. 61, pp. 85–117, 2015. [Online]. Avail- [74] A. Arcuri and L. Briand, “A practical guide for using statistical
able: http://dx.doi.org/10.1016/j.neunet.2014.09.003 tests to assess randomized algorithms in software engineering,”
[52] M. U. Gutmann and A. Hyv€arinen, “Noise-contrastive estima- in Proc. 33rd Int. Conf. Softw. Eng., 2011, pp. 1–10.
tion of unnormalized statistical models, with applications to [75] L. van der Maaten and G. Hinton, “Visualizing high-dimensional
natural image statistics,” J. Mach. Learn. Res., vol. 13, pp. 307–361, data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605,
2012. Nov. 2008.
[53] T. D. Team, “Theano: A {Python} framework for fast computa- [76] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent,
tion of mathematical expressions,” arXiv e-prints, vol. abs/ and S. Bengio, “Why does unsupervised pre-training help deep
1605.0, 2016. [Online]. Available: http://deeplearning.net/ learning?” J. Mach. Learn. Res., vol. 11, pp. 625–660, Mar. 2010.
software/theano [77] J. Weston, F. Ratle, and R. Collobert, “Deep learning via semi-
[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and supervised embedding,” in Proc. 25th Int. Conf. Mach. Learn.,
R. Salakhutdinov, “Dropout: A simple way to prevent neural net- 2008, pp. 1168–1175. [Online]. Available: http://doi.acm.org/
works from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 10.1145/1390156.1390303
2014. [78] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu,
[55] M. Shepperd and S. MacDonell , “Evaluating prediction systems and P. Kuksa, “Natural language processing (almost) from
in software project estimation,” Inf. Softw. Technol., vol. 54, no. 8, scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011.
pp. 820–827, 2012. [Online]. Available: http://dx.doi.org/ [Online]. Available: http://dl.acm.org/citation.cfm?
10.1016/j.infsof.2011.12.008 id=1953048.2078186
[56] R. Moraes, J. F. Valiati, and W. P. Gavi~ao Neto, “Document- [79] D. Zwillinger and S. Kokoska, CRC Standard Probability and Statis-
level sentiment classification: An empirical comparison between tics Tables and Formulae. Boca Raton, FL, USA: CRC Press, 1999.
SVM and ANN,” Expert Syst. Appl., vol. 40, no. 2, pp. 621–633, [80] J. McCarthy, “From here to human-level AI,” Artif. Intell.,
2013. vol. 171, no. 18, pp. 1174–1182, 2007.
[57] P. A. Whigham, C. A. Owen, and S. G. Macdonell, “A baseline [81] A. Arcuri and L. Briand, “A Hitchhiker’s guide to statistical tests
model for software effort estimation,” ACM Trans. Softw. Eng. for assessing randomized algorithms in software engineering,”
Methodology, vol. 24, no. 3, 2015, Art. no. 20. Softw. Testing Verification Rel., vol. 24, no. 3, pp. 219–250, 2014.
[58] P. Tirilly, V. Claveau, and P. Gros, “Language modeling for bag- [82] T. Menzies and M. Shepperd, “Special issue on repeatable results
of-visual words image categorization,” in Proc. 2008 Int. Conf. in software engineering prediction,” Empirical Softw. Eng.,
Content-Based Image Video Retrieval, 2008, pp. 249–258. vol. 17, no. 1/2, pp. 1–17, 2012.
[59] Q. Le and T. Mikolov, “Distributed representations of sentences [83] T. Menzies, et al., “The PROMISE Repository of empirical soft-
and documents,” in Proc. 31st Int. Conf. Mach. Learn., vol. 32, ware engineering data,” North Carolina State University,
pp. 1188–1196, 2014. Department of Computer Science, 2015, [Online]. Available:
[60] E. Kocaguneli, S. Member, and T. Menzies, “Exploiting the essen- http://openscience.us/repo
tial assumptions of analogy-based effort estimation,” IEEE Trans. [84] P. L. Braga, A. L. I. Oliveira, and S. R. L. Meira, “Software effort
Softw. Eng., vol. 38, no. 2, pp. 425–438, Mar./Apr. 2012. estimation using machine learning techniques with robust confi-
[61] E. Kocaguneli, T. Menzies, and E. Mendes, “Transfer learning in dence intervals,” in Proc. 7th Int. Conf. Hybrid Intell. Syst., 2007,
effort estimation,” Empirical Softw. Eng., vol. 20, no. 3, pp. 813–843, pp. 352–357.
2015. [85] Y. Jia, et al., “Caffe: Convolutional architecture for fast feature
[62] E. Mendes, I. Watson, and C. Triggs, “A comparative study of embedding,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014,
cost estimation models for web hypermedia applications,” pp. 675–678.
Empirical Softw. Eng., vol. 8, pp. 163–196, 2003. [86] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and un-
[63] Y. F. Li, M. Xie, and T. N. Goh, “A study of project selection and derstanding recurrent networks,” in arXiv:1506.02078, 2015,
feature weighting for analogy based software cost estimation,” pp. 1–12. [Online]. Available: http://arxiv.org/abs/1506.02078
J. Syst. Softw., vol. 82, no. 2, pp. 241–252, Feb. 2009. [Online]. [87] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust
Available: http://dx.doi.org/10.1016/j.jss.2008.06.001 you?’: Explaining the predictions of any classifier,” in Proc.
[64] S. Porru, A. Murgia, S. Demeyer, M. Marchesi, and 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016,
R. Tonelli, “Estimating story points from issue reports,” pp. 1135–1144. [Online]. Available: http://arxiv.org/abs/
Proc. 12th Int. Conf. Predictive Models Data Anal. Softw. Eng., 1602.04938
2016, Art. no. 2. [88] M. Jorgensen, “A review of studies on expert estimation of
[65] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering software development effort,” J. Syst. Softw., vol. 70, no. 1/2,
Metrics and Models. Redwood City, CA, USA: Benjamin- pp. 37–60, 2004.
Cummings Publishing Co., Inc., 1986. [89] M. Jorgensen and T. M. Gruschke, “The impact of lessons-learned
[66] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, “A simula- sessions on effort estimation and uncertainty assessments,” IEEE
tion study of the model evaluation criterion MMRE,” IEEE Trans. Trans. Softw. Eng., vol. 35, no. 3, pp. 368–383, May/Jun. 2009.
Softw. Eng., vol. 29, no. 11, pp. 985–995, Nov. 2003. [90] A. Panda, S. M. Satapathy, and S. K. Rath, “Empirical validation of
[67] B. Kitchenham, L. Pickard, S. MacDonell, and M. Shepperd, neural network models for agile software effort estimation based
“What accuracy statistics really measure,” IEE Proc. Softw., on story points,” Procedia Comput. Sci., vol. 57, pp. 772–781, 2015.
vol. 148, no. 3, pp. 81–85, Jun. 2001. [91] F. Collopy, “Difficulty and complexity as factors in software eff-
[68] M. Korte and D. Port, “Confidence in software cost estimation ort estimation,” Int. J. Forecasting, vol. 23, no. 3, pp. 469–471, 2007.
results based on MMRE and PRED,” in Proc. 4th Int. Workshop [92] R. Djouab, C. Commeyne, and A. Abran, “Effort estimation with
Predictor Models Softw. Eng., 2008, pp. 63–70. story points and COSMIC function points - an industry case
[69] D. Port and M. Korte, “Comparative studies of the model evalua- study,” pp. 25–36, 2008. [Online]. Available: http://cosmic-
tion criterions MMRE and PRED in software cost estimation sizing.org/wp-content/uploads/2016/03/Estimation-model-v-
research,” in Proc. 2nd ACM-IEEE Int. Symp. Empirical Softw. Eng. Print-Format-adapter.pdf
Meas., 2008, pp. 51–60. [93] ISO/IEC JTC 1/SC 7, INTERNATIONAL STANDARD ISO/IEC
[70] T. Menzies, E. Kocaguneli, B. Turhan, L. Minku, and F. Peters, Software Engineering COSMIC: A Functional Size Measurement
Sharing Data and Models in Software Engineering. San Mateo, CA, Method, vol. 2011, 2011. [Online]. Available: https://www.iso.
USA: Morgan Kaufmann, 2014. org/standard/54849.html
656 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019

[94] P. Abrahamsson, R. Moser, W. Pedrycz, A. Sillitti, and G. Succi, Hoa Khanh Dam received the bachelor of com-
“Effort prediction in iterative software development processes— puter science degree from the University of
incremental versus global prediction models,” in Proc. 1st Int. Melbourne, in Australia, and the master’s and PhD
Symp. Empirical Softw. Eng. Meas., 2007, pp. 344–353. degrees in computer science from RMIT University.
[95] P. Hearty, N. Fenton, D. Marquez, and M. Neil, “Predicting proj- He is a senior lecturer in the School of Computing
ect velocity in XP using a learning dynamic Bayesian network and Information Technology, University of Wollon-
model,” IEEE Trans. Softw. Eng., vol. 35, no. 1, pp. 124–137, gong (UOW), in Australia. He is an associate direc-
Jan./Feb. 2009. tor for the Decision System Lab, University of
[96] M. Perkusich, H. De Almeida, and A. Perkusich, “A model to Wollongong, heading its Software Engineering
detect problems on scrum-based software development proj- Analytics research program. His research interests
ects,” in Proc. ACM Symp. Appl. Comput., 2013, pp. 1037–1042. lie primarily in the intersection of software engineer-
[97] E. Giger, M. Pinzger, and H. Gall, “Predicting the fix time of ing, business process management and service-oriented computing,
bugs,” in Proc. 2nd Int. Workshop Recommendation Syst. Softw. focusing on such areas as software engineering analytics, process analyt-
Eng., 2010, pp. 52–56. ics, and service analytics. His research has won multiple Best Paper
[98] L. D. Panjer, “Predicting eclipse bug lifetimes,” in Proc. 4th Int. Awards (at WICSA, APCCM, and ASWEC) and ACM SIGSOFT Distin-
Workshop Mining Softw. Repositories, 2007, pp. 29–32. guished Paper Award (at MSR).
[99] P. Bhattacharya and I. Neamtiu, “Bug-fix time prediction models:
Can we do better?” in Proc. 8th Working Conf. Mining Softw.
Repositories, 2011, pp. 207–210. Truyen Tran received the BSc degree from the
[100] P. Hooimeijer and W. Weimer, “Modeling bug report quality,” in University of Melbourne and the PhD degree in
Proc. 22 IEEE/ACM Int. Conf. Automated Softw. Eng., Nov. 2007, computer science from Curtin University, in 2001
pp. 34–44. and 2008, respectively. He is a senior lecturer
[101] H. K. Dam, T. Tran, J. Grundy, and A. Ghose, “DeepSoft: A with Deakin University, Australia. His research
vision for a deep model of software,” in Proc. 24th ACM SIGSOFT interests include AI and its applications to bio-
Int. Symp. Found. Softw. Eng., 2016, pp. 944–947. medicine, sciences, and software. He have won
[102] M. White, C. Vendome, M. Linares-Vasquez, and D. Poshyvanyk, multiple paper awards and prizes including UAI
“Toward deep learning software repositories,” in Proc. 12th Work. 2009, CRESP 2014, Kaggle 2014, PAKDD 2015,
Conf. Mining Softw. Repositories, 2015, pp. 334–345. ACM SIGSOFT 2015, and ADMA 2016.
[103] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk,
“Deep learning code fragments for code clone detection,” in
Proc. 31st IEEE/ACM Int. Conf. Automated Softw. Eng., 2016, Trang Pham received the bachelor’s degree in
pp. 87–98. [Online]. Available: http://doi.acm.org/10.1145/ computer science from Vietnam National Univer-
2970276.2970326 sity, in 2014. She is working toward the PhD degree
[104] V. Raychev, M. Vechev, and E. Yahav, “Code completion at Deakin University. Currently, her research focu-
with statistical language models,” in Proc. 35th ACM SIGPLAN ses on deep learning for structured data. She has
Conf. Program. Language Des. Implementation, 2013, pp. 419–428. worked on different types of structured data such
[Online]. Available: http://dl.acm.org/citation.cfm?doid=2594291. as electronic medical records, molecular and net-
2594321 worked data, and software code.
[105] H. Dam, T. Tran, and T. Pham, “A deep language model for soft-
ware code,” arXiv:1608.02715, pp. 1–4, 2016. [Online]. Available:
http://arxiv.org/abs/1608.02715
[106] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,”
Aditya Ghose received the bachelor of engineer-
in Proc. 24th ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2016,
ing degree in computer science and engineering
pp. 631–642. [Online]. Available: http://doi.acm.org/10.1145/
from Jadavpur University, Kolkata, India, and the
2950290.2950334
MSc and PhD degrees in computing science
[107] R. Gupta, S. Pal, A. Kanade, and S. Shevade, “DeepFix: Fixing com-
from the University of Alberta, Canada. He also
mon C language errors by deep learning,” in Proc. 31st AAAI Conf.
spent parts of his PhD candidature at the Beck-
Artif. Intell., 2017, pp. 1345–1351. [Online]. Available: http://aaai.
man Institute, University of Illinois at Urbana
org/ocs/index.php/AAAI/AAAI17/paper/view/14603
Champaign and the University of Tokyo. He is a
[108] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality
professor of computer science with the University
of data with neural networks,” Sci., vol. 313, no. 5786, pp. 504–
of Wollongong. He leads a team conducting
507, 2006.
research into knowledge representation, agent
[109] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic
systems, services, business process management, software engineer-
features for defect prediction,” in Proc. Int. Conf. Softw. Eng., 2016,
ing and optimization and draws inspiration from the cross-fertilization of
pp. 297–308. [Online]. Available: http://dx.doi.org/10.1145/
ideas from this spread of research areas. He works closely with some of
2884781.2884804
the leading global IT firms. He is president of the Service Science Soci-
[110] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, “Deep learning for
ety of Australia and served as vice-president of CORE (2010-2014),
just-in-time defect prediction,” in Proc. IEEE Int. Conf. Softw.
Australia’s apex body for computing academics.
Quality Rel. Secur., 2015, pp. 17–26.
[111] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolu-
tional neural network for modelling sentences,” in Proc. 52nd Tim Menzies received the PhD degree from the
Annu. Meet. Assoc. Comput. Linguistics, 2014, pp. 655–665. University of New South Wales, in 1995. He is a
full professor in CS at North Carolina State Uni-
Morakot Choetkiertikul received the BS and MS versity, where he explores SE, data mining, AI,
degrees in computer science from Faculty of Infor- search-based SE, and open access science. He
mation and Communication Technology (ICT), is the author of more than 250 referred publica-
Mahidol University, Thailand. He is working toward tions and co-founder of the PROMISE confer-
the PhD degree in computer science and software ence series devoted to reproducible experiments
engineering in Faculty of Engineering and Informa- in SE (http://tiny.cc/seacraft). He also serves as
tion Sciences (EIS), University of Wollongong associated editor of many journals: the IEEE
(UOW), Australia. He is a part of Decision Systems Transactions on Software Engineering (2010 to
Lab (DSL). His research interests include empirical 2016), the ACM Transactions on Software Engineering Methodologies,
software engineering, software engineering analyt- the Empirical Software Engineering, the Automated Software Engineer-
ics, mining software repositories, and software pro- ing Journal, the Big Data Journal, the Information Software Technology,
cess improvement. For more details, see his home page: http://www.dsl. the IEEE Software, and the Software Quality Journal. He has served as
uow.edu.au/sasite/. co-general chair of ICSME’16 and co-PC chair for ASE’12, ICSE’15,
SSBSE’17. For more, see http://menzies.us.

Estimations Using UCP
No ratings yet
Estimations Using UCP
4 pages
Machine Learning Approaches To Estimating Software Development Effort
No ratings yet
Machine Learning Approaches To Estimating Software Development Effort
13 pages
Factors Influencing User Story Esti
100% (1)
Factors Influencing User Story Esti
20 pages
Effort Estimation in Agile Software Development A Systematic Literature Review
100% (1)
Effort Estimation in Agile Software Development A Systematic Literature Review
6 pages
W5 Planning
No ratings yet
W5 Planning
78 pages
Analysis of Effort Estimation Model in Traditional and Agile (USING METRICS TO IMPROVE AGILE METHODOLOGY)
No ratings yet
Analysis of Effort Estimation Model in Traditional and Agile (USING METRICS TO IMPROVE AGILE METHODOLOGY)
4 pages
A Deep Learning Model For Estimating Story Points
No ratings yet
A Deep Learning Model For Estimating Story Points
21 pages
Just Culture (Aviation Safety Management Systems)
100% (1)
Just Culture (Aviation Safety Management Systems)
0 pages
Ch05 Software Effort Estimation
No ratings yet
Ch05 Software Effort Estimation
85 pages
Oose Unit5
No ratings yet
Oose Unit5
51 pages
6 Week 6
No ratings yet
6 Week 6
50 pages
Ribu-Estimating O-O SW Projects With Use Cases
No ratings yet
Ribu-Estimating O-O SW Projects With Use Cases
147 pages
pm6 EffortEstimation
No ratings yet
pm6 EffortEstimation
35 pages
Unsolved Case Files Who Whacked Jack 01
No ratings yet
Unsolved Case Files Who Whacked Jack 01
11 pages
Software Effort Estimation: Software Development Project Management (CSC 4125)
No ratings yet
Software Effort Estimation: Software Development Project Management (CSC 4125)
31 pages
Estimation
No ratings yet
Estimation
26 pages
Notes 3 SPM
No ratings yet
Notes 3 SPM
22 pages
Estimation Session 4.6
No ratings yet
Estimation Session 4.6
29 pages
Chapter 05 - Software Effort Estimation IV
No ratings yet
Chapter 05 - Software Effort Estimation IV
44 pages
Ch.5 Software Effort Estimation
No ratings yet
Ch.5 Software Effort Estimation
14 pages
2023 - Agile Effort Estimation - Have We Solved The Problem Yet - Insights From A Replication Study
No ratings yet
2023 - Agile Effort Estimation - Have We Solved The Problem Yet - Insights From A Replication Study
19 pages
Software Project MGT
No ratings yet
Software Project MGT
31 pages
A Proposed Genetic Algorithm Model To Improve Effort Estimation in Agile Software Development
No ratings yet
A Proposed Genetic Algorithm Model To Improve Effort Estimation in Agile Software Development
9 pages
Group 4 - Deep Learning For Agile Effort Estimation
No ratings yet
Group 4 - Deep Learning For Agile Effort Estimation
17 pages
Z00900010220164078Z0090 Session 24 (Tutorial) 2016
No ratings yet
Z00900010220164078Z0090 Session 24 (Tutorial) 2016
44 pages
Carpentry LP 1
No ratings yet
Carpentry LP 1
7 pages
SE3M
No ratings yet
SE3M
17 pages
Software Project
No ratings yet
Software Project
18 pages
Function Point VS Story Point: Samir Khanal Bgsu
No ratings yet
Function Point VS Story Point: Samir Khanal Bgsu
27 pages
ComparingEffortEstimatesBasedOnUCP (BenteAnda)
No ratings yet
ComparingEffortEstimatesBasedOnUCP (BenteAnda)
13 pages
8 Tawosi2022saner
No ratings yet
8 Tawosi2022saner
12 pages
User Story Estimation: Point
No ratings yet
User Story Estimation: Point
39 pages
Group 29 (Software Project Estimation Report - )
No ratings yet
Group 29 (Software Project Estimation Report - )
10 pages
Final Report For Launching A Software House
69% (13)
Final Report For Launching A Software House
32 pages
Software Project Management Lecture Seven
No ratings yet
Software Project Management Lecture Seven
24 pages
An Effort Estimation Model For Agile Software Deve
No ratings yet
An Effort Estimation Model For Agile Software Deve
12 pages
Software Analogies
No ratings yet
Software Analogies
8 pages
Project Estimation Report
No ratings yet
Project Estimation Report
10 pages
2023-Effort Estimation in Agile Software Development Using Autoencoders
No ratings yet
2023-Effort Estimation in Agile Software Development Using Autoencoders
7 pages
spm3 - Estimation
No ratings yet
spm3 - Estimation
32 pages
Project Time and Cost Estimations
No ratings yet
Project Time and Cost Estimations
7 pages
A Comprehensive Machine Learning Framework For Evaluating Agility of A Software Development Organization
No ratings yet
A Comprehensive Machine Learning Framework For Evaluating Agility of A Software Development Organization
8 pages
2022-A Review of Effort Estimation in Agile Software Development Using Machine Learning Techniques
No ratings yet
2022-A Review of Effort Estimation in Agile Software Development Using Machine Learning Techniques
7 pages
Software Project Effort and Cost Estimation Techniques: January 2013
No ratings yet
Software Project Effort and Cost Estimation Techniques: January 2013
11 pages
Ijaz - PHD
No ratings yet
Ijaz - PHD
6 pages
Software Project Cost Estimation Using AI Techniqu
No ratings yet
Software Project Cost Estimation Using AI Techniqu
6 pages
Estimation by Function Point and COCOMO Method
No ratings yet
Estimation by Function Point and COCOMO Method
6 pages
Using Developers' Features To Estimate Story Points
No ratings yet
Using Developers' Features To Estimate Story Points
5 pages
Lab 3 Theory
No ratings yet
Lab 3 Theory
8 pages
Empirical Validation of Neural Network Models For Agile Software Effort Estimation Based On Story Points
No ratings yet
Empirical Validation of Neural Network Models For Agile Software Effort Estimation Based On Story Points
10 pages
Study 3-A - Review - of - Effort - Estimation - in - Agile - Software - Development - Using - Machine - Learning - Techniques-2022
No ratings yet
Study 3-A - Review - of - Effort - Estimation - in - Agile - Software - Development - Using - Machine - Learning - Techniques-2022
7 pages
Estimated Software Measurement Base On Use Case For Online Admission System
No ratings yet
Estimated Software Measurement Base On Use Case For Online Admission System
8 pages
Software Effort Estimation Using Feed Forward Backpropagation Neural Network
No ratings yet
Software Effort Estimation Using Feed Forward Backpropagation Neural Network
5 pages
Project Planning and Estimation: Chapters 23, 24
No ratings yet
Project Planning and Estimation: Chapters 23, 24
31 pages
1993 - Karner - Resource Estimation For Objectory Projects
No ratings yet
1993 - Karner - Resource Estimation For Objectory Projects
9 pages
Effort Estimation Tool Based On Use Case Points Method
No ratings yet
Effort Estimation Tool Based On Use Case Points Method
8 pages
Seppp 8
No ratings yet
Seppp 8
4 pages
An Efficient Effort and Cost Estimation Framework For Scrum Based Projects PDF
No ratings yet
An Efficient Effort and Cost Estimation Framework For Scrum Based Projects PDF
6 pages
I 201305 Schofield
No ratings yet
I 201305 Schofield
5 pages
Software Effort Estimation From Use Case Diagrams Using Nonlinear Regression Analysis
No ratings yet
Software Effort Estimation From Use Case Diagrams Using Nonlinear Regression Analysis
4 pages
Discount Rates: III: Relative Risk Measures
No ratings yet
Discount Rates: III: Relative Risk Measures
20 pages
Demystifying Story Points1 0
No ratings yet
Demystifying Story Points1 0
4 pages
A Novel Approach in Cost Estimation Based On Algorithmic Model
No ratings yet
A Novel Approach in Cost Estimation Based On Algorithmic Model
9 pages
N I Act Esaay
100% (2)
N I Act Esaay
55 pages
Arts and Crafts Movement - History, Influene and Important Figures (Contribution
No ratings yet
Arts and Crafts Movement - History, Influene and Important Figures (Contribution
66 pages
18CV54
No ratings yet
18CV54
4 pages
Business Requirement Document (BRD)
No ratings yet
Business Requirement Document (BRD)
8 pages
Humanities Chapter 3
No ratings yet
Humanities Chapter 3
26 pages
Microscopical Determination of The Vitrinite Reflectance of Coal
No ratings yet
Microscopical Determination of The Vitrinite Reflectance of Coal
6 pages
In Re Garcia
No ratings yet
In Re Garcia
1 page
Islamic Law of Evidence and Procedure
No ratings yet
Islamic Law of Evidence and Procedure
24 pages
Elite Daily Paul Kim Lawsuit 2018
100% (2)
Elite Daily Paul Kim Lawsuit 2018
27 pages
Quiz Analyzing Heritage
No ratings yet
Quiz Analyzing Heritage
2 pages
Current Affairs Supplement 2018
No ratings yet
Current Affairs Supplement 2018
46 pages
As 1789
No ratings yet
As 1789
2 pages
Giao An Tieng Anh 11 Hay Nhin La Muon Tai
No ratings yet
Giao An Tieng Anh 11 Hay Nhin La Muon Tai
315 pages
Test Answer Sheet 3 (En)
No ratings yet
Test Answer Sheet 3 (En)
18 pages
Reasonable Domain and Range
No ratings yet
Reasonable Domain and Range
11 pages
Assessing The Opportunities Available To Banks in Financing Horticulture Sub-Sector in Ethiopia
No ratings yet
Assessing The Opportunities Available To Banks in Financing Horticulture Sub-Sector in Ethiopia
12 pages
Pascal Programming Assignment 2
No ratings yet
Pascal Programming Assignment 2
5 pages
ESO Crafting Style Checklist
No ratings yet
ESO Crafting Style Checklist
54 pages
Comparative Table
No ratings yet
Comparative Table
9 pages
test 1 ЗНО 2021 без слушанья
No ratings yet
test 1 ЗНО 2021 без слушанья
7 pages
2022.09.27 OCE Axne
No ratings yet
2022.09.27 OCE Axne
4 pages
Comparative Study of Classifications of History
No ratings yet
Comparative Study of Classifications of History
2 pages
Assessment
No ratings yet
Assessment
3 pages
Middle East Airlines - Book Your Flight Tickets To Worldwide ..
No ratings yet
Middle East Airlines - Book Your Flight Tickets To Worldwide ..
5 pages
Autoplants PDF
No ratings yet
Autoplants PDF
6 pages