Story Point PDF
Abstract—Although there has been substantial research in software analytics for effort estimation in traditional software projects,
little work has been done for estimation in agile projects, especially estimating the effort required for completing user stories or issues.
Story points are the most common unit of measure used for estimating the effort involved in completing a user story or resolving an
issue. In this paper, we propose a prediction model for estimating story points based on a novel combination of two powerful deep
learning architectures: long short-term memory and recurrent highway network. Our prediction system is end-to-end trainable
from raw input data to prediction outcomes without any manual feature engineering. We offer a comprehensive dataset for story point-based estimation that contains 23,313 issues from 16 open source projects. An empirical evaluation demonstrates that our
approach consistently outperforms three common baselines (Random Guessing, Mean, and Median methods) and six alternatives
(e.g., using Doc2Vec and Random Forests) in Mean Absolute Error, Median Absolute Error, and the Standardized Accuracy.
Index Terms—Software analytics, effort estimation, story point estimation, deep learning
h_{l+1} = α_l h_l + (1 − α_l) σ_l(h_l),    (1)
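As a concrete sketch of the highway-style update in Eq. (1), the following is our own minimal NumPy illustration (not the authors' Theano code; function and parameter names are ours, and the tanh candidate and sigmoid gate are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_step(h, W_gate, W_h):
    """One highway-style update, cf. Eq. (1): a gate alpha carries the
    current state h forward and mixes in a non-linear transform of h."""
    alpha = sigmoid(W_gate @ h)      # carry gate alpha_l, in (0, 1)
    candidate = np.tanh(W_h @ h)     # non-linear transformation sigma_l(h_l)
    return alpha * h + (1.0 - alpha) * candidate
```

Because the gate interpolates between the identity path and the transformed path, gradients can flow through many stacked layers without vanishing, which is the motivation for using highway-style recurrence on top of the LSTM.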
…embedding and LSTM), which operate at the word level. Pre-training is effective when labels are not abundant. During pre-training, we do not use the ground-truth story points, but instead leverage two sources of information: the strong predictiveness of natural language, and the availability of free texts without labels (e.g., issue reports without story points). The first source comes from the property of language that the next word can be predicted from the previous words, thanks to grammar and common expressions. Thus, at each time step t, we can predict the next word w_{t+1} from the state h_t using the softmax function

P(w_{t+1} = k | w_{1:t}) = exp(U_k h_t) / Σ_{k'} exp(U_{k'} h_t),    (3)

where U_k is a free parameter. Essentially we are building a language model, i.e., P(s) = P(w_{1:n}), which can be factorized using the chain rule as P(w_1) Π_{t=1}^{n−1} P(w_{t+1} | w_{1:t}). We note that the probability of the first word, P(w_1), is estimated as the fraction of sequences in the corpus that start with w_1. At step t, h_t is computed by feeding h_{t−1} and w_t to the LSTM unit (see Fig. 2). Since w_t is a word embedding vector, Eq. (3) indirectly refers to the embedding matrix.

The language model can be learned by optimizing the log-loss −log P(s). However, the main bottleneck is computational: Eq. (3) costs |V| time to evaluate, where |V| is the vocabulary size, which can be hundreds of thousands for a big corpus. For that reason, we implemented an approximate but very fast alternative based on Noise-Contrastive Estimation [52], which reduces the evaluation time to M ≪ |V|, where M can be as small as 100. We also ran the pre-training multiple times against a validation set to choose the best model, using perplexity, a common intrinsic evaluation metric based on the log-loss, as the criterion for model selection and early stopping. A smaller perplexity implies a better language model. The word embedding matrix M ∈ R^{d×|V|} (which is first randomly initialized) and the initialization for the LSTM parameters are learned through this pre-training process.

4.2 Training Deep-SE

We have implemented the Deep-SE model in Python using Theano [53]. To simplify our model, we set the size of the memory cell in an LSTM unit and the size of a recurrent layer in RHWN to be the same as the embedding size. We tuned some important hyper-parameters (e.g., the embedding size and the number of hidden layers) by conducting experiments with different values, while for the other hyper-parameters we used the default values. This is discussed in more detail in the evaluation section.

Recall that the entire network can be reduced to a parameterized function that maps sequences of raw words (in issue reports) to story points. Let θ be the set of all parameters in the model. We define a loss function L(θ) that measures the quality of a particular set of parameters based on the difference between the predicted story points and the ground-truth story points in the training data. A setting of the parameters θ that produces predictions consistent with the ground-truth story points would have a very low loss L. Hence, learning is achieved through the optimization process of finding the set of parameters θ that minimizes the loss function.

Since every component in the model is differentiable, we use the popular stochastic gradient descent method to perform optimization: through backpropagation, the model parameters θ are updated in the opposite direction of the gradient of the loss function L(θ). In this search, a learning rate η is used to control how large a step we take to reach a (local) minimum. We use RMSprop, an adaptive stochastic gradient method (unpublished note by Geoffrey Hinton), which is known to work well for recurrent models. We tuned RMSprop by partitioning the data into mutually exclusive training, validation, and test sets and running the training multiple times. Specifically, the training set is used to learn a useful model. After each training epoch, the learned model was evaluated on the validation set, and its performance there was used to guide the choice of hyperparameters (e.g., the learning rate in gradient searches). Note that the validation set was not used to learn any of the model's parameters. The best performing model on the validation set was chosen to be evaluated on the test set. We also employed an early stopping strategy (see Section 5.4), i.e., monitoring the model's performance during the validation phase and stopping when the performance got worse: if the log-loss does not improve for ten consecutive runs, we terminate the training.

To prevent overfitting in our neural network, we implemented an effective technique called dropout in our model [54], where the elements of the input and output states are randomly set to zero during training. During testing, parameter averaging is used. In effect, dropout implicitly trains many models in parallel, all of which share the same parameter set. The final model parameters represent the average of the parameters across these models. Typically, the dropout rate is set at 0.5.

An important step prior to optimization is parameter initialization. Typically the parameters are initialized randomly, but our experience shows that a good initialization (through pre-training of the embedding and LSTM layers) helps learning converge faster to good solutions.

5 EVALUATION

The empirical evaluation we carried out aimed to answer the following research questions:

RQ1. Sanity check: Is the proposed approach suitable for estimating story points?
This sanity check requires us to compare our Deep-SE prediction model with the three common baseline benchmarks used in the context of effort estimation: Random Guessing, Mean Effort, and Median Effort. Random guessing is a naive benchmark used to assess whether an estimation model is useful [55]. It performs random sampling (with equal probability) over the set of issues with known story points, chooses one issue at random from the sample, and uses the story point value of that issue as the estimate for the target issue. Random guessing does not use any information associated with the target issue, so any useful estimation model should outperform it. Mean and Median Effort estimation are commonly used as baseline benchmarks for effort estimation [19]. They use the mean or median story points of the past
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 643
issues to estimate the story points of the target issue. Note that the samples used for all the naive baselines (i.e., Random Guessing, Mean Effort, and Median Effort) were drawn from the training set.

RQ2. Benefits of deep representation: Does the use of Recurrent Highway Nets provide more accurate story point estimates than using a traditional regression technique?
To answer this question, we replaced the Recurrent Highway Net component with a regressor for immediate prediction. Here, we compare our approach against four common regressors: Random Forests (RF), Support Vector Machine (SVM), Automatically Transformed Linear Model (ATLM), and Linear Regression (LR). We chose RF over other baselines since ensemble methods like RF, which combine the estimates from multiple estimators, are an effective method for effort estimation [20]. RF achieves a significant improvement over the decision tree approach by generating many classification and regression trees, each of which is built on a random resampling of the data, with a random subset of variables at each node split. Tree predictions are then aggregated through averaging. We used the issues in the validation set to fine-tune its parameters (i.e., the number of trees, the maximum depth of a tree, and the minimum number of samples). As for SVM, it has been widely used in software analytics (e.g., defect prediction) and document classification (e.g., sentiment analysis) [56]; for regression problems, SVM is known as Support Vector Regression (SVR). We also used the issues in the validation set to find the kernel type (e.g., linear, polynomial) for testing. We used the Automatically Transformed Linear Model [57], recently proposed as the baseline model for software effort estimation. Although ATLM is simple and requires no parameter tuning, it performs well over a range of project types in traditional effort estimation [57]. Since LR is the top layer of our approach, we also used LR as the immediate regressor after the LSTM layers to assess whether RHWN improves the predictive performance. We then compare the performance of these alternatives, namely LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR, against our Deep-SE model.

RQ3. Benefits of LSTM document representation: Does the use of LSTM for modeling issue reports provide more accurate results than the traditional Doc2Vec and Bag-of-Words approaches?
The most popular text representation is Bag-of-Words (BoW) [58], where a text is represented as a vector of word counts. For example, the title and description of issue XD-2970 in Fig. 1 would be converted into a sparse binary vector of vocabulary size, whose elements are mostly zeros, except for those at the positions designated to "standardize", "XD", "logging" and so on. However, BoW has two major weaknesses: it loses the ordering of the words and it ignores their semantics. For example, "Python", "Java", and "logging" are equally distant, while semantically "Python" should be closer to "Java" than to "logging". To address this issue, Doc2vec [59] (alternatively known as paragraph2vec), an unsupervised algorithm, learns fixed-length feature representations from texts (e.g., the title and description of issues). Each document is represented by a dense vector which is trained to predict the next words in the document.
Both the BoW and Doc2vec representations, however, effectively destroy the sequential nature of text. This question aims to explore whether LSTM, with its capability of modeling this sequential structure, would improve story point estimation. To answer it, we feed three different feature vectors (one learned by LSTM, the other two derived from the BoW technique and Doc2vec) to the same Random Forests regressor, and compare the predictive performance of the former (i.e., LSTM+RF) against that of the latter (i.e., BoW+RF and Doc2vec+RF). We used Gensim,1 a well-known implementation of Doc2vec, in our experiments.

RQ4. Cross-project estimation: Is the proposed approach suitable for cross-project estimation?
Story point estimation in new projects is often difficult due to a lack of training data. One common technique to address this issue is training a model using data from a (source) project and applying it to the new (target) project. Since our approach requires only the title and description of issues in the source and target projects, it is readily applicable to both within-project and cross-project estimation. In practice, however, story point estimation is known to be specific to teams and projects. Hence, this question aims to investigate whether our approach is suitable for cross-project estimation. We implemented Analogy-based estimation, called ABE0, which was proposed in previous work [60], [61], [62], [63] for cross-project estimation, and used it as a benchmark. The ABE0 estimation is based on the distances between individual issues. Specifically, the story point estimate for an issue in the target project is the mean of the story points of the k nearest issues from the source project. We used the Euclidean distance as the distance measure, the Bag-of-Words of the title and description as the features of an issue, and k = 3.

RQ5. Normalizing/adjusting story points: Does our approach still perform well with normalized/adjusted story points?
We ran our experiments again using new labels (i.e., normalized story points) to address the concern of whether our approach still performs well on adjusted ground truths. We adjusted the story points of each issue using a range of information, including the number of days from creation to resolved time, the development time, the number of comments, the number of users who commented on the issue, the number of times that an issue had its attributes changed, the number of users who changed the issue's attributes, the number of issue links, the number of affect versions, and the

1. https://radimrehurek.com/gensim/models/doc2vec.html
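The ABE0 baseline described under RQ4 can be sketched as follows. This is our own minimal illustration, not the authors' implementation; it assumes issues have already been vectorized (e.g., as Bag-of-Words arrays), and the function name is ours:

```python
import numpy as np

def abe0_estimate(target_vec, source_vecs, source_points, k=3):
    """ABE0: estimate the story points of a target issue as the mean story
    points of its k nearest source issues, using Euclidean distance over
    feature vectors (here, Bag-of-Words of title + description)."""
    dists = np.linalg.norm(source_vecs - target_vec, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest issues
    return float(np.mean(np.asarray(source_points)[nearest]))
```

The design is deliberately parameter-free apart from k, which is why ABE0 is a common cross-project benchmark: it needs no training on the target project.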
644 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019
TABLE 1
Descriptive Statistics of Our Story Point Dataset
Repo. Project Abb. # issues min SP max SP mean SP median SP mode SP var SP std SP mean TD length LOC
Apache Mesos ME 1,680 1 40 3.09 3 3 5.87 2.42 181.12 247,542+
Usergrid UG 482 1 8 2.85 3 3 1.97 1.40 108.60 639,110+
Appcelerator Appcelerator Studio AS 2,919 1 40 5.64 5 5 11.07 3.33 124.61 2,941,856#
Aptana Studio AP 829 1 40 8.02 8 8 35.46 5.95 124.61 6,536,521+
Titanium SDK/CLI TI 2,251 1 34 6.32 5 5 25.97 5.10 205.90 882,986+
DuraSpace DuraCloud DC 666 1 16 2.13 1 1 4.12 2.03 70.91 88,978+
Atlassian Bamboo BB 521 1 20 2.42 2 1 4.60 2.14 133.28 6,230,465#
Clover CV 384 1 40 4.59 2 1 42.95 6.55 124.48 890,020#
JIRA Software JI 352 1 20 4.43 3 5 12.35 3.51 114.57 7,070,022#
Moodle Moodle MD 1,166 1 100 15.54 8 5 468.53 21.65 88.86 2,976,645+
Lsstcorp Data Management DM 4,667 1 100 9.57 4 1 275.71 16.61 69.41 125,651*
Mulesoft Mule MU 889 1 21 5.08 5 5 12.24 3.50 81.16 589,212+
Mule Studio MS 732 1 34 6.40 5 5 29.01 5.39 70.99 16,140,452#
Spring Spring XD XD 3,526 1 40 3.70 3 1 10.42 3.23 78.47 107,916+
Talendforge Talend Data Quality TD 1,381 1 40 5.92 5 8 26.96 5.19 104.86 1,753,463#
Talend ESB TE 868 1 13 2.16 2 1 2.24 1.50 128.97 18,571,052#
Total 23,313
SP: story points, TD length: the number of words in the title and description of an issue, LOC: line of code
(+: LOC obtained from www.openhub.net, *: LOC from GitHub, and #: LOC from the reverse engineering)
number of fix versions. This information reflects the actual effort, and we thus refer to these attributes as effort indicators. The values of these indicators were extracted after the issue was completed. The normalized story point (SP_normalized) is then computed as follows:

SP_normalized = 0.5 × SP_original + 0.5 × SP_nearest,

where SP_original is the original story point, and SP_nearest is the mean of the story points of the 10 nearest issues based on their actual effort indicators. Note that we use K-Nearest Neighbour (KNN) to find the nearest issues and the Euclidean metric to measure the distance. We ran the experiment on the new labels (i.e., SP_normalized) using our proposed approach against all other baseline benchmark methods.

RQ6. Comparison against the existing approach: How does our approach perform against existing approaches in story point estimation?
Recently, Porru et al. [64] also proposed an estimation model for story points. Their approach uses the type of an issue, the component(s) assigned to it, and the TF-IDF derived from its summary and description as features representing the issue. They also performed univariate feature selection to choose a subset of features for building a classifier. By contrast, our approach automatically learns semantic features which represent the actual meaning of the issue's report, thus potentially providing more accurate estimates. To answer this research question, we ran Deep-SE on the dataset used by Porru et al., re-implemented their approach, and compared the results produced by the two approaches.

5.1 Story Point Datasets

To collect data for our dataset, we looked for issues that were estimated with story points. JIRA is one of the few widely-used issue tracking systems that support agile development (and thus story point estimation) with its JIRA Agile plugin. Hence, we selected a diverse collection of nine major open source repositories that use the JIRA issue tracking system: Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsstcorp, MuleSoft, Spring, and Talendforge. We then used the Representational State Transfer (REST) API provided by JIRA to query and collect those issue reports. We collected all the issues which were assigned a story point measure from the nine open source repositories up until August 8, 2016. We then extracted the story point, title, and description from the collected issue reports. Each repository contains a number of projects, and we chose to include in our dataset only projects that had more than 300 issues with story points. Issues that were assigned a story point of zero (e.g., a non-reproducible bug), as well as issues with a negative or unrealistically large story point (e.g., greater than 100), were filtered out. Ultimately, about 2.66 percent of the collected issues were filtered out in this fashion. In total, our dataset has 23,313 issues with story points from 16 different projects: Apache Mesos (ME), Apache Usergrid (UG), Appcelerator Studio (AS), Aptana Studio (AP), Titanium SDK/CLI (TI), DuraCloud (DC), Bamboo (BB), Clover (CV), JIRA Software (JI), Moodle (MD), Data Management (DM), Mule (MU), Mule Studio (MS), Spring XD (XD), Talend Data Quality (TD), and Talend ESB (TE). Table 1 summarizes the descriptive statistics of all the projects in terms of the minimum, maximum, mean, median, mode, variance, and standard deviation of the story points assigned, and the average length of the title and description of issues in each project. These sixteen projects bring diversity to our dataset in terms of both application domains and project characteristics. Specifically, they differ in the following aspects: number of observations (from 352 to 4,667 issues), technical characteristics (different programming languages and different application domains), sizes (from 88 KLOC to 18 million LOC), and team characteristics (different team structures and participants from different regions).

Since story points rate the relative effort of work between user stories, they are usually measured on a certain scale (e.g., 1, 2, 4, 8, etc.) to facilitate comparison (e.g., one user story is double the effort of another) [25]. The story points used in planning poker typically follow a Fibonacci scale, i.e., 1, 2, 3, 5, 8, 13, 21, and so on [24]. Among the projects we studied, only seven (i.e., Usergrid, Talend ESB, Talend Data Quality, Mule Studio, Mule, Appcelerator Studio, and
Aptana Studio) followed the Fibonacci scale, while the other nine projects did not use any scale. When our prediction system gives an estimate, we did not round it to the nearest story point value on the Fibonacci scale. An alternative approach (for those projects which follow a Fibonacci scale) is treating this as a classification problem: each value on the Fibonacci scale represents a class. The limitations of this approach are that the number of classes must be pre-determined and that it is not applicable to projects that do not follow this scale. We note, however, that the Fibonacci scale is only a guide for estimating story points. In practice, teams may follow other common scales, define their own scales, or not follow any scale at all. Our approach does not rely on these specific scales, thus making it applicable to a wider range of projects. It predicts a scalar value (regression) rather than a class (classification).

5.2 Experimental Setting

We performed experiments on the sixteen projects in our dataset (see Table 1 for their details). To mimic a real deployment scenario, in which the prediction for a current issue is made using knowledge from estimations of past issues, the issues in each project were split into a training set (60 percent of the issues), a development/validation set (20 percent), and a test set (20 percent) based on their creation time. The issues in the training set and the validation set were created before the issues in the test set, and the issues in the training set were also created before the issues in the validation set.

5.3 Performance Measures

There is a range of measures used in evaluating the accuracy of an effort estimation model. Most of them are based on the Absolute Error, i.e., |ActualSP − EstimatedSP|, where ActualSP is the real story point assigned to an issue and EstimatedSP is the outcome given by an estimation model. The Mean Magnitude of Relative Error (MRE), the Mean Percentage Error, and Prediction at level l [65], i.e., Pred(l), have also been used in effort estimation. However, a number of studies [66], [67], [68], [69] have found that those measures are biased towards underestimation and are not stable when comparing effort estimation models. Thus, the Mean Absolute Error (MAE), the Median Absolute Error (MdAE), and the Standardized Accuracy (SA) have recently been recommended for comparing the performance of effort estimation models [19], [70]. MAE is defined as

MAE = (1/N) Σ_{i=1}^{N} |ActualSP_i − EstimatedSP_i|,

where N is the number of issues used for evaluating the performance (i.e., the test set), ActualSP_i is the actual story point, and EstimatedSP_i is the estimated story point for issue i.

We also report the Median Absolute Error, since it is more robust to large outliers. MdAE is defined as

MdAE = Median{|ActualSP_i − EstimatedSP_i|},

where 1 ≤ i ≤ N.

SA is based on MAE and is defined as

SA = (1 − MAE / MAE_rguess) × 100,

where MAE_rguess is the MAE of a large number (e.g., 1,000 runs) of random guesses. SA measures the improvement over random guessing. Predictive performance is better with a lower MAE or a higher SA.

We assess the story point estimates produced by the estimation models using MAE, MdAE, and SA. To compare the performance of two estimation models, we tested the statistical significance of the absolute errors achieved by the two models using the Wilcoxon Signed Rank Test [71]. The Wilcoxon test is a safe test since it makes no assumptions about the underlying data distributions. The null hypothesis here is: "the absolute errors provided by an estimation model are not different from those provided by another estimation model". We set the confidence limit at 0.05 and also applied Bonferroni correction [72] (0.05/K, where K is the number of statistical tests) when multiple tests were performed.

In addition, we also employed a non-parametric effect size measure, the correlated samples case of the Vargha and Delaney Â_XY statistic [73], to assess whether the effect size is interesting. The Â_XY measure is chosen since it is agnostic to the underlying distribution of the data and is suitable for assessing randomized algorithms in software engineering generally [74] and effort estimation in particular [19]. Specifically, given a performance measure (the Absolute Error from each estimation, in our case), Â_XY measures the probability that estimation model X achieves better results (with respect to the performance measure) than estimation model Y. We note that this falls into the correlated samples case of Vargha and Delaney [73], where the Absolute Error is derived by applying different estimation methods to the same data (i.e., the same issues). We thus use the following formula to calculate the stochastic superiority value between two estimation methods:

Â_XY = [#(X < Y) + 0.5 × #(X = Y)] / n,

where #(X < Y) is the number of issues for which the Absolute Error from X is less than that from Y, #(X = Y) is the number of issues for which they are equal, and n is the number of issues. We also compute the average of the stochastic superiority measures (Â_iu) of our approach against each of the others using the following formula:

Â_iu = ( Σ_{k≠i} Â_ik ) / (l − 1),

where Â_ik are the pairwise stochastic superiority values (Â_XY) for all (i, k) pairs of estimation methods, k = 1, ..., l, and l is the number of estimation methods; e.g., variable i refers to Deep-SE and l = 4 when comparing Deep-SE against the Random, Mean, and Median methods.

5.4 Hyper-Parameter Settings for Training a Deep-SE Model

We focused on tuning two important hyper-parameters: the number of word embedding dimensions and the number of
TABLE 2
The Coefficient and p-Value of the Spearman's Rank and Pearson Correlations of the Story Points Against the Development Time

Project                Spearman's rank          Pearson correlation
                       coefficient  p-value     coefficient  p-value
Appcelerator Studio    0.330        <0.001      0.311        <0.001
Aptana Studio          0.241        <0.001      0.325        <0.001
Bamboo                 0.505        <0.001      0.476        <0.001
Clover                 0.551        <0.001      0.418        <0.001
Data Management        0.753        <0.001      0.769        <0.001
DuraCloud              0.225        <0.001      0.393        <0.001
JIRA Software          0.512        <0.001      0.560        <0.001
Mesos                  0.615        <0.001      0.766        <0.001
Moodle                 0.791        <0.001      0.816        <0.001
Mule                   0.711        <0.001      0.722        <0.001
Mule Studio            0.630        <0.001      0.565        <0.001
Spring XD              0.486        <0.001      0.614        <0.001
Talend Data Quality    0.390        <0.001      0.370        <0.001
Talend ESB             0.504        <0.001      0.524        <0.001
Titanium SDK/CLI       0.322        <0.001      0.305        <0.001
Usergrid               0.212        0.005       0.263        0.001

…suggests that the estimations obtained with our approach, Deep-SE, are better than those achieved by using the Mean, Median, and Random estimates. Deep-SE consistently outperforms all three baselines in all sixteen projects. Our approach improved between 3.29 percent (in project MS) and 57.71 percent (in project BB) in terms of MAE, between 11.71 percent (in MU) and 73.86 percent (in CV) in terms of MdAE, and between 20.83 percent (in MS) and 449.02 percent (in MD) in terms of SA over the Mean method. The improvements of our approach over the Median method are between 2.12 percent (in MS) and 52.90 percent (in JI) in MAE, between 0.50 percent (in MS) and 63.50 percent (in ME) in MdAE, and between 2.70 percent (in DC) and 328.82 percent (in JI) in SA. Overall, the improvement achieved by Deep-SE over the Mean and Median methods is 34.06 and 26.77 percent respectively in terms of MAE, averaging across all projects.

We note that the results achieved by the estimation models vary between projects. For example, Deep-SE achieved 0.64 MAE in the Talend ESB (TE) project, while it achieved 5.97 MAE in the Moodle (MD) project. The distribution of story points may be the cause of this variation: the standard deviation of story points in TE is only 1.50, while that in MD is 21.65 (see Table 1).

TABLE 3
Evaluation Results of Deep-SE, the Mean and Median Methods (the Best Results Are Highlighted in Bold)

Proj Method   MAE  MdAE  SA    | Proj Method   MAE   MdAE  SA
ME   Deep-SE  1.02 0.73  59.84 | JI   Deep-SE  1.38  1.09  59.52
     mean     1.64 1.78  35.61 |      mean     2.48  2.15  27.06
     median   1.73 2.00  32.01 |      median   2.93  2.00  13.88
UG   Deep-SE  1.03 0.80  52.66 | MD   Deep-SE  5.97  4.93  50.29
     mean     1.48 1.23  32.13 |      mean     10.90 12.11 9.16
     median   1.60 1.00  26.29 |      median   7.18  6.00  40.16
AS   Deep-SE  1.36 0.58  60.26 | DM   Deep-SE  3.77  2.22  47.87
     mean     2.08 1.52  39.02 |      mean     5.29  4.55  26.85
     median   1.84 1.00  46.17 |      median   4.82  3.00  33.38
AP   Deep-SE  2.71 2.52  42.58 | MU   Deep-SE  2.18  1.96  40.09
     mean     3.15 3.46  33.30 |      mean     2.59  2.22  28.82
     median   3.71 4.00  21.54 |      median   2.69  2.00  26.07
TI   Deep-SE  1.97 1.34  55.92 | MS   Deep-SE  3.23  1.99  17.17
     mean     3.05 1.97  31.59 |      mean     3.34  2.68  14.21
     median   2.47 2.00  44.65 |      median   3.30  2.00  15.42
DC   Deep-SE  0.68 0.53  69.92 | XD   Deep-SE  1.63  1.31  46.82
     mean     1.30 1.14  42.88 |      mean     2.27  2.53  26.00
     median   0.73 1.00  68.08 |      median   2.07  2.00  32.55
BB   Deep-SE  0.74 0.61  71.24 | TD   Deep-SE  2.97  2.92  48.28
     mean     1.75 1.31  32.11 |      mean     4.81  5.08  16.18
     median   1.32 1.00  48.72 |      median   3.87  4.00  32.43
CV   Deep-SE  2.11 0.80  50.45 | TE   Deep-SE  0.64  0.59  69.67
     mean     3.49 3.06  17.84 |      mean     1.14  0.91  45.86
     median   2.84 2.00  33.33 |      median   1.16  1.00  44.44

MAE and MdAE: the lower the better; SA: the higher the better.

TABLE 4
Comparison with the Effort Estimation Benchmarks Using the Wilcoxon Test and Â_XY Effect Size (in Brackets)

Deep-SE vs  Mean           Median         Random         Â_iu
ME          <0.001 [0.77]  <0.001 [0.81]  <0.001 [0.90]  0.83
UG          <0.001 [0.79]  <0.001 [0.79]  <0.001 [0.81]  0.80
AS          <0.001 [0.78]  <0.001 [0.78]  <0.001 [0.91]  0.82
AP          0.040 [0.69]   <0.001 [0.79]  <0.001 [0.84]  0.77
TI          <0.001 [0.77]  <0.001 [0.72]  <0.001 [0.88]  0.79
DC          <0.001 [0.80]  0.415 [0.54]   <0.001 [0.81]  0.72
BB          <0.001 [0.78]  <0.001 [0.78]  <0.001 [0.85]  0.80
CV          <0.001 [0.75]  <0.001 [0.70]  <0.001 [0.91]  0.79
JI          <0.001 [0.76]  <0.001 [0.79]  <0.001 [0.79]  0.78
MD          <0.001 [0.81]  <0.001 [0.75]  <0.001 [0.80]  0.79
DM          <0.001 [0.69]  <0.001 [0.59]  <0.001 [0.75]  0.68
MU          0.003 [0.73]   <0.001 [0.73]  <0.001 [0.82]  0.76
MS          0.799 [0.56]   0.842 [0.56]   <0.001 [0.69]  0.60
XD          <0.001 [0.70]  <0.001 [0.70]  <0.001 [0.78]  0.73
TD          <0.001 [0.86]  <0.001 [0.85]  <0.001 [0.87]  0.86
TE          <0.001 [0.73]  <0.001 [0.73]  <0.001 [0.92]  0.79

Table 4 shows the results of the Wilcoxon test (together with the corresponding Â_XY effect size, in brackets) measuring the statistical significance and effect size of the improved accuracy achieved by Deep-SE over the baselines: Mean Effort, Median Effort, and Random Guessing. In 45/48 cases, our Deep-SE significantly outperforms the baselines after applying Bonferroni correction, with effect sizes greater than 0.5. Moreover, the average stochastic superiority (Â_iu) of our approach against the baselines is greater than 0.7 in most cases. The highest Â_iu, 0.86 in the Talend Data Quality (TD) project, can be considered a large effect size (Â_XY > 0.8).

We note that the improvement brought by our approach over the baselines was not significant for project MS. One possible reason is that the size of the training and pre-training data for MS is small, and deep learning techniques tend to perform well with large training samples.

Our approach outperforms the baselines, thus passing the sanity check required by RQ1.

RQ2: Benefits of Deep Representation

Table 5 shows the MAE, MdAE, and SA achieved by Deep-SE, which uses Recurrent Highway Networks for deep representation of issue reports, against Random Forests, Support Vector Machine, Automatically Transformed Linear Model, and Linear Regression coupled with LSTM (i.e., LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR). The distribution of the Absolute Error is reported in Appendix A.2, available in the online supplemental material. When we use MAE, MdAE, and SA as evaluation criteria, Deep-SE is still the best approach, consistently outperforming LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR across all sixteen projects.
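For reference, the three evaluation measures used throughout these tables (MAE, MdAE, and SA, as defined in Section 5.3) can be sketched as follows. This is our own minimal illustration; function names are ours, and the MAE of random guessing must be supplied by the caller from repeated random sampling:

```python
import statistics

def mae(actual, estimated):
    """Mean Absolute Error over paired actual/estimated story points."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

def mdae(actual, estimated):
    """Median Absolute Error: more robust to large outliers than MAE."""
    return statistics.median(abs(a - e) for a, e in zip(actual, estimated))

def sa(actual, estimated, mae_rguess):
    """Standardized Accuracy: percentage improvement over random guessing."""
    return (1.0 - mae(actual, estimated) / mae_rguess) * 100.0
```

An SA of 0 means the model is no better than random guessing; higher SA (and lower MAE/MdAE) is better.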
TABLE 5
Evaluation Results of Deep-SE, LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR (the Best Results Are Highlighted in Bold)

Proj Method     MAE  MdAE    SA   Proj Method     MAE  MdAE    SA
ME   Deep-SE   1.02  0.73  59.84  JI   Deep-SE   1.38  1.09  59.52
     lstm+rf   1.08  0.90  57.57       lstm+rf   1.71  1.27  49.71
     lstm+svm  1.07  0.90  58.02       lstm+svm  2.04  1.89  40.05
     lstm+atlm 1.08  0.95  57.60       lstm+atlm 2.10  1.95  38.26
     lstm+lr   1.10  0.96  56.94       lstm+lr   2.10  1.95  38.26
UG   Deep-SE   1.03  0.80  52.66  MD   Deep-SE   5.97  4.93  50.29
     lstm+rf   1.07  0.85  50.70       lstm+rf   9.86  9.69  17.86
     lstm+svm  1.06  1.04  51.23       lstm+svm  6.70  5.44  44.19
     lstm+atlm 1.40  1.20  35.55       lstm+atlm 9.97  9.61  16.92
     lstm+lr   1.40  1.20  35.55       lstm+lr   9.97  9.61  16.92
AS   Deep-SE   1.36  0.58  60.26  DM   Deep-SE   3.77  2.22  47.87
     lstm+rf   1.62  1.40  52.38       lstm+rf   4.51  3.69  37.71
     lstm+svm  1.46  1.42  57.20       lstm+svm  4.20  2.87  41.93
     lstm+atlm 1.59  1.30  53.29       lstm+atlm 4.70  3.74  35.01
     lstm+lr   1.68  1.46  50.78       lstm+lr   5.30  3.66  26.68
AP   Deep-SE   2.71  2.52  42.58  MU   Deep-SE   2.18  1.96  40.09
     lstm+rf   2.96  2.80  37.34       lstm+rf   2.20  2.21  38.73
     lstm+svm  3.06  2.90  35.26       lstm+svm  2.28  2.89  37.44
     lstm+atlm 3.06  2.76  35.21       lstm+atlm 2.46  2.39  32.51
     lstm+lr   3.75  3.66  20.63       lstm+lr   2.46  2.39  32.51
TI   Deep-SE   1.97  1.34  55.92  MS   Deep-SE   3.23  1.99  17.17
     lstm+rf   2.32  1.97  48.02       lstm+rf   3.30  2.77  15.30
     lstm+svm  2.00  2.10  55.20       lstm+svm  3.31  3.09  15.10
     lstm+atlm 2.51  2.03  43.87       lstm+atlm 3.42  2.75  12.21
     lstm+lr   2.71  2.31  39.32       lstm+lr   3.42  2.75  12.21
DC   Deep-SE   0.68  0.53  69.92  XD   Deep-SE   1.63  1.31  46.82
     lstm+rf   0.69  0.62  69.52       lstm+rf   1.81  1.63  40.99
     lstm+svm  0.75  0.90  67.02       lstm+svm  1.80  1.77  41.33
     lstm+atlm 0.87  0.59  61.57       lstm+atlm 1.83  1.65  40.45
     lstm+lr   0.80  0.67  64.96       lstm+lr   1.85  1.72  39.63
BB   Deep-SE   0.74  0.61  71.24  TD   Deep-SE   2.97  2.92  48.28
     lstm+rf   1.01  1.00  60.95       lstm+rf   3.89  4.37  32.14
     lstm+svm  0.81  1.00  68.55       lstm+svm  3.49  3.37  39.13
     lstm+atlm 1.97  1.78  23.70       lstm+atlm 3.86  4.11  32.71
     lstm+lr   1.26  1.16  51.24       lstm+lr   3.79  3.67  33.88
CV   Deep-SE   2.11  0.80  50.45  TE   Deep-SE   0.64  0.59  69.67
     lstm+rf   3.08  2.77  27.58       lstm+rf   0.66  0.65  68.51
     lstm+svm  2.50  2.32  41.22       lstm+svm  0.70  0.90  66.61
     lstm+atlm 3.11  2.49  26.90       lstm+atlm 0.70  0.72  66.51
     lstm+lr   3.36  2.76  21.07       lstm+lr   0.77  0.71  63.20

MAE and MdAE: the lower the better; SA: the higher the better.

TABLE 6
Comparison of the Recurrent Highway Net Against Random Forests, Support Vector Machine, Automatically Transformed Linear Model, and Linear Regression Using the Wilcoxon Test and Â12 Effect Size (in Brackets)

Deep-SE vs  LSTM+RF        LSTM+SVM       LSTM+ATLM      LSTM+LR        Aiu
ME          <0.001 [0.57]  <0.001 [0.54]  <0.001 [0.59]  <0.001 [0.59]  0.57
UG          0.004 [0.59]   0.010 [0.55]   <0.001 [1.00]  <0.001 [0.73]  0.72
AS          <0.001 [0.69]  <0.001 [0.51]  <0.001 [0.71]  <0.001 [0.75]  0.67
AP          <0.001 [0.60]  <0.001 [0.52]  <0.001 [0.62]  <0.001 [0.64]  0.60
TI          <0.001 [0.65]  0.007 [0.51]   <0.001 [0.69]  <0.001 [0.71]  0.64
DC          0.406 [0.55]   0.015 [0.60]   <0.001 [0.97]  0.024 [0.58]   0.68
BB          <0.001 [0.73]  0.007 [0.60]   <0.001 [0.84]  <0.001 [0.75]  0.73
CV          <0.001 [0.70]  0.140 [0.63]   <0.001 [0.82]  0.001 [0.70]   0.71
JI          0.006 [0.71]   0.001 [0.67]   0.002 [0.89]   <0.001 [0.79]  0.77
MD          <0.001 [0.76]  <0.001 [0.57]  <0.001 [0.74]  <0.001 [0.69]  0.69
DM          <0.001 [0.62]  <0.001 [0.56]  <0.001 [0.61]  <0.001 [0.62]  0.60
MU          0.846 [0.53]   0.005 [0.62]   0.009 [0.67]   0.003 [0.64]   0.62
MS          0.502 [0.53]   0.054 [0.50]   <0.001 [0.82]  0.195 [0.56]   0.60
XD          <0.001 [0.63]  <0.001 [0.57]  <0.001 [0.65]  <0.001 [0.60]  0.61
TD          <0.001 [0.78]  <0.001 [0.68]  <0.001 [0.70]  <0.001 [0.70]  0.72
TE          0.020 [0.53]   0.002 [0.59]   <0.001 [0.66]  0.006 [0.65]   0.61

Using RHWN improved over RF between 0.91 percent (in MU) and 39.45 percent (in MD) in MAE, between 5.88 percent (in UG) and 71.12 percent (in CV) in MdAE, and between 0.58 percent (in DC) and 181.58 percent (in MD) in SA. The improvements of RHWN over SVM are between 1.50 percent (in TI) and 32.35 percent (in JI) in MAE, between 9.38 percent (in MD) and 65.52 percent (in CV) in MdAE, and between 1.30 percent (in TI) and 48.61 percent (in JI) in SA. In terms of ATLM, RHWN improved over it between 5.56 percent (in MS) and 62.44 percent (in BB) in MAE, between 8.70 percent (in AP) and 67.87 percent (in CV) in MdAE, and between 3.89 percent (in ME) and 200.59 percent (in BB) in SA. Overall, RHWN improved MAE by 9.63 percent over SVM, 13.96 percent over RF, 21.84 percent over ATLM, and 23.24 percent over LR, averaging across all projects.

In addition, the results of the Wilcoxon test comparing our approach (Deep-SE) against LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR are shown in Table 6. The improvement of our approach over LSTM+RF, LSTM+SVM, and LSTM+ATLM remains significant after applying p-value correction, with an effect size greater than 0.5 in 59/64 cases. In most cases, when comparing the proposed model against LSTM+RF, LSTM+SVM, LSTM+ATLM, and LSTM+LR, the effect sizes are small (between 0.5 and 0.6). A major part of these improvements was brought by our use of the deep LSTM architecture to model the textual description of an issue. The use of recurrent highway networks (on top of LSTM) has also improved the predictive performance, though not with effects as large as those of the LSTM itself (especially for those projects which have a very small number of issues). However, our approach, Deep-SE, achieved Aiu greater than 0.6 in most cases.

The proposed approach of using Recurrent Highway Networks is effective in building a deep representation of issue reports and consequently improving story point estimation.

RQ3: Benefits of LSTM Document Representation
To study the benefits of using LSTM in representing issue reports, we compared the accuracy achieved by Random Forests using the features derived from LSTM against that using the features derived from BoW and Doc2vec. For a fair comparison, we used Random Forests as the regressor in all settings; the results are reported in Table 7 (see the distribution of the Absolute Error in Appendix A.3, available in the online supplemental material). LSTM performs better than BoW and Doc2vec with respect to the MAE, MdAE, and SA measures in twelve of the sixteen projects (e.g., ME, UG, and AS). LSTM improved MAE by 4.16 and 11.05 percent over Doc2vec and BoW, respectively, averaging across all projects.

Among those twelve projects, LSTM improved over BoW between 0.30 percent (in MS) and 28.13 percent (in DC) in terms of MAE, between 1.06 percent (in AP) and 45.96 percent (in JI) in terms of MdAE, and between 0.67 percent (in AP) and 47.77 percent (in TD) in terms of SA. It also improved over Doc2vec
CHOETKIERTIKUL ET AL.: A DEEP LEARNING MODEL FOR ESTIMATING STORY POINTS 649
TABLE 7
Evaluation Results of LSTM+RF, BoW+RF, and Doc2vec+RF (the Best Results Are Highlighted in Bold)

Proj Method    MAE  MdAE    SA   Proj Method    MAE  MdAE    SA
ME   lstm+rf  1.08  0.90  57.57  JI   lstm+rf  1.71  1.27  49.71
     bow+rf   1.31  1.34  48.66       bow+rf   2.10  2.35  38.34
     d2v+rf   1.14  0.98  55.28       d2v+rf   2.10  2.14  38.29
UG   lstm+rf  1.07  0.85  50.70  MD   lstm+rf  9.86  9.69  17.86
     bow+rf   1.19  1.28  45.24       bow+rf  10.20 10.22  15.07
     d2v+rf   1.12  0.92  48.47       d2v+rf   8.02  9.87  33.19
AS   lstm+rf  1.62  1.40  52.38  DM   lstm+rf  4.51  3.69  37.71
     bow+rf   1.83  1.53  46.34       bow+rf   4.78  3.98  33.84
     d2v+rf   1.62  1.41  52.38       d2v+rf   4.71  3.99  34.87
AP   lstm+rf  2.96  2.80  37.34  MU   lstm+rf  2.20  2.21  38.73
     bow+rf   2.97  2.83  37.09       bow+rf   2.31  2.54  36.64
     d2v+rf   3.20  2.91  32.29       d2v+rf   2.21  2.69  39.36
TI   lstm+rf  2.32  1.97  48.02  MS   lstm+rf  3.30  2.77  15.30
     bow+rf   2.58  2.30  42.15       bow+rf   3.31  2.57  15.58
     d2v+rf   2.41  2.16  46.02       d2v+rf   3.40  2.93  12.79
DC   lstm+rf  0.69  0.62  69.52  XD   lstm+rf  1.81  1.63  40.99
     bow+rf   0.96  1.11  57.78       bow+rf   1.98  1.72  35.56
     d2v+rf   0.77  0.77  66.14       d2v+rf   1.88  1.73  38.72
BB   lstm+rf  1.01  1.00  60.95  TD   lstm+rf  3.89  4.37  32.14
     bow+rf   1.34  1.26  48.06       bow+rf   4.49  5.05  21.75
     d2v+rf   1.12  1.16  56.51       d2v+rf   4.33  4.80  24.48
CV   lstm+rf  3.08  2.77  27.58  TE   lstm+rf  0.66  0.65  68.51
     bow+rf   2.98  2.93  29.91       bow+rf   0.86  0.69  58.89
     d2v+rf   3.16  2.79  25.70       d2v+rf   0.70  0.89  66.61

MAE and MdAE: the lower the better; SA: the higher the better.

TABLE 9
Mean Absolute Error (MAE) on Cross-Project Estimation and Comparison of Deep-SE and ABE0 Using the Wilcoxon Test and ÂXY Effect Size (in Brackets)

Source Target  Deep-SE  ABE0  Deep-SE vs ABE0
(i) within-repository
ME     UG      1.07     1.23  <0.001 [0.78]
UG     ME      1.14     1.22  0.012 [0.52]
AS     AP      2.75     3.08  <0.001 [0.67]
AS     TI      1.99     2.56  <0.001 [0.70]
AP     AS      2.85     3.00  0.051 [0.55]
AP     TI      3.41     3.53  0.003 [0.56]
MU     MS      3.14     3.55  0.041 [0.55]
MS     MU      2.31     2.64  0.030 [0.56]
Avg            2.33     2.60
(ii) cross-repository
AS     UG      1.57     2.04  0.004 [0.61]
AS     ME      2.08     2.14  0.022 [0.51]
MD     AP      5.37     6.95  <0.001 [0.58]
MD     TI      6.36     7.10  0.097 [0.54]
MD     AS      5.55     6.77  <0.001 [0.61]
DM     TI      2.67     3.94  <0.001 [0.64]
UG     MS      4.24     4.45  0.005 [0.54]
ME     MU      2.70     2.97  0.015 [0.53]
Avg            3.82     4.55
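ABE0, the analogy-based baseline compared against Deep-SE in Table 9, has a standard form: retrieve the k most similar past issues and return the mean of their known story points. Below is a minimal sketch; the Euclidean feature space and the value of k are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def abe0_estimate(train_features, train_points, query_features, k=3):
    # ABE0: for each query issue, find the k nearest training issues
    # (Euclidean distance over whatever feature vectors are used) and
    # average their known story points as the estimate.
    train_features = np.asarray(train_features, dtype=float)
    train_points = np.asarray(train_points, dtype=float)
    estimates = []
    for q in np.asarray(query_features, dtype=float):
        dists = np.linalg.norm(train_features - q, axis=1)
        nearest = np.argsort(dists)[:k]
        estimates.append(train_points[nearest].mean())
    return np.array(estimates)
```

In the cross-project setting of Table 9, the training issues would come from the source project and the queries from the target project.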
between 0.45 percent (in MU) and 18.57 percent (in JI) in terms of MAE, between 0.71 percent (in AS) and 40.65 percent (in JI) in terms of MdAE, and between 2.85 percent (in TE) and 31.29 percent (in TD) in terms of SA.

We acknowledge that BoW and Doc2vec perform better than LSTM in some cases. For example, in the Moodle project (MD), D2V+RF performed better than LSTM+RF in MAE and SA: it achieved 8.02 MAE and 33.19 SA. This could reflect that it is the combination of LSTM and RHWN that significantly improves the accuracy of the estimations.

TABLE 8
Comparison of Random Forests with LSTM, Random Forests with BoW, and Random Forests with Doc2vec Using the Wilcoxon Test and ÂXY Effect Size (in Brackets)

LSTM vs  BoW            Doc2Vec        Aiu
ME       <0.001 [0.70]  0.142 [0.53]   0.62
UG       <0.001 [0.71]  0.135 [0.60]   0.66
AS       <0.001 [0.66]  <0.001 [0.51]  0.59
AP       0.093 [0.51]   0.144 [0.52]   0.52
TI       <0.001 [0.67]  <0.001 [0.55]  0.61
DC       <0.001 [0.73]  0.008 [0.59]   0.66
BB       <0.001 [0.77]  0.002 [0.66]   0.72
CV       0.109 [0.61]   0.581 [0.57]   0.59
JI       0.009 [0.67]   0.011 [0.62]   0.65
MD       0.022 [0.63]   0.301 [0.51]   0.57
DM       <0.001 [0.60]  <0.001 [0.55]  0.58
MU       0.006 [0.59]   0.011 [0.57]   0.58
MS       0.780 [0.54]   0.006 [0.57]   0.56
XD       <0.001 [0.60]  0.005 [0.55]   0.58
TD       <0.001 [0.73]  <0.001 [0.67]  0.70
TE       <0.001 [0.69]  0.005 [0.61]   0.65

The improvement of LSTM over BoW and Doc2vec is significant after applying Bonferroni correction, with an effect size greater than 0.5 in 24/32 cases and Aiu greater than 0.5 in all projects (see Table 8).

The proposed LSTM-based approach is effective in automatically learning semantic features representing issue descriptions, which improves story-point estimation.

RQ4: Cross-Project Estimation
We performed sixteen sets of cross-project estimation experiments to test two settings: (i) within-repository, where both the source and target projects (e.g., Apache Mesos and Apache Usergrid) were from the same repository, and pre-training was done using only the source projects, not the target projects; and (ii) cross-repository, where the source project (e.g., Appcelerator Studio) was in a different repository from the target project (e.g., Apache Usergrid), and pre-training was done using only the source project.

Table 9 shows the performance of our Deep-SE model and ABE0 for cross-project estimation (see the distribution of the Absolute Error in Appendix A.4, available in the online supplemental material). We also used a benchmark of within-project estimation where older issues of the target project were used for training (see Table 3). In all cases, the proposed approach performed worse when used for cross-project estimation than when used for within-project estimation (e.g., on average a 20.75 percent reduction in performance for within-repository and 97.92 percent for cross-repository). However, our approach outperformed the cross-project baseline (i.e., ABE0) in all cases: it achieved 2.33 and 3.82 MAE in the within- and cross-repository settings, while ABE0 achieved 2.60 and 4.55 MAE. The improvement of our approach over ABE0 is still significant after applying p-value correction, with an effect size greater than 0.5 in 14/16 cases.
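The Wilcoxon signed-rank test and Vargha-Delaney effect size used in these comparisons can be reproduced along the following lines. This is a sketch using SciPy on toy paired errors; the numbers are illustrative, not taken from the paper:

```python
from itertools import product

import numpy as np
from scipy.stats import wilcoxon

def a12(errors_x, errors_y):
    # Vargha-Delaney A12: probability that a randomly chosen absolute
    # error from method Y exceeds one from method X (ties count half).
    # With "lower is better" errors, values above 0.5 favour method X.
    gt = sum(y > x for x, y in product(errors_x, errors_y))
    eq = sum(y == x for x, y in product(errors_x, errors_y))
    return (gt + 0.5 * eq) / (len(errors_x) * len(errors_y))

# Toy paired absolute errors on the same test issues.
deep_se = np.array([0.5, 1.0, 0.2, 0.8, 1.5, 0.3])
baseline = np.array([0.9, 1.6, 0.55, 1.0, 1.65, 0.8])

stat, p = wilcoxon(deep_se, baseline)  # paired, non-parametric
effect = a12(deep_se, baseline)
```

In a full analysis, the resulting p-values would then be adjusted for multiple comparisons (e.g., with a Bonferroni correction) before declaring significance, as is done in the tables above.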
TABLE 10
Evaluation Results on the Adjusted Story Points (the Best Results Are Highlighted in Bold)

Proj Method     MAE  MdAE    SA   Proj Method     MAE  MdAE    SA
ME   Deep-SE   0.27  0.03  76.58  JI   Deep-SE   0.60  0.51  63.20
     lstm+rf   0.34  0.15  70.43       lstm+rf   0.74  0.79  54.42
     bow+rf    0.36  0.16  68.82       bow+rf    0.66  0.53  58.99
     d2v+rf    0.35  0.15  69.87       d2v+rf    0.70  0.53  56.99
     lstm+svm  0.33  0.10  71.20       lstm+svm  0.94  0.89  41.97
     lstm+atlm 0.33  0.14  70.97       lstm+atlm 0.89  0.89  45.18
     lstm+lr   0.37  0.21  67.68       lstm+lr   0.89  0.89  45.18
     mean      1.12  1.07   3.06       mean      1.31  1.71  18.95
     median    1.05  1.00   8.87       median    1.60  2.00   1.29
UG   Deep-SE   0.07  0.01  93.50  MD   Deep-SE   2.56  2.29  31.83
     lstm+rf   0.08  0.00  92.59       lstm+rf   3.45  3.55   8.24
     bow+rf    0.11  0.01  90.31       bow+rf    3.32  3.27  11.54
     d2v+rf    0.10  0.01  91.22       d2v+rf    3.39  3.48   9.70
     lstm+svm  0.15  0.10  86.38       lstm+svm  3.12  3.07  16.94
     lstm+atlm 0.15  0.08  86.25       lstm+atlm 3.48  3.49   7.41
     lstm+lr   0.15  0.08  86.25       lstm+lr   3.57  3.28   4.98
     mean      1.04  0.98   4.79       mean      3.60  3.67   4.18
     median    1.06  1.00   2.64       median    2.95  3.00  21.48
AS   Deep-SE   0.53  0.20  69.16  DM   Deep-SE   2.30  1.43  31.99
     lstm+rf   0.56  0.45  67.49       lstm+rf   2.83  2.59  16.23
     bow+rf    0.56  0.49  67.39       bow+rf    2.83  2.63  16.33
     d2v+rf    0.56  0.46  67.37       d2v+rf    2.92  2.80  13.80
     lstm+svm  0.55  0.32  68.34       lstm+svm  2.45  1.78  27.56
     lstm+atlm 0.57  0.46  66.87       lstm+atlm 2.83  2.57  16.28
     lstm+lr   0.57  0.49  67.12       lstm+lr   2.83  2.57  16.28
     mean      1.18  0.79  31.89       mean      3.27  3.41   3.25
     median    1.35  1.00  21.54       median    2.61  2.00  22.94
AP   Deep-SE   0.92  0.86  21.95  MU   Deep-SE   0.68  0.59  63.83
     lstm+rf   0.99  0.87  16.23       lstm+rf   0.70  0.55  63.01
     bow+rf    1.00  0.87  15.33       bow+rf    0.70  0.57  62.79
     d2v+rf    0.99  0.86  15.94       d2v+rf    0.71  0.57  62.17
     lstm+svm  1.12  0.92   5.26       lstm+svm  0.70  0.62  62.62
     lstm+atlm 1.03  0.84  12.63       lstm+atlm 0.93  0.74  50.77
     lstm+lr   1.17  1.05   1.14       lstm+lr   0.79  0.61  58.00
     mean      1.15  0.64   2.49       mean      1.21  1.51  35.86
     median    0.94  1.00  20.29       median    1.64  2.00  12.80
TI   Deep-SE   0.59  0.17  56.53  MS   Deep-SE   0.86  0.65  56.82
     lstm+rf   0.72  0.56  46.22       lstm+rf   0.91  0.76  54.37
     bow+rf    0.73  0.58  46.10       bow+rf    0.89  0.93  55.48
     d2v+rf    0.72  0.56  46.17       d2v+rf    0.90  0.69  54.66
     lstm+svm  0.73  0.62  45.74       lstm+svm  0.94  0.78  52.91
     lstm+atlm 0.73  0.57  45.86       lstm+atlm 0.99  0.87  50.45
     lstm+lr   0.73  0.56  45.77       lstm+lr   0.99  0.87  50.45
     mean      1.32  1.56   1.57       mean      1.23  0.62  38.49
     median    0.86  1.00  36.04       median    1.44  1.00  27.83
DC   Deep-SE   0.48  0.48  55.77  XD   Deep-SE   0.35  0.08  80.66
     lstm+rf   0.49  0.49  55.02       lstm+rf   0.44  0.37  75.78
     bow+rf    0.49  0.48  54.76       bow+rf    0.45  0.38  75.33
     d2v+rf    0.50  0.50  53.59       d2v+rf    0.45  0.32  75.31
     lstm+svm  0.49  0.43  55.24       lstm+svm  0.38  0.20  79.16
     lstm+atlm 0.53  0.47  51.02       lstm+atlm 0.92  0.76  49.05
     lstm+lr   0.53  0.47  51.02       lstm+lr   0.45  0.40  75.33
     mean      1.07  1.49   1.29       mean      1.03  1.28  43.06
     median    0.58  1.00  46.76       median    0.75  1.00  58.74
BB   Deep-SE   0.41  0.12  72.00  TD   Deep-SE   0.82  0.64  53.36
     lstm+rf   0.43  0.38  70.37       lstm+rf   0.84  0.68  52.65
     bow+rf    0.45  0.40  69.33       bow+rf    0.88  0.65  50.30
     d2v+rf    0.49  0.45  66.34       d2v+rf    0.86  0.70  51.46
     lstm+svm  0.42  0.21  71.21       lstm+svm  0.83  0.62  53.24
     lstm+atlm 0.47  0.41  67.53       lstm+atlm 0.83  0.58  52.82
     lstm+lr   0.47  0.41  67.53       lstm+lr   0.90  0.74  48.88
     mean      1.15  0.76  20.92       mean      1.29  1.42  27.20
     median    1.39  1.00   4.50       median    0.99  1.00  44.17
CV   Deep-SE   1.15  0.79  23.29  TE   Deep-SE   0.40  0.05  74.58
     lstm+rf   1.16  1.05  22.55       lstm+rf   0.47  0.46  70.39
     bow+rf    1.22  1.10  18.95       bow+rf    0.48  0.48  69.52
     d2v+rf    1.20  1.09  20.30       d2v+rf    0.48  0.48  69.41
     lstm+svm  1.22  1.15  18.77       lstm+svm  0.45  0.41  71.77
     lstm+atlm 1.47  1.28   2.22       lstm+atlm 0.49  0.48  69.14
     lstm+lr   1.47  1.28   2.22       lstm+lr   0.49  0.48  69.14
     mean      1.27  1.11  15.18       mean      0.99  0.60  37.28
     median    1.29  1.00  13.92       median    1.39  1.00  12.09

MAE and MdAE: the lower the better; SA: the higher the better.

TABLE 11
Mean Absolute Error (MAE) and Comparison of Deep-SE and the Porru et al. Approach Using the Wilcoxon Test and ÂXY Effect Size (in Brackets)

Proj    Deep-SE  Porru  Deep-SE vs Porru
APSTUD  2.67     5.69   <0.001 [0.63]
DNN     0.47     1.08   <0.001 [0.74]
MESOS   0.76     1.23   0.003 [0.70]
MULE    2.32     3.37   <0.001 [0.61]
NEXUS   0.21     0.39   0.005 [0.67]
TIMOB   1.44     1.76   0.047 [0.57]
TISTUD  1.04     1.28   <0.001 [0.58]
XD      1.00     1.86   <0.001 [0.69]
avg     1.24     2.08

These results confirm a universal understanding [25] in agile development that story point estimation is specific to teams and projects. Since story points are a relative measure, it is not uncommon for two different same-sized teams to give different estimates for the same user story. For example, team A may estimate 5 story points for user story UC1 while team B gives 10 story points. However, this does not necessarily mean that team B would do more work to complete UC1 than team A. It more likely means that team B's baselines are twice as large as team A's; i.e., for a "baseline" user story requiring one-fifth of the effort of UC1, team A would give 1 story point while team B would give 2. Hence, historical estimates are more valuable for within-project estimation, which is demonstrated by this result.

Given the specificity of story points to teams and projects, our proposed approach is more effective for within-project estimation.

RQ5: Adjusted/Normalized Story Points
Table 10 shows the results of our Deep-SE and the other baseline methods in predicting the normalized story points. Deep-SE performs well across all projects. Deep-SE improved MAE between 2.13 and 93.40 percent over the Mean method, between 9.45 and 93.27 percent over the Median method, between 7.02 and 53.33 percent over LSTM+LR, between 1.20 and 61.96 percent over LSTM+ATLM, between 1.20 and 53.33 percent over LSTM+SVM, between 4.00 and 30.00 percent over Doc2vec+RF, between 2.04 and 36.36 percent over BoW+RF, and between 0.86 and 25.80 percent over LSTM+RF. The best result is obtained on the Usergrid project (UG): 0.07 MAE, 0.01 MdAE, and 93.50 SA. We note, however, that the adjusted story points benefit all methods, since the adjustment narrows the gap between the minimum and maximum values and tightens the distribution of the story points.

Our proposed approach still outperformed the other techniques in estimating the new adjusted story points.

RQ6: Comparing Deep-SE Against the Existing Approach
We applied our approach, Deep-SE, and the approach of Porru et al. on their dataset, which consists of eight projects. Table 11 shows the evaluation results in MAE and the comparison of Deep-SE and the Porru et al. approach. The distribution of the Absolute Error is reported in Appendix A.5, available in the online supplemental material. Deep-SE
[17], [18]), and multi-objective evolutionary approaches (e.g., [19]). It is, however, likely that no single method will be the best performer for all project types [10], [20], [91]. Hence, some recent work (e.g., [20]) proposes to combine the estimates from multiple estimators. Hybrid approaches (e.g., [21], [22]) combine expert judgements with the available data, similarly to the notions of our proposal.

While most existing work focuses on estimating a whole project, little work has been done in building models specifically for agile projects. Today's agile, dynamic and change-driven projects require different approaches to planning and estimating [24]. Some recent approaches leverage machine learning techniques to support effort estimation for agile projects. Recently, the work in [64] proposed an approach which extracts TF-IDF features from issue descriptions to develop a story-point estimation model. A univariate feature selection technique is then applied to the extracted features, which are fed into classifiers (e.g., SVM). In addition, the work in [92] applied COSMIC Function Points (CFP) [93] to estimate the effort for completing an agile project. The work in [94] developed an effort prediction model for iterative software development settings using regression models and neural networks. Differing from traditional effort estimation models, this model is built after each iteration (rather than at the end of a project) to estimate effort for the next iteration. The work in [95] built a Bayesian network model for effort prediction in software projects which adhere to the agile Extreme Programming method. Their model, however, relies on several parameters (e.g., process effectiveness and process improvement) that require learning and extensive fine tuning. Bayesian networks are also used in [96] to model dependencies between different factors (e.g., sprint progress and sprint planning quality influence product quality) in Scrum-based software development projects in order to detect problems in the project. Our work specifically focuses on estimating issues with story points using deep learning techniques to automatically learn semantic features representing the actual meaning of issue descriptions, which is the key difference from previous work. Previous research (e.g., [97], [98], [99], [100]) has also been done in predicting the elapsed time for fixing a bug or the delay risk of resolving an issue. However, effort estimation using story points is the preferred practice in agile development.

LSTM has shown successes in many applications such as language models [35], speech recognition [36] and video analysis [37]. Our Deep-SE is generic in that it maps text to a numerical score or a class, and it can thus be used for other tasks, e.g., mapping a movie review to a score, assigning scores to essays, or sentiment analysis. Deep learning has recently attracted increasing interest in software engineering. Our previous work [101] proposed a generic deep learning framework based on LSTM for modeling software and its development process. White et al. [102] employed recurrent neural networks to build a language model for source code. Their later work [103] extended these RNN models for detecting code clones. The work in [104] also used RNNs to build a statistical model for code completion. Our recent work [105] used LSTM to build a language model for code and demonstrated the improvement of this model over the one using RNNs. Gu et al. [106] used a special RNN Encoder-Decoder, which consists of an encoder RNN to process the input sequence and a decoder RNN with attention to generate the output sequence. This model takes as input an API-related natural language query and returns API usage sequences. The work in [107] also uses an RNN Encoder-Decoder, but for fixing common errors in C programs. The Deep Belief Network [108] is another common deep learning model which has been used in software engineering, e.g., for building defect prediction models [109], [110].

7 CONCLUSION
In this paper, we have contributed to the research community a dataset for story point estimation, sourced from 16 large and diverse software projects. We have also proposed a deep learning-based, fully end-to-end prediction system for estimating story points, freeing users from manually designing features from the textual description of issues. A key novelty of our approach is the combination of two powerful deep learning architectures: Long Short-Term Memory (to learn a vector representation for issue reports) and Recurrent Highway Network (for building a deep representation). The proposed approach has consistently outperformed three common baselines and four alternatives according to our evaluation results. Compared against the Mean and Median techniques, the proposed approach improved MAE by 34.06 and 26.77 percent, respectively, averaging across the 16 projects we studied. Compared against the BoW and Doc2Vec techniques, our approach improved MAE by 23.68 and 17.90 percent. These are significant results in the effort estimation literature. A major part of these improvements was brought by our use of the deep LSTM architecture to model the textual description of an issue. The use of recurrent highway networks (on top of LSTM) has also improved the predictive performance, though not as significantly as the LSTM itself (especially for those projects which have a very small number of issues).

Our future work would involve expanding our study to commercial software projects and other large open source projects to further evaluate our proposed method. We also consider performing team analytics (e.g., features characterizing a team) to model team changes over time and feed them into our prediction model. We also plan to investigate how to learn a semantic representation of the codebase and use it as another input to our model. Furthermore, we will look into experimenting with a sliding-window setting to explore incremental learning. In addition, we will investigate how to best use an issue's metadata (e.g., priority and type) while maintaining the end-to-end nature of our entire model. Our future work also involves comparing our use of the LSTM model against other state-of-the-art models of natural language such as paragraph2vec [59] or Convolutional Neural Networks [111]. We have discussed (informally) our work with several software developers who have been practising agile and estimating story points. They all agreed that our prediction system could be useful in practice. However, to make such a claim, we need to implement it into a tool and perform a user study. Hence, we would like to empirically evaluate the impact of our prediction system for story point estimation in practice by project managers and/or software developers. This would involve developing the model into a tool (e.g., a JIRA plugin) and then organising trial use in practice. This is an important part of our future work to confirm the ultimate benefits of our approach in general.
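The deep-representation step named above, a Recurrent Highway Network refining the LSTM-derived issue vector, follows the layer recurrence h_{l+1} = alpha_l * h_l + (1 - alpha_l) * s_l(h_l) given earlier in the paper. The following is only an illustrative forward-pass sketch; the weights, dimensions, and number of layers are random stand-ins, not the trained Deep-SE model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_step(h, W_s, b_s, W_a, b_a):
    # One highway layer: the gate alpha decides, per dimension, how much
    # of the previous representation h to carry through unchanged versus
    # how much of the non-linear transform s(h) to mix in:
    #   h_next = alpha * h + (1 - alpha) * s(h)
    alpha = sigmoid(h @ W_a + b_a)   # carry gate
    s = np.tanh(h @ W_s + b_s)       # candidate transform
    return alpha * h + (1.0 - alpha) * s

# Illustrative dimensions: a 50-d document vector (as if pooled from the
# LSTM layer), refined through 10 highway layers with random weights.
dim, n_layers = 50, 10
h = rng.standard_normal(dim)         # stand-in for the LSTM-pooled vector
for _ in range(n_layers):
    W_s, b_s = rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim)
    W_a, b_a = rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim)
    h = highway_step(h, W_s, b_s, W_a, b_a)
# h would then feed a linear regressor that outputs the story point estimate.
```

The carry gate is what lets such networks stack many layers without the gradient degradation of plain feed-forward stacks, which is why depth helps here beyond the LSTM alone.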
[48] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, "On the number of linear regions of deep neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2924–2932.
[49] M. Bianchini and F. Scarselli, "On the complexity of neural network classifiers: A comparison between shallow and deep architectures," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 8, pp. 1553–1565, Aug. 2014.
[50] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Proc. 28th Int. Conf. Neural Inf. Process. Syst. (NIPS), 2015, pp. 2377–2385.
[51] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.neunet.2014.09.003
[52] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," J. Mach. Learn. Res., vol. 13, pp. 307–361, 2012.
[53] The Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.0, 2016. [Online]. Available: http://deeplearning.net/software/theano
[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[55] M. Shepperd and S. MacDonell, "Evaluating prediction systems in software project estimation," Inf. Softw. Technol., vol. 54, no. 8, pp. 820–827, 2012. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.12.008
[56] R. Moraes, J. F. Valiati, and W. P. Gavião Neto, "Document-level sentiment classification: An empirical comparison between SVM and ANN," Expert Syst. Appl., vol. 40, no. 2, pp. 621–633, 2013.
[57] P. A. Whigham, C. A. Owen, and S. G. MacDonell, "A baseline model for software effort estimation," ACM Trans. Softw. Eng. Methodology, vol. 24, no. 3, 2015, Art. no. 20.
[58] P. Tirilly, V. Claveau, and P. Gros, "Language modeling for bag-of-visual words image categorization," in Proc. 2008 Int. Conf. Content-Based Image Video Retrieval, 2008, pp. 249–258.
[59] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proc. 31st Int. Conf. Mach. Learn., vol. 32, 2014, pp. 1188–1196.
[60] E. Kocaguneli and T. Menzies, "Exploiting the essential assumptions of analogy-based effort estimation," IEEE Trans. Softw. Eng., vol. 38, no. 2, pp. 425–438, Mar./Apr. 2012.
[61] E. Kocaguneli, T. Menzies, and E. Mendes, "Transfer learning in effort estimation," Empirical Softw. Eng., vol. 20, no. 3, pp. 813–843, 2015.
[62] E. Mendes, I. Watson, and C. Triggs, "A comparative study of cost estimation models for web hypermedia applications," Empirical Softw. Eng., vol. 8, pp. 163–196, 2003.
[63] Y. F. Li, M. Xie, and T. N. Goh, "A study of project selection and feature weighting for analogy based software cost estimation," J. Syst. Softw., vol. 82, no. 2, pp. 241–252, Feb. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2008.06.001
[64] S. Porru, A. Murgia, S. Demeyer, M. Marchesi, and R. Tonelli, "Estimating story points from issue reports," in Proc. 12th Int. Conf. Predictive Models Data Anal. Softw. Eng., 2016, Art. no. 2.
[65] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc., 1986.
[66] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, "A simulation study of the model evaluation criterion MMRE," IEEE Trans. Softw. Eng., vol. 29, no. 11, pp. 985–995, Nov. 2003.
[67] B. Kitchenham, L. Pickard, S. MacDonell, and M. Shepperd, "What accuracy statistics really measure," IEE Proc. Softw., vol. 148, no. 3, pp. 81–85, Jun. 2001.
[68] M. Korte and D. Port, "Confidence in software cost estimation results based on MMRE and PRED," in Proc. 4th Int. Workshop Predictor Models Softw. Eng., 2008, pp. 63–70.
[69] D. Port and M. Korte, "Comparative studies of the model evaluation criterions MMRE and PRED in software cost estimation research," in Proc. 2nd ACM-IEEE Int. Symp. Empirical Softw. Eng. Meas., 2008, pp. 51–60.
[71] K. Muller, "Statistical power analysis for the behavioral sciences," Technometrics, vol. 31, no. 4, pp. 499–500, 1989.
[72] H. H. Abdi, "The Bonferroni and Sidak corrections for multiple comparisons," Encyclopedia Meas. Statist., vol. 1, pp. 1–9, 2007. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.8747&rep=rep1&type=pdf
[73] A. Vargha and H. D. Delaney, "A critique and improvement of the CL common language effect size statistics of McGraw and Wong," J. Educational Behavioral Statist., vol. 25, no. 2, pp. 101–132, 2000. [Online]. Available: http://jeb.sagepub.com/cgi/doi/10.3102/10769986025002101
[74] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in Proc. 33rd Int. Conf. Softw. Eng., 2011, pp. 1–10.
[75] L. van der Maaten and G. Hinton, "Visualizing high-dimensional data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[76] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" J. Mach. Learn. Res., vol. 11, pp. 625–660, Mar. 2010.
[77] J. Weston, F. Ratle, and R. Collobert, "Deep learning via semi-supervised embedding," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1168–1175. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390303
[78] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2078186
[79] D. Zwillinger and S. Kokoska, CRC Standard Probability and Statistics Tables and Formulae. Boca Raton, FL, USA: CRC Press, 1999.
[80] J. McCarthy, "From here to human-level AI," Artif. Intell., vol. 171, no. 18, pp. 1174–1182, 2007.
[81] A. Arcuri and L. Briand, "A Hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering," Softw. Testing Verification Rel., vol. 24, no. 3, pp. 219–250, 2014.
[82] T. Menzies and M. Shepperd, "Special issue on repeatable results in software engineering prediction," Empirical Softw. Eng., vol. 17, no. 1/2, pp. 1–17, 2012.
[83] T. Menzies, et al., "The PROMISE repository of empirical software engineering data," North Carolina State University, Department of Computer Science, 2015. [Online]. Available: http://openscience.us/repo
[84] P. L. Braga, A. L. I. Oliveira, and S. R. L. Meira, "Software effort estimation using machine learning techniques with robust confidence intervals," in Proc. 7th Int. Conf. Hybrid Intell. Syst., 2007, pp. 352–357.
[85] Y. Jia, et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[86] A. Karpathy, J. Johnson, and L. Fei-Fei, "Visualizing and understanding recurrent networks," arXiv:1506.02078, 2015, pp. 1–12. [Online]. Available: http://arxiv.org/abs/1506.02078
[87] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 1135–1144. [Online]. Available: http://arxiv.org/abs/1602.04938
[88] M. Jorgensen, "A review of studies on expert estimation of software development effort," J. Syst. Softw., vol. 70, no. 1/2, pp. 37–60, 2004.
[89] M. Jorgensen and T. M. Gruschke, "The impact of lessons-learned sessions on effort estimation and uncertainty assessments," IEEE Trans. Softw. Eng., vol. 35, no. 3, pp. 368–383, May/Jun. 2009.
[90] A. Panda, S. M. Satapathy, and S. K. Rath, "Empirical validation of neural network models for agile software effort estimation based on story points," Procedia Comput. Sci., vol. 57, pp. 772–781, 2015.
[91] F. Collopy, "Difficulty and complexity as factors in software effort estimation," Int. J. Forecasting, vol. 23, no. 3, pp. 469–471, 2007.
[92] R. Djouab, C. Commeyne, and A. Abran, "Effort estimation with story points and COSMIC function points - an industry case study," pp. 25–36, 2008. [Online]. Available: http://cosmic-sizing.org/wp-content/uploads/2016/03/Estimation-model-v-Print-Format-adapter.pdf
[93] ISO/IEC JTC 1/SC 7, INTERNATIONAL STANDARD ISO/IEC
[70] T. Menzies, E. Kocaguneli, B. Turhan, L. Minku, and F. Peters, Software Engineering COSMIC: A Functional Size Measurement
Sharing Data and Models in Software Engineering. San Mateo, CA, Method, vol. 2011, 2011. [Online]. Available: https://www.iso.
USA: Morgan Kaufmann, 2014. org/standard/54849.html
656 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 45, NO. 7, JULY 2019
Hoa Khanh Dam received the bachelor of computer science degree from the University of Melbourne, Australia, and the master's and PhD degrees in computer science from RMIT University. He is a senior lecturer in the School of Computing and Information Technology, University of Wollongong (UOW), Australia, and an associate director of the Decision Systems Lab at UOW, heading its Software Engineering Analytics research program. His research interests lie primarily at the intersection of software engineering, business process management, and service-oriented computing, focusing on areas such as software engineering analytics, process analytics, and service analytics. His research has won multiple Best Paper Awards (at WICSA, APCCM, and ASWEC) and an ACM SIGSOFT Distinguished Paper Award (at MSR).
Truyen Tran received the BSc degree from the University of Melbourne and the PhD degree in computer science from Curtin University, in 2001 and 2008, respectively. He is a senior lecturer with Deakin University, Australia. His research interests include AI and its applications to biomedicine, sciences, and software. He has won multiple paper awards and prizes, including UAI 2009, CRESP 2014, Kaggle 2014, PAKDD 2015, ACM SIGSOFT 2015, and ADMA 2016.

Trang Pham received the bachelor's degree in computer science from Vietnam National University, in 2014. She is working toward the PhD degree at Deakin University. Currently, her research focuses on deep learning for structured data. She has worked on different types of structured data such as electronic medical records, molecular and networked data, and software code.
Aditya Ghose received the bachelor of engineering degree in computer science and engineering from Jadavpur University, Kolkata, India, and the MSc and PhD degrees in computing science from the University of Alberta, Canada. He also spent parts of his PhD candidature at the Beckman Institute, University of Illinois at Urbana-Champaign, and at the University of Tokyo. He is a professor of computer science with the University of Wollongong. He leads a team conducting research into knowledge representation, agent systems, services, business process management, software engineering, and optimization, and draws inspiration from the cross-fertilization of ideas across this spread of research areas. He works closely with some of the leading global IT firms. He is president of the Service Science Society of Australia and served as vice-president of CORE (2010-2014), Australia's apex body for computing academics.
Morakot Choetkiertikul received the BS and MS degrees in computer science from the Faculty of Information and Communication Technology (ICT), Mahidol University, Thailand. He is working toward the PhD degree in computer science and software engineering in the Faculty of Engineering and Information Sciences (EIS), University of Wollongong (UOW), Australia. He is a part of the Decision Systems Lab (DSL). His research interests include empirical software engineering, software engineering analytics, mining software repositories, and software process improvement. For more details, see his home page: http://www.dsl.uow.edu.au/sasite/.

Tim Menzies received the PhD degree from the University of New South Wales, in 1995. He is a full professor in computer science at North Carolina State University, where he explores SE, data mining, AI, search-based SE, and open access science. He is the author of more than 250 refereed publications and co-founder of the PROMISE conference series devoted to reproducible experiments in SE (http://tiny.cc/seacraft). He also serves as associate editor of many journals: the IEEE Transactions on Software Engineering (2010 to 2016), the ACM Transactions on Software Engineering and Methodology, Empirical Software Engineering, the Automated Software Engineering Journal, the Big Data Journal, Information and Software Technology, IEEE Software, and the Software Quality Journal. He has served as co-general chair of ICSME'16 and co-PC chair for ASE'12, ICSE'15, and SSBSE'17. For more, see http://menzies.us.