(ICSE17) Semantically Enhanced Software Traceability Using Deep Learning Techniques
Abstract—In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand the semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus, and the RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly out-performed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.

Keywords—Traceability, Deep Learning, Recurrent Neural Network, Semantic Representation.

I. Introduction

Requirements traceability plays an essential role in the software development process. Defined as "the ability to describe and follow the life of a requirement in both a forwards and backwards direction through periods of ongoing refinement and iteration" [26], traceability supports a diverse set of software engineering activities including change impact analysis, regression test selection, cost prediction, and compliance verification [25]. In high-dependability systems, regulatory standards, such as the US Federal Aviation Authority's (FAA) DO178b/c [22], prescribe the need for trace links to be established and maintained between hazards, faults, requirements, design, code, and test cases in order to demonstrate that a system is safe for use [38], [23]. Unfortunately, the tracing task is arduous to perform and error-prone [46], even when industrial tools are used to manually create links or to capture them as a byproduct of the development process [14]. In practice, trace links are often incomplete and inaccurate [16], even in safety-critical systems [45], [56].

To address these problems, researchers have proposed and developed solutions for automating the task of creating and maintaining trace links [1], [17], [34]. Solutions have included information retrieval approaches [17], [18], [5], machine learning [47], [30], [51], heuristic techniques [64], [28], and AI swarming algorithms [65]. Other approaches, especially in the area of feature location [20], require additional information obtained from runtime execution traces. Results have been mixed, especially when applied to industrial-sized datasets, where acceptable recall levels above 90% can often only be achieved at extremely low levels of precision [43].

One of the primary reasons that automated approaches have underperformed is the term mismatch that often exists between pairs of related artifacts [10]. To illustrate this we draw on an example from the Positive Train Control (PTC) domain. PTC is a communication-based train control system designed to ensure that trains follow directives in order to prevent accidents from occurring [68]. The requirement stating that "The BOS Administrative Toolset shall allow the Authorized Administrator to view an On-board's last reported On-board Software Version, including the associated repository name, MD5, and whether the fileset is preferred or acceptable." is associated with the design artifact stating that "The Operational Data Panel is used to provide information about the current PTC operations in a subdivision". Recognizing and establishing this link requires non-trivial knowledge of domain concepts – for example, understanding that the BOS Administrative Toolset contains the Operational Data Panel, that each locomotive contains an On-board unit for PTC operation, and that the Operational Data Panel displays information about locomotives, such as the On-board Software Version, to the BOS Authorized Administrator. This link would likely be missed by popular trace retrieval algorithms such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI), and Latent Dirichlet Allocation (LDA), which all represent artifacts as bags of words and therefore lose the artifacts' embedded semantics. It would also be missed by techniques that incorporate phrasing without understanding their conceptual associations [3], [15]. In fact, most current techniques lack the sophistication needed to reason about semantic associations between artifacts and therefore fail to establish trace links when there is little meaningful overlap in the use of terms.

In our prior work we developed Domain-Contextualized Intelligent Traceability (DoCIT) [29] as a proof-of-concept solution to investigate the integration of domain knowledge into the tracing process. We demonstrated that for the domain …
where W, U and b are the affine transformation parameters, and tanh is the hyperbolic tangent function: tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

Figure 2 (a) illustrates a typical LSTM unit. Using the retained memory cell state and the gating mechanism, the LSTM unit "remembers" information until it is erased by the forget gate; …
… much of the current output is updated using h̃_t and h_{t−1}:

h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t    (7)
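As a concrete illustration, the following is a minimal NumPy sketch of one GRU step ending in the interpolation of Eq. (7). The reset- and update-gate formulas follow the standard GRU of Cho et al. [12]; the parameter names are illustrative and not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds affine parameters W, U, b per gate."""
    u = sigmoid(p["W_u"] @ x_t + p["U_u"] @ h_prev + p["b_u"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # Candidate state built from the current input and the reset-gated memory.
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return (1.0 - u) * h_prev + u * h_tilde                      # Eq. (7)
```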
In addition to modifying the structure within a single RNN unit, the structure of the overall RNN network can be varied. For instance, multi-layered RNNs [52] stack more than one RNN unit at each time step [59] with the aim of extracting more abstract features from the input sequence. In contrast, bi-directional RNNs [60] process sequential data in both forward and backward directions at the same time; this enables the output to be influenced by both past and future data in the sequence. In this study, we also explored the use of two-layered RNNs and bidirectional RNNs for generating trace links.
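Both structural variants are single configuration switches in modern toolkits. The snippet below is only an illustrative PyTorch stand-in (the paper's implementation uses the Torch/Lua framework), with made-up dimensions:

```python
import torch.nn as nn

# Illustrative dimensions only: 50-dim word vectors, 30 hidden units.
two_layer_gru = nn.GRU(input_size=50, hidden_size=30, num_layers=2)   # multi-layered RNN
bi_gru = nn.GRU(input_size=50, hidden_size=30, bidirectional=True)    # forward + backward passes
```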
The design of the neural network architecture is shown in Figure 3. Given the textual content of a source artifact A_s and a target artifact A_t, each word in A_s and A_t is first mapped onto its vector representation through the Word Embedding layer. Such mappings are trained from the domain corpus using the Skip-gram model introduced in Section II-A. The vectors of words in the source artifact, s_1, s_2, ..., s_m, are then sent to the RNN layers sequentially and output as a single vector v_s representing its semantic information. In the case of the bidirectional RNN, the word vectors are also sent in reverse order as s_m, s_{m−1}, ..., s_1. The target semantic vector v_t is generated in the same way using RNN layers. Finally, these two vectors are compared in the Semantic Relation Evaluation layers.

The Semantic Relation Evaluation layers in our tracing network adopt the structure proposed by Tai et al. [66], targeted to perform semantic entailment classification tasks for sentence pairs. The overall calculation of this part of the network can be represented as:

r_pmul = v_s ⊙ v_t
r_sub = |v_s − v_t|                                        (8)
r = σ(W^r r_pmul + U^r r_sub + b^r)
p_tracelink = softmax(W^p r + b^p)

where

softmax(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, for j = 1, ..., K    (9)

Here, ⊙ is the point-wise multiplication operator used to compare the direction of the source and target vectors on each dimension. The absolute vector subtraction result, r_sub, represents the distance between the two vectors in each dimension. The network then uses a hidden sigmoid layer to integrate r_pmul and r_sub and output a single vector to represent their semantic similarity. Finally, the output softmax layer uses the result to produce the probability that a valid trace link exists; the result of a softmax function is a K-dimensional vector of real values in the range (0, 1) that add up to 1 (K = 2 in this case).

A concrete tracing network is built upon this architecture and is further configured by a set of network settings. Those settings specify the type of RNN unit (i.e. GRU or LSTM), the number of hidden dimensions in the RNN units and the Semantic Relation Evaluation layers, and other RNN variables such as the number of RNN layers and whether to use a bidirectional RNN. To address our first research question (RQ1) we explored several different configurations. We describe how we optimized network settings in Section IV-B. The tracing network is implemented on the Torch framework (http://torch.ch), and the source code is available at https://github.com/jin-guo/TraceNN.
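The calculation of Eq. (8) is straightforward to express in code. The sketch below is a PyTorch approximation, not the authors' Torch/Lua implementation; folding W^r and U^r into a single linear layer over the concatenated inputs is mathematically equivalent to applying them separately, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SemanticRelationEvaluation(nn.Module):
    """Compares source/target semantic vectors and predicts link probability."""

    def __init__(self, sem_dim=30, hidden_dim=10):
        super().__init__()
        self.integration = nn.Linear(2 * sem_dim, hidden_dim)  # plays the role of W^r, U^r, b^r
        self.output = nn.Linear(hidden_dim, 2)                 # W^p, b^p with K = 2 classes

    def forward(self, v_s, v_t):
        r_pmul = v_s * v_t               # point-wise multiplication: directional agreement
        r_sub = torch.abs(v_s - v_t)     # per-dimension distance
        r = torch.sigmoid(self.integration(torch.cat([r_pmul, r_sub], dim=-1)))
        # Eq. (9): probabilities over {link, non-link} summing to 1.
        return torch.softmax(self.output(r), dim=-1)
```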
B. Training the Tracing Network

A powerful network is only useful when it can be properly trained using existing data and when it is generalizable to unseen data. To train the tracing network, we use the regularized negative log likelihood as our objective loss function to be minimized. This objective function is commonly used in categorical prediction models [7] and can be written as:

J(θ) = −(1/N) Σ_{i=1}^{N} log P(Y = y_i | x_i, θ) + (λ/2) ‖θ‖₂²    (10)

where θ indicates the network parameters that need to be trained, N is the total number of examples in the training data, x_i is the input of the i-th training example, and y_i is the actual category of that example (i.e. link or non-link); as a result, P(Y = y_i | x_i, θ) represents the network's prediction on the correct category given the current input and parameters. The second part of the loss function is an L2 parameter regularization that prevents overfitting, where ‖θ‖₂ is the Euclidean norm of θ, and λ controls the strength of the regularization.

Based on this loss function, we used a stochastic gradient descent method [67] to update the network parameters. According to this method, a typical training process is comprised of a number of epochs. Each epoch iterates through all of the training data one time to train the network; as such, the overall training process uses all the training data several times until the objective loss is sufficiently small or fails to decrease further. In each epoch, the training data is further randomly divided into a number of "mini batches;" each contains one or more training datapoints. After each batch is processed, a gradient of the parameters is calculated based on the loss function. The network then updates its parameters based on this gradient and a "learning rate" that specifies how fast the parameters move along the gradient direction. During training, we adopted an adaptive learning rate procedure [67] to adjust the learning rate based on the current training performance. To help the network converge, we also decreased the learning rate after each epoch until epoch τ, such that the learning rate ε_k at epoch k is determined by:

ε_k = (1 − α) ε_0 + α ε_τ    (11)

where ε_0 is the initial learning rate and α = k/τ. In our experiment, ε_τ is set to ε_0/100, and τ is set to 500.

Based on these general methods, a tracing network training process is further determined by a set of predefined hyper-parameters that help steer the learning behavior. Common hyper-parameters include the initial learning rate, gradient clip value, regularization strength (λ), the number of datapoints in a mini batch (i.e. mini batch size), and the number of epochs included in the training process. The techniques for selecting hyper-parameters are described in Section IV-B.
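Both formulas are easy to restate in code. The following sketch (PyTorch, with hypothetical helper names) computes the objective of Eq. (10) and the decayed learning rate of Eq. (11); the adaptive rmsprop-style adjustment [67] used during training is omitted.

```python
import torch
import torch.nn.functional as F

def objective(logits, labels, params, lam):
    """Eq. (10): mean negative log likelihood plus L2 regularization."""
    nll = F.cross_entropy(logits, labels)        # -(1/N) sum log P(Y=y_i | x_i, theta)
    l2 = sum((p ** 2).sum() for p in params)     # squared Euclidean norm of theta
    return nll + 0.5 * lam * l2

def learning_rate(k, eps0, tau=500):
    """Eq. (11): linear decay from eps0 to eps_tau = eps0/100 over tau epochs."""
    alpha = min(k / tau, 1.0)
    return (1 - alpha) * eps0 + alpha * (eps0 / 100)
```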
IV. Experiment Setup

In this section, we explain the methods used to (1) prepare data, (2) systematically tune the configuration (i.e. network settings and hyper-parameters) of the tracing network, and (3) compare the performance of the best configuration against other popular trace evaluation methods.

A. Data Preparation

To train the word embeddings, we used a corpus from the PTC domain that is comprised of 52.7MB of clean text extracted from related domain documents and software artifacts. The original corpus of domain documents was collected from the Internet by our industrial collaborators as part of their initial domain analysis process. We also added the latest Wikipedia dump, containing about 19.92GB of clean text, to the corpus and used it for one variant of the word embedding configuration. All documents were preprocessed by transforming characters to lower-case and removing all non-alphanumeric characters except for underscores and hyphens.
To train and evaluate other parts of the tracing network, we used PTC project data provided by our industrial collaborators. The dataset contains 1,651 Software Subsystem Requirements (SSRS) as source artifacts and 466 Software Subsystem Design Descriptions (SSDD) as target artifacts. Each source artifact contained an average of 33 tokens and described a functional requirement of the Back Office Server (BOS) subsystem. Each target artifact contained an average of 99 tokens and specified design details. There were 1,387 trace links between SSRS and SSDD artifacts, all of which were constructed and validated by our industrial collaborators. This dataset is considerably larger than those used in most previous studies on requirements traceability [5], [44], [34], and the task of creating links across such a large dataset represents a challenging industrial-strength tracing problem.

We randomly selected 45% of the 769,366 artifact pairs from the PTC project dataset (i.e. 1,651 × 466) for inclusion in a training set, 10% for a development set, and 45% as a testing set. Given a fixed tracing network configuration, the training set was used to update the network parameters (i.e. the weight and bias for the affine transformation in each layer) in order to minimize the objective loss function. The development set was used to select the best general model during an initial training process to ensure that the model was not overtrained. The test data was set aside and only used for evaluating the performance of the final network model.

Software project data exhibits special characteristics that impact the training of a neural network [30]. In particular, the number of actual trace links is usually very small for a given set of source and target artifacts compared to the total number of artifact pairs. In our dataset, among all 769,366 artifact pairs, only 0.18% are valid links. Training a neural network using such an unbalanced dataset is very challenging. A common and relatively simple approach for handling unbalanced datasets is to weight the minority class (i.e. the linked artifacts) higher than the majority class (i.e. the non-linked ones). However, in the gradient descent method, a larger loss weighting could improperly amplify the gradient update for the minority class, making the training unstable and causing failure to converge. Another common way to handle unbalanced data is to downsample the majority class in order to produce a fixed and balanced training set. Based on initial experimentation we found that this approach did not yield good results because the examples of non-links used for training the network tended to randomly exclude artifact pairs that lay at the frontier at which links and non-links are differentiated. Furthermore, based on initial experimentation, we also ruled out the upsampling method because it considerably increased the size of the training set, excessively prolonging the training time.

Based on further experimentation we adopted a strategy that dynamically constructed balanced training sets using sub-datasets. In each epoch, a balanced training set was constructed by including all valid links from the original training set as well as a randomly selected equal number of non-links from the training set. The selection of non-links was updated at the start of each epoch. This approach ensured that over time the sampled non-links used for training were representative and preserved an equal contribution of links and non-links during each epoch. Our initial experimental results showed this technique to be effective for training our tracing network.
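A minimal sketch of this per-epoch re-balancing, with illustrative data structures (each pair is a source/target artifact pair, labeled 1 for link and 0 for non-link):

```python
import random

def balanced_epoch_data(links, non_links):
    """All valid links plus an equal-sized, freshly drawn sample of non-links."""
    sampled = random.sample(non_links, len(links))   # re-drawn at the start of each epoch
    data = [(pair, 1) for pair in links] + [(pair, 0) for pair in sampled]
    random.shuffle(data)
    return data
```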
TABLE I: Tracing Network Configuration Search Space

Word Embedding Source        PTC docs – 50 dim; PTC docs + Wikipedia dump – 300 dim
RNN Unit Type                GRU, LSTM, BI-GRU, BI-LSTM (AveVect as baseline)
RNN Layer                    1, 2
Hidden Dimension             RNN 30 + Intg 10; RNN 60 + Intg 20
Init Learning Rate (lr)      1e-03, 1e-02, 1e-01
Gradient Clip Value (gc)     10, 100
Regularization Strength (λ)  1e-04, 1e-03
Mini Batch Size              1
Epoch                        60

B. Model Selection and Hyper-Parameters Optimization

Finding suitable network settings and a good set of hyper-parameters is crucial to the success of applying deep learning methods to practical problems [54]. However, given the running time required for training, the search space of all possible combinations of different configurations was too large to provide full coverage. We therefore first identified several configurations that were expected to produce good performance. This was accomplished by manually observing how the training loss changed during early epochs and following heuristics suggested in [8]. We then created a network configuration search space centered around these manually identified configurations; our search space is summarized in Table I. We conducted a grid search and trained all combinations of the configuration options in Table I using the training set, and then compared their performance on the development set to find the best configuration. We describe our search space below.

For learning the word embeddings, we used the Skip-gram model provided by the Word2vec tool [48]. We trained the word vectors with two settings: 50-dimension vectors using the PTC corpus only, and 300-dimension vectors using both the PTC corpus and the Wikipedia dump. The number of dimensions is set differently because the PTC corpus (38,771 tokens) contains considerably fewer tokens than the PTC + Wikipedia dump (8,025,288 tokens). While a smaller vector dimension results in faster training, a larger dimension is needed to effectively represent the semantics of all tokens in the latter corpus.

To compare which variation of RNN best suits the tracing network, we evaluated GRU, LSTM, bi-directional GRU (BI-GRU), and bi-directional LSTM (BI-LSTM), with both the 1- and 2-layer structures introduced in Section II-C. The hidden dimensions in each RNN unit were set to either 30 or 60, while the hidden dimensions for the Integration layer were set to 10 or 20 correspondingly. As a baseline method, we also replaced the RNN layers with a bag-of-words method in which the semantic vector of an artifact is simply set to be the average of all word vectors contained in the artifact ("AveVect" in Table I). We also summarize the search space for the other hyper-parameters of the tracing network in Table I.
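The grid search then amounts to enumerating the cross product of the Table I options; the sketch below uses a placeholder for the actual training and development-set evaluation:

```python
from itertools import product

search_space = {
    "rnn_unit": ["GRU", "LSTM", "BI-GRU", "BI-LSTM"],
    "layers": [1, 2],
    "dims": [(30, 10), (60, 20)],   # (RNN hidden, Integration hidden)
    "lr": [1e-3, 1e-2, 1e-1],
    "grad_clip": [10, 100],
    "lam": [1e-4, 1e-3],
}

def dev_loss(config):
    """Placeholder: train the network for 60 epochs, return dev-set loss."""
    raise NotImplementedError

configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
# best = min(configs, key=dev_loss)   # pick the configuration with the lowest dev loss
```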
C. Comparison of Tracing Methods

In practical requirements tracing settings, a tracing method returns a list of candidate links between a source artifact, serving the role of user query, and a set of target artifacts. An effective algorithm would return all valid links close to the top of the list. The effectiveness of a tracing algorithm is therefore often measured using Mean Average Precision (MAP). To calculate MAP, we first calculate the Average Precision (AP) for each individual query as:

AP = [ Σ_{i=1}^{|Retrieved|} Precision(i) × relevant(i) ] / |RelevantLinks|    (12)

where |Retrieved| is the number of retrieved links, i is the rank in the sequence of retrieved candidate links, relevant(i) is a binary function assigned 1 if the link is valid and 0 otherwise, and Precision(i) is the precision computed after truncating the list immediately below i. Then, Mean Average Precision (MAP) is computed as the mean AP across all queries. In typical information retrieval settings, MAP is computed for the top N returned links; however, for traceability purposes we compute it when returning all valid links as specified in the trace matrix. This means that our version of MAP is computed for recall of 100%.
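Eq. (12) and the 100%-recall MAP translate directly into code; the sketch below assumes each query supplies its full ranked candidate list and its set of valid links:

```python
def average_precision(ranked_candidates, valid_links):
    """Eq. (12): AP for one source-artifact query over the full ranked list."""
    hits, ap = 0, 0.0
    for i, candidate in enumerate(ranked_candidates, start=1):
        if candidate in valid_links:
            hits += 1
            ap += hits / i                # Precision(i) counted only at valid links
    return ap / len(valid_links)          # all valid links retrieved: recall = 100%

def mean_average_precision(queries):
    """queries: iterable of (ranked_candidates, valid_links) pairs."""
    aps = [average_precision(r, v) for r, v in queries]
    return sum(aps) / len(aps)
```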
We computed MAP using the test dataset only, and compared the performance of our tracing network with other popular tracing methods, i.e. the Vector Space Model (VSM) and Latent Semantic Indexing (LSI). To make a fair comparison, we also optimized the configurations for the VSM and LSI methods using a Genetic Algorithm to search through an extensive configuration space of preprocessors and parameters [43]. Finally, we configured VSM to use a local Inverse Document Frequency (IDF) weighting scheme when calculating the cosine similarity [34]. LSI was reduced to 75% dimensions. For both VSM and LSI we preprocessed the text to remove …

… recall curve. The graph depicts recall and precision scores at different similarity or probability values. The Precision-Recall Curve thus shows trade-offs between precision and recall and provides insights into where each method performs best – for example, whether a technique improves precision at higher or lower levels of recall [11]. A curve that is farther away from the origin indicates better performance.

V. Results and Discussion

In this section, we report (1) the best configurations found in our network configuration search space, (2) the performance of the tracing network with the best configuration compared against VSM and LSI, and (3) the performance of the tracing network when trained with a larger training set of data.

A. What is the best configuration for the tracing network?

This experiment aims to address the first research question (RQ1). When optimizing the network configuration of the tracing network, we first selected the best configuration for each RNN unit type; Table II summarizes these results. We found that the best configurations for all four RNN unit types were very similar: a one-layer RNN model with 30 hidden dimensions and an Integration layer of 10 hidden dimensions, a learning rate of 1e-02, a gradient clip value of 10, and a λ of 1e-04. Performance varies for the different RNN unit types.

TABLE II: Best Configuration for Each RNN Unit Type

RNN Unit   Dev. Loss   Word Emb.   L   Dr   Ds   lr    gc    λ
BI-GRU     .1045       PTC         1   30   10   .01   10    .0001
GRU        .1301       PTC         1   30   10   .01   10    .0001
BI-LSTM    .1434       PTC         1   30   10   .01   100   .0001
LSTM       .2041       PTC+Wiki    1   30   10   .01   10    .0001

L – number of layers in the RNN model, Dr – hidden dimension in RNN unit, Ds – hidden dimension in Integration layer, lr – initial learning rate, gc – gradient clipping value, λ – regularization strength

Figure 4 illustrates the learning curves on the training dataset for the four configurations. All four RNN unit types outperformed the Average Vector method. This supports our hypothesis that word order plays an important role when comparing the semantics of sentences. Both GRU and BI-GRU achieved faster convergence and more desirable (smaller) loss than LSTM and BI-LSTM. Although quite similar, the bidirectional models performed slightly better than the unidirectional models for both GRU and LSTM on the training set; the bidirectional models also achieved better results on the development dataset compared to their unidirectional counterparts. As a result, the overall best performance was achieved using BI-GRU configured as shown in Table II.

Fig. 4: Comparison of learning curves for RNN variants using their best configurations. GRU and BI-GRU (left) converged faster and achieved smaller loss than LSTM and BI-LSTM (right). They all outperformed the baseline.

We also found that in three of the four best configurations, the word embedding vectors were trained using the PTC corpus alone. We speculate that one reason the PTC-trained word vectors performed better than the PTC+Wiki-trained vectors is due to differences in the content of the two corpora. The PTC+Wiki corpus contains significantly more words that are used in diverse contexts because the majority of articles in the Wiki corpus are not related to the PTC domain. In the case of words that appear commonly in Wiki articles but convey specific meanings in the PTC domain (e.g. message, administrator, field, etc.), their context in more general articles
is likely to negatively affect the reasoning task on domain specific semantics when there is insufficient training data to disambiguate their usage. We also tested the tracing network performance using the 300-dimension word embedding trained by … found in Table II. These findings suggest that using only the domain corpus to train word vectors with a reasonable size is …

B. … retrieval algorithms?

Fig. 5: Precision-Recall Curve on test set – 45% total data

(a) GRU output at one time step
… update gate for this dimension was also small for most of the input words until the words datapoint, transfer and to came. Those spikes indicate moments when the temporary memory was integrated into the persisting memory. As such, we can speculate that this semantic vector dimension functions to accumulate information from specific keywords across the entire sentence. Conversely, for the 12th dimension, the reset gate was constantly high, indicating that the information stored in the temporary memory was based on both the previous output and the current input. Therefore, the actual output is more sensitive to local context than to a single keyword. This is confirmed by the fluctuating shape of the actual output shown in the figure. For example, the output value remained in the same range until the topic changed from "datapoint indication" to "message transfer". We believe that this versatile behavior of the gating mechanism might enable GRU to encode complex semantic information from a sentence into a vector to support trace link generation and other challenging NLP tasks.

From Figure 5, we also notice that the tracing network hits a glass ceiling for improving precision above 0.27. We consider this to be caused by its inability to rule out some false positive links that contain valid associations. For example, the tracing network assigns a 97.27% probability of a valid link between the artifact "The BOS administrative toolset shall allow an authorized administrator to view internal errors generated by the BOS" and the artifact "The MessagesEvents panel provides the functionality to view message and event logs. The panel provides searching and filtering capabilities where the user can search by a number of parameters depending on the type of data the user wants to view". There are direct associations between these artifacts: the MessagesEvents panel is part of the BOS administrative toolset for viewing message and event logs, an administrator is a system user, and internal errors generated by the BOS are events. But this is not a valid link because the MessagesEvents panel only displays messages and events related to external Railroad Systems rather than to internal events. It is likely that the tracing network fails to exclude this link because it has not been exposed to sufficient similar negative examples in the training data. As we described in Section IV-A, every positive example is used while negative examples (i.e. non-links) are randomly selected during each epoch. We plan to explore more adequate methods for handling the unbalanced data problems caused by the characteristics of tracing data in our future work.

C. How does the tracing network react to more training data?

The number of trace links tends to increase as a software project evolves. To explore the potential impact of folding them into the training set, we increased the training dataset to 80%. We randomly selected part of the test data and moved it into the training set to reach 80%, while retaining the remaining data in the testing set. Using the same configuration as described in Section IV-B, we then retrained the tracing network. Because the size of the test set decreased to 10%, we could not make direct comparisons to our previous results. Instead we used both of the trained tracing networks (i.e. trained with 45% and 80% of the data respectively) to generate trace links against the same small test set (sized at 10%). To reduce the effect of random data selection, we repeated this process five times and report the average results.

With the larger training set, the MAP was .834 compared to .803 for the smaller training set. Results are depicted in the Precision-Recall Curve in Figure 7. With increased training data, the network can better differentiate links and non-links, and therefore improve both precision and recall. Improvements were observed especially at low levels of recall.

Fig. 7: Precision-Recall Curve on test set – 10% total data.

The performance of the tracing network trained using 80% of the data was again compared against VSM and LSI for this larger training set. A Friedman test identified a statistically significant difference among the APs associated with the three methods on this new dataset division (χ²(2) = 141.11, p < .001). Using pairwise Wilcoxon signed ranks tests with Bonferroni p-value adjustments, we found that our tracing network performed significantly better (MAP = .834) than VSM (MAP = .625; p < .001) and LSI (MAP = .637; p < .001). As such, we address RQ2 and conclude that in general our tracing network improved trace link accuracy in comparison to both VSM and LSI, and that improvements were more marked as the size of the training set increased. We expect additional improvements from reconfiguring the tracing network for use with a larger training set [8]. However, we also observed that when using the original training set (i.e. 45% of the data), our tracing network only outperformed LSI at higher levels of recall, as shown in Figure 5 and Figure 7. In future work, we plan to explore the trade-offs between these two methods for specific data features, and to further improve the performance of the tracing network.
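The reported tests map directly onto SciPy; in the sketch below, ap_rnn, ap_vsm, and ap_lsi are placeholders for the per-query AP lists of the three methods:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# ap_rnn, ap_vsm, ap_lsi: one AP value per source-artifact query and method.
chi2, p = friedmanchisquare(ap_rnn, ap_vsm, ap_lsi)

# Pairwise follow-up tests with a Bonferroni adjustment for 3 comparisons.
pairs = {"RNN-VSM": (ap_rnn, ap_vsm), "RNN-LSI": (ap_rnn, ap_lsi), "VSM-LSI": (ap_vsm, ap_lsi)}
adjusted = {name: min(wilcoxon(a, b).pvalue * len(pairs), 1.0) for name, (a, b) in pairs.items()}
```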
VI. Related Work

In this section we focus on prior work that has integrated ontology, semantics, or NLP into the tracing process. Researchers have attempted to improve bag-of-words approaches such as the Vector Space Model (VSM) [34] by integrating matching terms, project glossaries, and other forms of thesauri [34].
Basic enhancements have included user feedback techniques such as Rocchio [34] or Direct Query Manipulation (DQM) [61] to increase or decrease term weights. However, these approaches fail to leverage semantic information.

Other techniques identify terms for bridging the term mismatch between source and target artifacts. Dietrich et al. utilized validated trace links to identify frequent item sets of terms occurring across pairs of source and target artifacts, and then used these to augment the text in the trace query [19]. Gibiec et al. [24] approached trace query augmentation by acquiring related documents from the Internet, and then extracting domain related terms. Researchers have also used phrase detection and chunking to search for requirements impacted by change requests [2] or to improve the trace retrieval process [70]. None of these techniques attempted to understand the semantics of the artifacts.

Researchers have also explored the use of knowledge bases to create and utilize semantically aware associations in the trace creation process – where a knowledge base includes basic domain terms and sentences that describe the relationships between those terms [36], [27], [62]. Data is typically represented as an ontology in which relationships are represented using AND, OR, implication, and negation operators [36]. Traceability researchers have proposed the idea of using an ontology to connect source and target artifacts [31], [4]. Approaches have been proposed for weighting the evidence for a trace link according to the distance between concepts in the ontology [42]. Unfortunately, building domain-specific ontologies is time consuming, and ontologies are generally not available for technical software engineering domains.

Finally, while researchers have proposed techniques that more closely mimic the way human analysts reason about trace links and perform tracing tasks [28], [46], there is very limited work in this area. Our prior work with DoCIT, described in Section I, is one exception [28]. DoCIT utilizes both an ontology and heuristics to reason over concepts in the domain in order to deliver accurate trace links. However, as previously explained, DoCIT requires non-trivial setup costs to build a customized ontology and heuristics for each domain, and is sensitive to flaws in syntactic parsing. In contrast, the RNN approach described in this paper requires only a corpus of domain documents and a training set of validated trace links.

On the other hand, deep learning has been successfully applied to many software engineering tasks. For example, Lam et al. combined deep neural networks with information retrieval techniques to identify buggy files in bug reports [39]. Wang et al. utilized a deep belief network to extract semantic features from source code for the purpose of defect prediction [69]. Raychev et al. adopted RNN and N-gram models to build the language model for the task of synthesizing code completions [55]. In our literature review, we were not able to find prior work that applied deep learning techniques to traceability tasks.

VII. Threats to Validity

Two primary threats to validity potentially impact our work. First, due to the challenge of obtaining large industrial datasets including artifacts and trace links, and the time needed to experiment with different algorithms for learning word embeddings and generating trace links, our work focused on the single domain of Positive Train Control. As a result, we cannot claim generalizability. However, the PTC dataset included text taken from external regulations, and was written by multiple requirements engineers, systems engineers, and developers. The threat to validity arises primarily from the possibility that characteristics of our specific dataset may have impacted the results of our experiment. For example, the size of the overall dataset, the characteristics of the vocabulary used, and/or the nature of each individual artifact may make our approach more or less effective. In the next phase of our work we will evaluate our approach on additional datasets.

Second, we cannot guarantee that the trace matrix used for evaluation is 100% correct. However, it was provided by our industrial collaborators and used throughout their project to demonstrate coverage of regulatory codes. The metrics we used (i.e. MAP, Recall, and Precision) are all accepted research standards for evaluating trace results [34]. To avoid comparison against a weak baseline, we report comparisons against two standard baselines, VSM and LSI, and configured them using a Genetic Algorithm.

VIII. Conclusions

In this paper, we have proposed a neural network architecture that utilizes word embedding and RNN techniques to automatically generate trace links. The Bidirectional Gated Recurrent Unit effectively constructed semantic associations between artifacts, and delivered significantly higher MAP scores than either VSM or LSI when evaluated on our large industrial dataset. It also notably increased both precision and recall. Given an initial training set of trace links, our tracing network is fully automated and highly scalable. In future work, we will focus on improving the precision of the tracing network by identifying and including more representative negative examples in the training set.

The tracing network is currently trained to process natural language text. In future work, we will investigate techniques for applying it to other types of artifacts such as source code or formatted data. Finally, given the difficulty and limitations of acquiring large corpora of data, we will investigate hybrid approaches that combine human knowledge with the neural network. In summary, the findings we have presented in this paper have demonstrated that deep learning techniques can be effectively applied to the tracing process. We see this as a non-trivial advance in our goal of automating the creation of accurate trace links in industrial-strength datasets.

IX. Acknowledgments

The work in this paper was partially funded by the US National Science Foundation Grant CCF-1319680.
References

[1] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Trans. Softw. Eng., 28(10):970–983, 2002.
[2] C. Arora, M. Sabetzadeh, L. C. Briand, F. Zimmer, and R. Gnaga. Automatic checking of conformance to requirement boilerplates via text chunking: An industrial case study. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Baltimore, Maryland, USA, October 10-11, 2013, pages 35–44, 2013.
[3] C. Arora, M. Sabetzadeh, A. Goknil, L. C. Briand, and F. Zimmer. Change impact analysis for natural language requirements: An NLP approach. In 23rd IEEE International Requirements Engineering Conference, RE 2015, Ottawa, ON, Canada, August 24-28, 2015, pages 6–15, 2015.
[4] N. Assawamekin, T. Sunetnanta, and C. Pluempitiwiriyawej. Ontology-based multiperspective requirements traceability framework. Knowl. Inf. Syst., 25(3):493–522, 2010.
[5] H. U. Asuncion, A. Asuncion, and R. N. Taylor. Software traceability with topic modeling. In 32nd ACM/IEEE International Conference on Software Engineering (ICSE), pages 95–104, 2010.
[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016.
[8] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer, 2012.
[9] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[10] T. J. Biggerstaff, B. G. Mitbander, and D. E. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72–82, 1994.
[11] M. Buckland and F. Gey. The relationship between recall and precision. Journal of the American Society for Information Science, 45(1):12, 1994.
[12] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[14] J. Cleland-Huang, O. Gotel, J. H. Hayes, P. Mäder, and A. Zisman. Software traceability: trends and future directions. In Proceedings of the Future of Software Engineering, FOSE 2014, Hyderabad, India, May 31 - June 7, 2014, pages 55–69, 2014.
[15] J. Cleland-Huang and J. Guo. Towards more intelligent trace retrieval algorithms. In (RAISE) Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, 2014.
[16] J. Cleland-Huang, M. Rahimi, and P. Mäder. Achieving lightweight trustworthy traceability. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16 - 22, 2014, pages 849–852, 2014.
[17] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Enhancing an artefact management system with traceability recovery features. In International Conference on Software Maintenance, pages 306–315, Washington, DC, USA, 2004. IEEE Computer Society.
[18] A. Dekhtyar, J. Huffman Hayes, S. K. Sundaram, E. A. Holbrook, and O. Dekhtyar. Technique integration for requirements assessment. In 15th IEEE International Requirements Engineering Conference (RE), pages 141–150, 2007.
[19] T. Dietrich, J. Cleland-Huang, and Y. Shin. Learning effective query transformations for enhanced requirements trace retrieval. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013, Silicon Valley, CA, USA, November 11-15, 2013, pages 586–591, 2013.
[20] B. Dit, M. Revelle, and D. Poshyvanyk. Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empirical Software Engineering, 18(2):277–309, 2013.
[21] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[22] Federal Aviation Authority (FAA). DO-178B: Software Considerations in Airborne Systems and Equipment Certification.
[23] Food and Drug Administration. Guidance for the Content of Premarket Submissions for Software Contained in Medical Devices, 2005.
[24] M. Gibiec, A. Czauderna, and J. Cleland-Huang. Towards mining replacement queries for hard-to-retrieve traces. In ASE '10: Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pages 245–254, New York, NY, USA, 2010. ACM.
[25] O. Gotel, J. Cleland-Huang, J. H. Hayes, A. Zisman, A. Egyed, P. Grünbacher, A. Dekhtyar, G. Antoniol, J. I. Maletic, and P. Mäder. Traceability fundamentals. In Software and Systems Traceability, pages 3–22. Springer, 2012.
[26] O. C. Z. Gotel and A. Finkelstein. An analysis of the requirements traceability problem. In Proceedings of the First IEEE International Conference on Requirements Engineering, ICRE '94, Colorado Springs, Colorado, USA, April 18-21, 1994, pages 94–101, 1994.
[27] T. Gruber. Ontology. In Encyclopedia of Database Systems, pages 1963–1965. Springer US, 2009.
[28] J. Guo, J. Cleland-Huang, and B. Berenbach. Foundations for an expert system in domain-specific traceability. In 21st IEEE International Requirements Engineering Conference, RE 2013, Rio de Janeiro-RJ, Brazil, July 15-19, 2013, pages 42–51, 2013.
[29] J. Guo, N. Monaikul, C. Plepel, and J. Cleland-Huang. Towards an intelligent domain-specific traceability solution. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, pages 755–766. ACM, 2014.
[30] J. Guo, M. Rahimi, J. Cleland-Huang, A. Rasin, J. H. Hayes, and M. Vierhauser. Cold-start software analytics. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR 2016, Austin, TX, USA, May 14-22, 2016, pages 142–153, 2016.
[31] S. Hayashi, T. Yoshikawa, and M. Saeki. Sentence-to-code traceability recovery with domain ontologies. In J. Han and T. D. Thu, editors, APSEC, pages 385–394. IEEE Computer Society, 2010.
[32] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 1994.
[33] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[34] J. Huffman Hayes, A. Dekhtyar, and S. K. Sundaram. Advancing candidate link generation for requirements tracing: The study of methods. IEEE Transactions on Software Engineering, 32(1):4–19, 2006.
[35] M. Iyyer, J. L. Boyd-Graber, L. M. B. Claudino, R. Socher, and H. Daumé III. A neural network for factoid question answering over paragraphs. In EMNLP, pages 633–644, 2014.
[36] P. Jackson. Introduction to Expert Systems (3rd ed.). Addison Wesley, 1998.
[37] A. Karpathy, J. Johnson, and F. Li. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078, 2015.
[38] J. C. Knight. Safety critical systems: challenges and directions. In 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA, pages 547–550, 2002.
[39] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. Combining deep learning with information retrieval to localize buggy files for bug reports. In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 476–481. IEEE, 2015.
[40] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[41] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.
[42] Y. Li and J. Cleland-Huang. Ontology-based trace retrieval. In Traceability in Emerging Forms of Software Engineering (TEFSE 2013), San Francisco, USA, May 2013.
[43] S. Lohar, S. Amornborvornwong, A. Zisman, and J. Cleland-Huang. Improving trace accuracy through data-driven configuration and composition of tracing features. In 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pages 378–388, 2013.
[44] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology (TOSEM), 16(4):13, 2007.
[45] P. Mäder, P. L. Jones, Y. Zhang, and J. Cleland-Huang. Strategic traceability for safety-critical projects. IEEE Software, 30(3):58–66, 2013.
[46] A. Mahmoud, N. Niu, and S. Xu. A semantic relatedness approach for traceability link recovery. In IEEE 20th International Conference on Program Comprehension, ICPC 2012, Passau, Germany, June 11-13, 2012, pages 183–192, 2012.
[47] A. Mahmoud and G. Williams. Detecting, classifying, and tracing non-functional software requirements. Requir. Eng., 21(3):357–381, 2016.
[48] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[49] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, page 3, 2010.
[50] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
[51] N. Niu and A. Mahmoud. Enhancing candidate link generation for requirements tracing: The cluster hypothesis revisited. In 2012 20th IEEE International Requirements Engineering Conference (RE), Chicago, IL, USA, September 24-28, 2012, pages 81–90, 2012.
[52] R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. CoRR, abs/1312.6026, 2013.
[53] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[54] N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11):e1000579, 2009.
[55] V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In ACM SIGPLAN Notices, volume 49, pages 419–428. ACM, 2014.
[56] P. Rempel, P. Mäder, T. Kuschke, and J. Cleland-Huang. Traceability gap analysis for assessing the conformance of software traceability to relevant guidelines. In Software Engineering & Management 2015, Multikonferenz der GI-Fachbereiche Softwaretechnik (SWT) und Wirtschaftsinformatik (WI), FA WI-MAW, 17. März - 20. März 2015, Dresden, Germany, pages 120–121, 2015.
[57] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kociský, and P. Blunsom. Reasoning about entailment with neural attention. CoRR, abs/1509.06664, 2015.
[58] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[59] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
[60] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[61] Y. Shin and J. Cleland-Huang. A comparative evaluation of two user feedback techniques for requirements trace retrieval. In SAC, pages 1069–1074, 2012.
[62] P. Shvaiko and J. Euzenat. Ontology matching: State of the art and future challenges. IEEE Trans. Knowl. Data Eng., 25(1):158–176, 2013.
[63] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
[64] G. Spanoudakis, A. Zisman, E. Pérez-Miñana, and P. Krause. Rule-based generation of requirements traceability relations. Journal of Systems and Software, 72(2):105–127, 2004.
[65] H. Sultanov, J. Huffman Hayes, and W.-K. Kong. Application of swarm techniques to requirements tracing. Requirements Engineering, 16(3):209–226, 2011.
[66] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
[67] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[68] Federal Railroad Administration. PTC System Information. https://www.fra.dot.gov/Page/P0358. Accessed: 2016-08-26.
[69] S. Wang, T. Liu, and L. Tan. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, pages 297–308. ACM, 2016.
[70] X. Zou, R. Settimi, and J. Cleland-Huang. Improving automated requirements trace retrieval: a study of term-based enhancement methods. Empirical Software Engineering, 15(2):119–146, 2010.