(ICSE17) Semantically Enhanced Software Traceability Using Deep Learning Techniques
Abstract—In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand the semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus, and the RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly out-performed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.

Keywords—Traceability, Deep Learning, Recurrent Neural Network, Semantic Representation.

I. Introduction

Requirements traceability plays an essential role in the software development process. Defined as "the ability to describe and follow the life of a requirement in both a forwards and backwards direction through periods of ongoing refinement and iteration" [26], traceability supports a diverse set of software engineering activities including change impact analysis, regression test selection, cost prediction, and compliance verification [25]. In high-dependability systems, regulatory standards, such as the US Federal Aviation Authority's (FAA) DO178b/c [22], prescribe the need for trace links to be established and maintained between hazards, faults, requirements, design, code, and test cases in order to demonstrate that a system is safe for use [38], [23]. Unfortunately, the tracing task is arduous to perform and error-prone [46], even when industrial tools are used to manually create links or to capture them as a byproduct of the development process [14]. In practice, trace links are often incomplete and inaccurate [16], even in safety-critical systems [45], [56].

To address these problems, researchers have proposed and developed solutions for automating the task of creating and maintaining trace links [1], [17], [34]. Solutions have included information retrieval approaches [17], [18], [5], machine learning [47], [30], [51], heuristic techniques [64], [28], and AI swarming algorithms [65]. Other approaches, especially in the area of feature location [20], require additional information obtained from runtime execution traces. Results have been mixed, especially when applied to industrial-sized datasets, where acceptable recall levels above 90% can often only be achieved at extremely low levels of precision [43].

One of the primary reasons that automated approaches have underperformed is the term mismatch that often exists between pairs of related artifacts [10]. To illustrate this we draw on an example from the Positive Train Control (PTC) domain. PTC is a communication-based train control system designed to ensure that trains follow directives in order to prevent accidents from occurring [68]. The requirement stating that "The BOS Administrative Toolset shall allow the Authorized Administrator to view an On-board's last reported On-board Software Version, including the associated repository name, MD5, and whether the fileset is preferred or acceptable." is associated with the design artifact stating that "The Operational Data Panel is used to provide information about the current PTC operations in a subdivision". Recognizing and establishing this link requires non-trivial knowledge of domain concepts – for example, understanding that the BOS Administrative Toolset contains the Operational Data Panel, that each locomotive contains an On-board unit for PTC operation, and that the Operational Data Panel displays information about locomotives, such as the On-board Software Version, to the BOS Authorized Administrator. This link would likely be missed by popular trace retrieval algorithms such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI), and Latent Dirichlet Allocation (LDA), which all represent artifacts as bags of words and therefore lose the artifacts' embedded semantics. It would also be missed by techniques that incorporate phrasing without understanding their conceptual associations [3], [15]. In fact, most current techniques lack the sophistication needed to reason about semantic associations between artifacts and therefore fail to establish trace links when there is little meaningful overlap in the use of terms.

In our prior work we developed Domain-Contextualized Intelligent Traceability (DoCIT) [29] as a proof-of-concept solution to investigate the integration of domain knowledge into the tracing process. We demonstrated that for the domain …
where W, U and b are the affine transformation parameters, and tanh is the hyperbolic tangent function: tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

Figure 2 (a) illustrates a typical LSTM unit. Using the retained memory cell state and the gating mechanism, the LSTM unit "remembers" information until it is erased by the forget gate; …
… much of the current output is updated using h̃_t and h_{t−1}:

h_t = (1 − u_t) ⊙ h_{t−1} + u_t ⊙ h̃_t    (7)
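As a concrete illustration, the following is a minimal NumPy sketch of one GRU step ending in the interpolation of Eq. (7). The reset- and update-gate formulas follow the standard GRU of Cho et al. [12]; the parameter names are illustrative and not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds affine parameters W, U, b per gate."""
    u = sigmoid(p["W_u"] @ x_t + p["U_u"] @ h_prev + p["b_u"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # Candidate state built from the current input and the reset-gated memory.
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return (1.0 - u) * h_prev + u * h_tilde                      # Eq. (7)
```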
In addition to modifying the structure within a single RNN unit, the structure of the overall RNN network can be varied. For instance, multi-layered RNNs [52] stack more than one RNN unit at each time step [59] with the aim of extracting more abstract features from the input sequence. In contrast, bi-directional RNNs [60] process sequential data in both forward and backward directions at the same time; this enables the output to be influenced by both past and future data in the sequence. In this study, we also explored the use of two-layered RNNs and bidirectional RNNs for generating trace links.
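Both structural variants are single configuration switches in modern toolkits. The snippet below is only an illustrative PyTorch stand-in (the paper's implementation uses the Torch/Lua framework), with made-up dimensions:

```python
import torch.nn as nn

# Illustrative dimensions only: 50-dim word vectors, 30 hidden units.
two_layer_gru = nn.GRU(input_size=50, hidden_size=30, num_layers=2)   # multi-layered RNN
bi_gru = nn.GRU(input_size=50, hidden_size=30, bidirectional=True)    # forward + backward passes
```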
The design of the neural network architecture is shown in Figure 3. Given the textual content of a source artifact A_s and a target artifact A_t, each word in A_s and A_t is first mapped onto its vector representation through the Word Embedding layer. Such mappings are trained from the domain corpus using the Skip-gram model introduced in Section II-A. The vectors of words in the source artifact, s_1, s_2, ..., s_m, are then sent to the RNN layers sequentially and output as a single vector v_s representing its semantic information. In the case of the bidirectional RNN, the word vectors are also sent in reverse order as s_m, s_{m−1}, ..., s_1. The target semantic vector v_t is generated in the same way using RNN layers. Finally, these two vectors are compared in the Semantic Relation Evaluation layers.

The Semantic Relation Evaluation layers in our tracing network adopt the structure proposed by Tai et al. [66], targeted to perform semantic entailment classification tasks for sentence pairs. The overall calculation of this part of the network can be represented as:

r_pmul = v_s ⊙ v_t
r_sub = |v_s − v_t|                                        (8)
r = σ(W^r r_pmul + U^r r_sub + b^r)
p_tracelink = softmax(W^p r + b^p)

where

softmax(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k}, for j = 1, ..., K    (9)

Here, ⊙ is the point-wise multiplication operator used to compare the direction of the source and target vectors on each dimension. The absolute vector subtraction result, r_sub, represents the distance between the two vectors in each dimension. The network then uses a hidden sigmoid layer to integrate r_pmul and r_sub and output a single vector to represent their semantic similarity. Finally, the output softmax layer uses the result to produce the probability that a valid trace link exists; the result of a softmax function is a K-dimensional vector of real values in the range (0, 1) that add up to 1 (K = 2 in this case).

A concrete tracing network is built upon this architecture and is further configured by a set of network settings. Those settings specify the type of RNN unit (i.e. GRU or LSTM), the number of hidden dimensions in the RNN units and the Semantic Relation Evaluation layers, and other RNN variables such as the number of RNN layers and whether to use a bidirectional RNN. To address our first research question (RQ1) we explored several different configurations. We describe how we optimized network settings in Section IV-B. The tracing network is implemented on the Torch framework (http://torch.ch), and the source code is available at https://github.com/jin-guo/TraceNN.
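The calculation of Eq. (8) is straightforward to express in code. The sketch below is a PyTorch approximation, not the authors' Torch/Lua implementation; folding W^r and U^r into a single linear layer over the concatenated inputs is mathematically equivalent to applying them separately, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SemanticRelationEvaluation(nn.Module):
    """Compares source/target semantic vectors and predicts link probability."""

    def __init__(self, sem_dim=30, hidden_dim=10):
        super().__init__()
        self.integration = nn.Linear(2 * sem_dim, hidden_dim)  # plays the role of W^r, U^r, b^r
        self.output = nn.Linear(hidden_dim, 2)                 # W^p, b^p with K = 2 classes

    def forward(self, v_s, v_t):
        r_pmul = v_s * v_t               # point-wise multiplication: directional agreement
        r_sub = torch.abs(v_s - v_t)     # per-dimension distance
        r = torch.sigmoid(self.integration(torch.cat([r_pmul, r_sub], dim=-1)))
        # Eq. (9): probabilities over {link, non-link} summing to 1.
        return torch.softmax(self.output(r), dim=-1)
```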
B. Training the Tracing Network

A powerful network is only useful when it can be properly trained using existing data and when it is generalizable to unseen data. To train the tracing network, we use the regularized negative log likelihood as our objective loss function to be minimized. This objective function is commonly used in categorical prediction models [7] and can be written as:

J(θ) = −(1/N) Σ_{i=1}^{N} log P(Y = y_i | x_i, θ) + (λ/2) ‖θ‖₂²    (10)

where θ indicates the network parameters that need to be trained, N is the total number of examples in the training data, x_i is the input of the i-th training example, and y_i is the actual category of that example (i.e. link or non-link); as a result, P(Y = y_i | x_i, θ) represents the network's prediction on the correct category given the current input and parameters. The second part of the loss function is an L2 parameter regularization that prevents overfitting, where ‖θ‖₂ is the Euclidean norm of θ, and λ controls the strength of the regularization.

Based on this loss function, we used a stochastic gradient descent method [67] to update the network parameters. According to this method, a typical training process is comprised of a number of epochs. Each epoch iterates through all of the training data one time to train the network; as such, the overall training process uses all the training data several times until the objective loss is sufficiently small or fails to decrease further. In each epoch, the training data is further randomly divided into a number of "mini batches;" each contains one or more training datapoints. After each batch is processed, a gradient of the parameters is calculated based on the loss function. The network then updates its parameters based on this gradient and a "learning rate" that specifies how fast the parameters move along the gradient direction. During training, we adopted an adaptive learning rate procedure [67] to adjust the learning rate based on the current training performance. To help the network converge, we also decreased the learning rate after each epoch until epoch τ, such that the learning rate ε_k at epoch k is determined by:

ε_k = (1 − α) ε_0 + α ε_τ    (11)

where ε_0 is the initial learning rate and α = k/τ. In our experiment, ε_τ is set to ε_0/100, and τ is set to 500.

Based on these general methods, a tracing network training process is further determined by a set of predefined hyper-parameters that help steer the learning behavior. Common hyper-parameters include the initial learning rate, gradient clip value, regularization strength (λ), the number of datapoints in a mini batch (i.e. mini batch size), and the number of epochs included in the training process. The techniques for selecting hyper-parameters are described in Section IV-B.
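Both formulas are easy to restate in code. The following sketch (PyTorch, with hypothetical helper names) computes the objective of Eq. (10) and the decayed learning rate of Eq. (11); the adaptive rmsprop-style adjustment [67] used during training is omitted.

```python
import torch
import torch.nn.functional as F

def objective(logits, labels, params, lam):
    """Eq. (10): mean negative log likelihood plus L2 regularization."""
    nll = F.cross_entropy(logits, labels)        # -(1/N) sum log P(Y=y_i | x_i, theta)
    l2 = sum((p ** 2).sum() for p in params)     # squared Euclidean norm of theta
    return nll + 0.5 * lam * l2

def learning_rate(k, eps0, tau=500):
    """Eq. (11): linear decay from eps0 to eps_tau = eps0/100 over tau epochs."""
    alpha = min(k / tau, 1.0)
    return (1 - alpha) * eps0 + alpha * (eps0 / 100)
```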
IV. Experiment Setup

In this section, we explain the methods used to (1) prepare data, (2) systematically tune the configuration (i.e. network settings and hyper-parameters) of the tracing network, and (3) compare the performance of the best configuration against other popular trace evaluation methods.

A. Data Preparation

To train the word embeddings, we used a corpus from the PTC domain that is comprised of 52.7MB of clean text extracted from related domain documents and software artifacts. The original corpus of domain documents was collected from the Internet by our industrial collaborators as part of their initial domain analysis process. We also added the latest Wikipedia dump, containing about 19.92GB of clean text, to the corpus and used it for one variant of the word embedding configuration. All documents were preprocessed by transforming characters to lower-case and removing all non-alphanumeric characters except for underscores and hyphens.
To train and evaluate other parts of the tracing network, we used PTC project data provided by our industrial collaborators. The dataset contains 1,651 Software Subsystem Requirements (SSRS) as source artifacts and 466 Software Subsystem Design Descriptions (SSDD) as target artifacts. Each source artifact contained an average of 33 tokens and described a functional requirement of the Back Office Server (BOS) subsystem. Each target artifact contained an average of 99 tokens and specified design details. There were 1,387 trace links between SSRS and SSDD artifacts, all of which were constructed and validated by our industrial collaborators. This dataset is considerably larger than those used in most previous studies on requirements traceability [5], [44], [34], and the task of creating links across such a large dataset represents a challenging industrial-strength tracing problem.

We randomly selected 45% of the 769,366 artifact pairs from the PTC project dataset (i.e. 1,651 × 466) for inclusion in a training set, 10% for a development set, and 45% as a testing set. Given a fixed tracing network configuration, the training set was used to update the network parameters (i.e. the weight and bias for the affine transformation in each layer) in order to minimize the objective loss function. The development set was used to select the best general model during an initial training process to ensure that the model was not overtrained. The test data was set aside and only used for evaluating the performance of the final network model.

Software project data exhibits special characteristics that impact the training of a neural network [30]. In particular, the number of actual trace links is usually very small for a given set of source and target artifacts compared to the total number of artifact pairs. In our dataset, among all 769,366 artifact pairs, only 0.18% are valid links. Training a neural network using such an unbalanced dataset is very challenging. A common and relatively simple approach for handling unbalanced datasets is to weight the minority class (i.e. the linked artifacts) higher than the majority class (i.e. the non-linked ones). However, in the gradient descent method, a larger loss weighting could improperly amplify the gradient update for the minority class, making the training unstable and causing failure to converge. Another common way to handle unbalanced data is to downsample the majority class in order to produce a fixed and balanced training set. Based on initial experimentation we found that this approach did not yield good results because the examples of non-links used for training the network tended to randomly exclude artifact pairs that lay at the frontier at which links and non-links are differentiated. Furthermore, based on initial experimentation, we also ruled out the upsampling method because it considerably increased the size of the training set, excessively prolonging the training time.

Based on further experimentation we adopted a strategy that dynamically constructed balanced training sets using sub-datasets. In each epoch, a balanced training set was constructed by including all valid links from the original training set as well as a randomly selected equal number of non-links from the training set. The selection of non-links was updated at the start of each epoch. This approach ensured that over time the sampled non-links used for training were representative and preserved an equal contribution of links and non-links during each epoch. Our initial experimental results showed this technique to be effective for training our tracing network.
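A minimal sketch of this per-epoch re-balancing, with illustrative data structures (each pair is a source/target artifact pair, labeled 1 for link and 0 for non-link):

```python
import random

def balanced_epoch_data(links, non_links):
    """All valid links plus an equal-sized, freshly drawn sample of non-links."""
    sampled = random.sample(non_links, len(links))   # re-drawn at the start of each epoch
    data = [(pair, 1) for pair in links] + [(pair, 0) for pair in sampled]
    random.shuffle(data)
    return data
```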
TABLE I: Tracing Network Configuration Search Space

Word Embedding Source        PTC docs – 50 dim; PTC docs + Wikipedia dump – 300 dim
RNN Unit Type                GRU, LSTM, BI-GRU, BI-LSTM (AveVect as baseline)
RNN Layer                    1, 2
Hidden Dimension             RNN 30 + Intg 10; RNN 60 + Intg 20
Init Learning Rate (lr)      1e-03, 1e-02, 1e-01
Gradient Clip Value (gc)     10, 100
Regularization Strength (λ)  1e-04, 1e-03
Mini Batch Size              1
Epoch                        60

B. Model Selection and Hyper-Parameters Optimization

Finding suitable network settings and a good set of hyper-parameters is crucial to the success of applying deep learning methods to practical problems [54]. However, given the running time required for training, the search space of all possible combinations of different configurations was too large to provide full coverage. We therefore first identified several configurations that were expected to produce good performance. This was accomplished by manually observing how the training loss changed during early epochs and following heuristics suggested in [8]. We then created a network configuration search space centered around these manually identified configurations; our search space is summarized in Table I. We conducted a grid search and trained all combinations of the configuration options in Table I using the training set, and then compared their performance on the development set to find the best configuration. We describe our search space below.

For learning the word embeddings, we used the Skip-gram model provided by the Word2vec tool [48]. We trained the word vectors with two settings: 50-dimension vectors using the PTC corpus only, and 300-dimension vectors using both the PTC corpus and the Wikipedia dump. The number of dimensions is set differently because the PTC corpus (38,771 tokens) contains considerably fewer tokens than the PTC + Wikipedia dump (8,025,288 tokens). While a smaller vector dimension results in faster training, a larger dimension is needed to effectively represent the semantics of all tokens in the latter corpus.

To compare which variation of RNN best suits the tracing network, we evaluated GRU, LSTM, bi-directional GRU (BI-GRU), and bi-directional LSTM (BI-LSTM), with both the 1- and 2-layer structures introduced in Section II-C. The hidden dimensions in each RNN unit were set to either 30 or 60, while the hidden dimensions for the Integration layer were set to 10 or 20 correspondingly. As a baseline method, we also replaced the RNN layers with a bag-of-words method in which the semantic vector of an artifact is simply set to be the average of all word vectors contained in the artifact ("AveVect" in Table I). We also summarize the search space for the other hyper-parameters of the tracing network in Table I.
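The grid search then amounts to enumerating the cross product of the Table I options; the sketch below uses a placeholder for the actual training and development-set evaluation:

```python
from itertools import product

search_space = {
    "rnn_unit": ["GRU", "LSTM", "BI-GRU", "BI-LSTM"],
    "layers": [1, 2],
    "dims": [(30, 10), (60, 20)],   # (RNN hidden, Integration hidden)
    "lr": [1e-3, 1e-2, 1e-1],
    "grad_clip": [10, 100],
    "lam": [1e-4, 1e-3],
}

def dev_loss(config):
    """Placeholder: train the network for 60 epochs, return dev-set loss."""
    raise NotImplementedError

configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
# best = min(configs, key=dev_loss)   # pick the configuration with the lowest dev loss
```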
C. Comparison of Tracing Methods

In practical requirements tracing settings, a tracing method returns a list of candidate links between a source artifact, serving the role of user query, and a set of target artifacts. An effective algorithm would return all valid links close to the top of the list. The effectiveness of a tracing algorithm is therefore often measured using Mean Average Precision (MAP). To calculate MAP, we first calculate the Average Precision (AP) for each individual query as:

AP = [ Σ_{i=1}^{|Retrieved|} Precision(i) × relevant(i) ] / |RelevantLinks|    (12)

where |Retrieved| is the number of retrieved links, i is the rank in the sequence of retrieved candidate links, relevant(i) is a binary function assigned 1 if the link is valid and 0 otherwise, and Precision(i) is the precision computed after truncating the list immediately below i. Then, Mean Average Precision (MAP) is computed as the mean AP across all queries. In typical information retrieval settings, MAP is computed for the top N returned links; however, for traceability purposes we compute it when returning all valid links as specified in the trace matrix. This means that our version of MAP is computed for recall of 100%.
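Eq. (12) and the 100%-recall MAP translate directly into code; the sketch below assumes each query supplies its full ranked candidate list and its set of valid links:

```python
def average_precision(ranked_candidates, valid_links):
    """Eq. (12): AP for one source-artifact query over the full ranked list."""
    hits, ap = 0, 0.0
    for i, candidate in enumerate(ranked_candidates, start=1):
        if candidate in valid_links:
            hits += 1
            ap += hits / i                # Precision(i) counted only at valid links
    return ap / len(valid_links)          # all valid links retrieved: recall = 100%

def mean_average_precision(queries):
    """queries: iterable of (ranked_candidates, valid_links) pairs."""
    aps = [average_precision(r, v) for r, v in queries]
    return sum(aps) / len(aps)
```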
We computed MAP using the test dataset only, and compared the performance of our tracing network with other popular tracing methods, i.e. the Vector Space Model (VSM) and Latent Semantic Indexing (LSI). To make a fair comparison, we also optimized the configurations for the VSM and LSI methods using a Genetic Algorithm to search through an extensive configuration space of preprocessors and parameters [43]. Finally, we configured VSM to use a local Inverse Document Frequency (IDF) weighting scheme when calculating the cosine similarity [34]. LSI was reduced to 75% dimensions. For both VSM and LSI we preprocessed the text to remove …

… recall curve. The graph depicts recall and precision scores at different similarity or probability values. The Precision-Recall Curve thus shows trade-offs between precision and recall and provides insights into where each method performs best – for example, whether a technique improves precision at higher or lower levels of recall [11]. A curve that is farther away from the origin indicates better performance.

V. Results and Discussion

In this section, we report (1) the best configurations found in our network configuration search space, (2) the performance of the tracing network with the best configuration compared against VSM and LSI, and (3) the performance of the tracing network when trained with a larger training set of data.

A. What is the best configuration for the tracing network?

This experiment aims to address the first research question (RQ1). When optimizing the network configuration of the tracing network, we first selected the best configuration for each RNN unit type; Table II summarizes these results. We found that the best configurations for all four RNN unit types were very similar: a one-layer RNN model with 30 hidden dimensions and an Integration layer of 10 hidden dimensions, a learning rate of 1e-02, a gradient clip value of 10, and a λ of 1e-04. Performance varies for the different RNN unit types.

TABLE II: Best Configuration for Each RNN Unit Type

RNN Unit   Dev. Loss   Word Emb.   L   Dr   Ds   lr    gc    λ
BI-GRU     .1045       PTC         1   30   10   .01   10    .0001
GRU        .1301       PTC         1   30   10   .01   10    .0001
BI-LSTM    .1434       PTC         1   30   10   .01   100   .0001
LSTM       .2041       PTC+Wiki    1   30   10   .01   10    .0001

L – number of layers in the RNN model, Dr – hidden dimension in RNN unit, Ds – hidden dimension in Integration layer, lr – initial learning rate, gc – gradient clipping value, λ – regularization strength

Figure 4 illustrates the learning curves on the training dataset for the four configurations. All four RNN unit types outperformed the Average Vector method. This supports our hypothesis that word order plays an important role when comparing the semantics of sentences. Both GRU and BI-GRU achieved faster convergence and more desirable (smaller) loss than LSTM and BI-LSTM. Although quite similar, the bidirectional models performed slightly better than the unidirectional models for both GRU and LSTM on the training set; the bidirectional models also achieved better results on the development dataset compared to their unidirectional counterparts. As a result, the overall best performance was achieved using BI-GRU configured as shown in Table II.

Fig. 4: Comparison of learning curves for RNN variants using their best configurations. GRU and BI-GRU (left) converged faster and achieved smaller loss than LSTM and BI-LSTM (right). They all outperformed the baseline.

We also found that in three of the four best configurations, the word embedding vectors were trained using the PTC corpus alone. We speculate that one reason the PTC-trained word vectors performed better than the PTC+Wiki-trained vectors is due to differences in the content of the two corpora. The PTC+Wiki corpus contains significantly more words that are used in diverse contexts because the majority of articles in the Wiki corpus are not related to the PTC domain. In the case of words that appear commonly in Wiki articles but convey specific meanings in the PTC domain (e.g. message, administrator, field, etc.), their context in more general articles
is likely to negatively affect the reasoning task on domain specific semantics when there is insufficient training data to disambiguate their usage. We also tested the tracing network performance using the 300-dimension word embedding trained by … found in Table II. These findings suggest that using only the domain corpus to train word vectors with a reasonable size is …

B. … retrieval algorithms?

Fig. 5: Precision-Recall Curve on test set – 45% total data

(a) GRU output at one time step
… update gate for this dimension was also small for most of the input words until the words datapoint, transfer and to came. Those spikes indicate moments when the temporary memory was integrated into the persisting memory. As such, we can speculate that this semantic vector dimension functions to accumulate information from specific keywords across the entire sentence. Conversely, for the 12th dimension, the reset gate was constantly high, indicating that the information stored in the temporary memory was based on both the previous output and the current input. Therefore, the actual output is more sensitive to local context than to a single keyword. This is confirmed by the fluctuating shape of the actual output shown in the figure. For example, the output value remained in the same range until the topic changed from "datapoint indication" to "message transfer". We believe that this versatile behavior of the gating mechanism might enable GRU to encode complex semantic information from a sentence into a vector to support trace link generation and other challenging NLP tasks.

From Figure 5, we also notice that the tracing network hits a glass ceiling for improving precision above 0.27. We consider this to be caused by its inability to rule out some false positive links that contain valid associations. For example, the tracing network assigns a 97.27% probability of a valid link between the artifact "The BOS administrative toolset shall allow an authorized administrator to view internal errors generated by the BOS" and the artifact "The MessagesEvents panel provides the functionality to view message and event logs. The panel provides searching and filtering capabilities where the user can search by a number of parameters depending on the type of data the user wants to view". There are direct associations between these artifacts: the MessagesEvents panel is part of the BOS administrative toolset for viewing message and event logs, an administrator is a system user, and internal errors generated by the BOS are events. But this is not a valid link because the MessagesEvents panel only displays messages and events related to external Railroad Systems rather than to internal events. It is likely that the tracing network fails to exclude this link because it has not been exposed to sufficient similar negative examples in the training data. As we described in Section IV-A, every positive example is used while negative examples (i.e. non-links) are randomly selected during each epoch. We plan to explore more adequate methods for handling the unbalanced data problems caused by the characteristics of tracing data in our future work.

C. How does the tracing network react to more training data?

The number of trace links tends to increase as a software project evolves. To explore the potential impact of folding them into the training set, we increased the training dataset to 80%. We randomly selected part of the test data and moved it into the training set to reach 80%, while retaining the remaining data in the testing set. Using the same configuration as described in Section IV-B, we then retrained the tracing network. Because the size of the test set decreased to 10%, we could not make direct comparisons to our previous results. Instead we used both of the trained tracing networks (i.e. trained with 45% and 80% of the data respectively) to generate trace links against the same small test set (sized at 10%). To reduce the effect of random data selection, we repeated this process five times and report the average results.

With the larger training set, the MAP was .834 compared to .803 for the smaller training set. Results are depicted in the Precision-Recall Curve in Figure 7. With increased training data, the network can better differentiate links and non-links, and therefore improve both precision and recall. Improvements were observed especially at low levels of recall.

Fig. 7: Precision-Recall Curve on test set – 10% total data.

The performance of the tracing network trained using 80% of the data was again compared against VSM and LSI for this larger training set. A Friedman test identified a statistically significant difference among the APs associated with the three methods on this new dataset division (χ²(2) = 141.11, p < .001). Using pairwise Wilcoxon signed ranks tests with Bonferroni p-value adjustments, we found that our tracing network performed significantly better (MAP = .834) than VSM (MAP = .625; p < .001) and LSI (MAP = .637; p < .001). As such, we address RQ2 and conclude that in general our tracing network improved trace link accuracy in comparison to both VSM and LSI, and that improvements were more marked as the size of the training set increased. We expect additional improvements from reconfiguring the tracing network for use with a larger training set [8]. However, we also observed that when using the original training set (i.e. 45% of the data), our tracing network only outperformed LSI at higher levels of recall, as shown in Figure 5 and Figure 7. In future work, we plan to explore the trade-offs between these two methods for specific data features, and to further improve the performance of the tracing network.
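The reported tests map directly onto SciPy; in the sketch below, ap_rnn, ap_vsm, and ap_lsi are placeholders for the per-query AP lists of the three methods:

```python
from scipy.stats import friedmanchisquare, wilcoxon

# ap_rnn, ap_vsm, ap_lsi: one AP value per source-artifact query and method.
chi2, p = friedmanchisquare(ap_rnn, ap_vsm, ap_lsi)

# Pairwise follow-up tests with a Bonferroni adjustment for 3 comparisons.
pairs = {"RNN-VSM": (ap_rnn, ap_vsm), "RNN-LSI": (ap_rnn, ap_lsi), "VSM-LSI": (ap_vsm, ap_lsi)}
adjusted = {name: min(wilcoxon(a, b).pvalue * len(pairs), 1.0) for name, (a, b) in pairs.items()}
```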
VI. Related Work

In this section we focus on prior work that has integrated ontology, semantics, or NLP into the tracing process. Researchers have attempted to improve bag-of-words approaches such as the Vector Space Model (VSM) [34] by integrating matching terms, project glossaries, and other forms of thesauri [34].
Basic enhancements have included user feedback techniques such as Rocchio [34] or Direct Query Manipulation (DQM) [61] to increase or decrease term weights. However, these approaches fail to leverage semantic information.

Other techniques identify terms for bridging the term mismatch between source and target artifacts. Dietrich et al. utilized validated trace links to identify frequent item sets of terms occurring across pairs of source and target artifacts, and then used these to augment the text in the trace query [19]. Gibiec et al. [24] approached trace query augmentation by acquiring related documents from the Internet, and then extracting domain related terms. Researchers have also used phrase detection and chunking to search for requirements impacted by change requests [2] or to improve the trace retrieval process [70]. None of these techniques attempted to understand the semantics of the artifacts.

Researchers have also explored the use of knowledge bases to create and utilize semantically aware associations in the trace creation process – where a knowledge base includes basic domain terms and sentences that describe the relationships between those terms [36], [27], [62]. Data is typically represented as an ontology in which relationships are represented using AND, OR, implication, and negation operators [36]. Traceability researchers have proposed the idea of using an ontology to connect source and target artifacts [31], [4]. Approaches have been proposed for weighting the evidence for a trace link according to the distance between concepts in the ontology [42]. Unfortunately, building domain-specific ontologies is time consuming, and ontologies are generally not available for technical software engineering domains.

Finally, while researchers have proposed techniques that more closely mimic the way human analysts reason about trace links and perform tracing tasks [28], [46], there is very limited work in this area. Our prior work with DoCIT, described in Section I, is one exception [28]. DoCIT utilizes both an ontology and heuristics to reason over concepts in the domain in order to deliver accurate trace links. However, as previously explained, DoCIT requires non-trivial setup costs to build a customized ontology and heuristics for each domain, and is sensitive to flaws in syntactic parsing. In contrast, the RNN approach described in this paper requires only a corpus of domain documents and a training set of validated trace links.

On the other hand, deep learning has been successfully applied to many software engineering tasks. For example, Lam et al. combined deep neural networks with information retrieval techniques to identify buggy files in bug reports [39]. Wang et al. utilized a deep belief network to extract semantic features from source code for the purpose of defect prediction [69]. Raychev et al. adopted RNN and N-gram models to build the language model for the task of synthesizing code completions [55]. In our literature review, we were not able to find prior work that applied deep learning techniques to traceability tasks.

VII. Threats to Validity

Two primary threats to validity potentially impact our work. First, due to the challenge of obtaining large industrial datasets including artifacts and trace links, and the time needed to experiment with different algorithms for learning word embeddings and generating trace links, our work focused on the single domain of Positive Train Control. As a result, we cannot claim generalizability. However, the PTC dataset included text taken from external regulations, and was written by multiple requirements engineers, systems engineers, and developers. The threat to validity arises primarily from the possibility that characteristics of our specific dataset may have impacted the results of our experiment. For example, the size of the overall dataset, the characteristics of the vocabulary used, and/or the nature of each individual artifact may make our approach more or less effective. In the next phase of our work we will evaluate our approach on additional datasets.

Second, we cannot guarantee that the trace matrix used for evaluation is 100% correct. However, it was provided by our industrial collaborators and used throughout their project to demonstrate coverage of regulatory codes. The metrics we used (i.e. MAP, Recall, and Precision) are all accepted research standards for evaluating trace results [34]. To avoid comparison against a weak baseline, we report comparisons against two standard baselines, VSM and LSI, and configured them using a Genetic Algorithm.

VIII. Conclusions

In this paper, we have proposed a neural network architecture that utilizes word embedding and RNN techniques to automatically generate trace links. The Bidirectional Gated Recurrent Unit effectively constructed semantic associations between artifacts, and delivered significantly higher MAP scores than either VSM or LSI when evaluated on our large industrial dataset. It also notably increased both precision and recall. Given an initial training set of trace links, our tracing network is fully automated and highly scalable. In future work, we will focus on improving the precision of the tracing network by identifying and including more representative negative examples in the training set.

The tracing network is currently trained to process natural language text. In future work, we will investigate techniques for applying it to other types of artifacts such as source code or formatted data. Finally, given the difficulty and limitations of acquiring large corpora of data, we will investigate hybrid approaches that combine human knowledge with the neural network. In summary, the findings we have presented in this paper have demonstrated that deep learning techniques can be effectively applied to the tracing process. We see this as a non-trivial advance in our goal of automating the creation of accurate trace links in industrial-strength datasets.

IX. Acknowledgments

The work in this paper was partially funded by the US National Science Foundation Grant CCF-1319680.
References

[1] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Trans. Softw. Eng., 28(10):970–983, 2002.
[2] C. Arora, M. Sabetzadeh, L. C. Briand, F. Zimmer, and R. Gnaga. Automatic checking of conformance to requirement boilerplates via text chunking: An industrial case study. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Baltimore, Maryland, USA, October 10-11, 2013, pages 35–44, 2013.
[3] C. Arora, M. Sabetzadeh, A. Goknil, L. C. Briand, and F. Zimmer. Change impact analysis for natural language requirements: An NLP approach. In 23rd IEEE International Requirements Engineering Conference, RE 2015, Ottawa, ON, Canada, August 24-28, 2015, pages 6–15, 2015.
[4] N. Assawamekin, T. Sunetnanta, and C. Pluempitiwiriyawej. Ontology-based multiperspective requirements traceability framework. Knowl. Inf. Syst., 25(3):493–522, 2010.
[5] H. U. Asuncion, A. Asuncion, and R. N. Taylor. Software traceability with topic modeling. In 32nd ACM/IEEE International Conference on Software Engineering (ICSE), pages 95–104, 2010.
[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016.
[8] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer, 2012.
[9] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[10] T. J. Biggerstaff, B. G. Mitbander, and D. E. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72–82, 1994.
[11] M. Buckland and F. Gey. The relationship between recall and precision. Journal of the American Society for Information Science, 45(1):12, 1994.
[12] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[14] J. Cleland-Huang, O. Gotel, J. H. Hayes, P. Mäder, and A. Zisman. Software traceability: trends and future directions. In Proceedings of the Future of Software Engineering, FOSE 2014, Hyderabad, India, May 31 - June 7, 2014, pages 55–69, 2014.
[15] J. Cleland-Huang and J. Guo. Towards more intelligent trace retrieval algorithms. In (RAISE) Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, 2014.
[16] J. Cleland-Huang, M. Rahimi, and P. Mäder. Achieving lightweight trustworthy traceability. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16 - 22, 2014, pages 849–852, 2014.
[17] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Enhancing an artefact management system with traceability recovery features. In International Conference on Software Maintenance, pages 306–315, Washington, DC, USA, 2004. IEEE Computer Society.
[18] A. Dekhtyar, J. Huffman Hayes, S. K. Sundaram, E. A. Holbrook, and O. Dekhtyar. Technique integration for requirements assessment. In 15th IEEE International Requirements Engineering Conference (RE), pages 141–150, 2007.
[19] T. Dietrich, J. Cleland-Huang, and Y. Shin. Learning effective query transformations for enhanced requirements trace retrieval. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013, Silicon Valley, CA, USA, November 11-15, 2013, pages 586–591, 2013.
[20] B. Dit, M. Revelle, and D. Poshyvanyk. Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empirical Software Engineering, 18(2):277–309, 2013.
[21] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[22] Federal Aviation Authority (FAA). DO-178B: Software Considerations in Airborne Systems and Equipment Certification.
[23] Food and Drug Administration. Guidance for the Content of Premarket Submissions for Software Contained in Medical Devices, 2005.
[24] M. Gibiec, A. Czauderna, and J. Cleland-Huang. Towards mining replacement queries for hard-to-retrieve traces. In ASE '10: Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pages 245–254, New York, NY, USA, 2010. ACM.
[25] O. Gotel, J. Cleland-Huang, J. H. Hayes, A. Zisman, A. Egyed, P. Grünbacher, A. Dekhtyar, G. Antoniol, J. I. Maletic, and P. Mäder. Traceability fundamentals. In Software and Systems Traceability, pages 3–22. Springer, 2012.
[26] O. C. Z. Gotel and A. Finkelstein. An analysis of the requirements traceability problem. In Proceedings of the First IEEE International Conference on Requirements Engineering, ICRE '94, Colorado Springs, Colorado, USA, April 18-21, 1994, pages 94–101, 1994.
[27] T. Gruber. Ontology. In Encyclopedia of Database Systems, pages 1963–1965. Springer US, 2009.
[28] J. Guo, J. Cleland-Huang, and B. Berenbach. Foundations for an expert system in domain-specific traceability. In 21st IEEE International Requirements Engineering Conference, RE 2013, Rio de Janeiro-RJ, Brazil, July 15-19, 2013, pages 42–51, 2013.
[29] J. Guo, N. Monaikul, C. Plepel, and J. Cleland-Huang. Towards an intelligent domain-specific traceability solution. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, pages 755–766. ACM, 2014.
[30] J. Guo, M. Rahimi, J. Cleland-Huang, A. Rasin, J. H. Hayes, and M. Vierhauser. Cold-start software analytics. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR 2016, Austin, TX, USA, May 14-22, 2016, pages 142–153, 2016.
[31] S. Hayashi, T. Yoshikawa, and M. Saeki. Sentence-to-code traceability recovery with domain ontologies. In J. Han and T. D. Thu, editors, APSEC, pages 385–394. IEEE Computer Society, 2010.
[32] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 1994.
[33] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[34] J. Huffman Hayes, A. Dekhtyar, and S. K. Sundaram. Advancing candidate link generation for requirements tracing: The study of methods. IEEE Transactions on Software Engineering, 32(1):4–19, 2006.
[35] M. Iyyer, J. L. Boyd-Graber, L. M. B. Claudino, R. Socher, and H. Daumé III. A neural network for factoid question answering over paragraphs. In EMNLP, pages 633–644, 2014.
[36] P. Jackson. Introduction to Expert Systems (3rd ed.). Addison Wesley, 1998.
[37] A. Karpathy, J. Johnson, and F. Li. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078, 2015.
[38] J. C. Knight. Safety critical systems: challenges and directions. In 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA, pages 547–550, 2002.
[39] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. Combining deep learning with information retrieval to localize buggy files for bug reports. In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 476–481. IEEE, 2015.
[40] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[41] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.
[42] Y. Li and J. Cleland-Huang. Ontology-based trace retrieval. In Traceability in Emerging Forms of Software Engineering (TEFSE 2013), San Francisco, USA, May 2013.
[43] S. Lohar, S. Amornborvornwong, A. Zisman, and J. Cleland-Huang. Improving trace accuracy through data-driven configuration and composition of tracing features. In 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE), pages 378–388, 2013.
[44] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology (TOSEM), 16(4):13, 2007.
[45] P. Mäder, P. L. Jones, Y. Zhang, and J. Cleland-Huang. Strategic traceability for safety-critical projects. IEEE Software, 30(3):58–66, 2013.
[46] A. Mahmoud, N. Niu, and S. Xu. A semantic relatedness approach for traceability link recovery. In IEEE 20th International Conference on Program Comprehension, ICPC 2012, Passau, Germany, June 11-13, 2012, pages 183–192, 2012.
[47] A. Mahmoud and G. Williams. Detecting, classifying, and tracing non-functional software requirements. Requir. Eng., 21(3):357–381, 2016.
[48] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[49] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, page 3, 2010.
[50] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
[51] N. Niu and A. Mahmoud. Enhancing candidate link generation for requirements tracing: The cluster hypothesis revisited. In 2012 20th IEEE International Requirements Engineering Conference (RE), Chicago, IL, USA, September 24-28, 2012, pages 81–90, 2012.
[52] R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. CoRR, abs/1312.6026, 2013.
[53] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[54] N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11):e1000579, 2009.
[55] V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In ACM SIGPLAN Notices, volume 49, pages 419–428. ACM, 2014.
[56] P. Rempel, P. Mäder, T. Kuschke, and J. Cleland-Huang. Traceability gap analysis for assessing the conformance of software traceability to relevant guidelines. In Software Engineering & Management 2015, Multikonferenz der GI-Fachbereiche Softwaretechnik (SWT) und Wirtschaftsinformatik (WI), FA WI-MAW, 17. März - 20. März 2015, Dresden, Germany, pages 120–121, 2015.
[57] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kociský, and P. Blunsom. Reasoning about entailment with neural attention. CoRR, abs/1509.06664, 2015.
[58] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[59] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
[60] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[61] Y. Shin and J. Cleland-Huang. A comparative evaluation of two user feedback techniques for requirements trace retrieval. In SAC, pages 1069–1074, 2012.
[62] P. Shvaiko and J. Euzenat. Ontology matching: State of the art and future challenges. IEEE Trans. Knowl. Data Eng., 25(1):158–176, 2013.
[63] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
[64] G. Spanoudakis, A. Zisman, E. Pérez-Miñana, and P. Krause. Rule-based generation of requirements traceability relations. Journal of Systems and Software, 72(2):105–127, 2004.
[65] H. Sultanov, J. Huffman Hayes, and W.-K. Kong. Application of swarm techniques to requirements tracing. Requirements Engineering, 16(3):209–226, 2011.
[66] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
[67] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
[68] Federal Railroad Administration. PTC System Information. https://www.fra.dot.gov/Page/P0358. Accessed: 2016-08-26.
[69] S. Wang, T. Liu, and L. Tan. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, pages 297–308. ACM, 2016.
[70] X. Zou, R. Settimi, and J. Cleland-Huang. Improving automated requirements trace retrieval: a study of term-based enhancement methods. Empirical Software Engineering, 15(2):119–146, 2010.