Classifying Non-functional Requirements using RNN Variants for Quality Software Development
… and Estonia may be represented as Id224 and Id453 respectively, meaning that the 453rd entry of the long sparse vector is 1. Such a representation does not provide the system with any useful information about similarities between individual symbols. This shortcoming of classical machine learning can be overcome with deep learning based distributed vector representations, which attempt to learn multiple levels of representation to handle increased complexity.

In the literature, several machine learning approaches have been reported for text and NFR classification [2, 4, 5]. Only a few studies apply deep learning to NFR classification, for example representing text data in a low dimensional vector space using a Convolutional Neural Network (CNN) [9]. However, hardly any research direction exists for NFR classification with Recurrent Neural Networks (RNN), even though the RNN is one of the most popular architectures in NLP because its recurrent structure is suitable for processing variable length text [10]. An RNN utilizes distributed representations of words by composing each text into vectors that form a matrix.

This paper presents an efficient NFR classification technique using RNN variants to categorize requirements into pre-defined labels. First, the textual requirements are processed to eliminate unnecessary text, symbols, etc. from the dataset. The processed documents are then vectorized using the word2vec algorithm and fed into the neural network model. The RNN variants Long Short Term Memory (LSTM) [8] and Gated Recurrent Unit (GRU) [11] are used for NFR model construction. The implemented classifiers have been applied to categorize NFR into pre-defined labels. Finally, the performance of the proposed classifiers has been evaluated on the PROMISE dataset [12] using several statistical analyses.

The validation and testing results of the proposed models indicate that LSTM scores highest, above CNN and GRU. The reported average precision, recall, and f1-score for LSTM validation are 0.973, 0.967, and 0.966 respectively, which is high and indicates that the model was trained well. The boundary scores of the model are 0.95 and 1.00, which indicate that the model's minimum classification validity is 95% and its maximum is 100%. The reported average accuracy for LSTM is always higher than for GRU, where LSTM's lowest and highest accuracy are 0.60 and 0.80 respectively, meaning that a minimum of 60% and a maximum of 80% of unseen requirements are correctly classified by the model. The reported average precision, recall, f1-score, and accuracy are 71.7%, 71.5%, 70%, and 71.5% respectively. Moreover, LSTM's standard deviation is always lower than GRU's. The low deviation indicates the model's stability: the deviations are 0.081, 0.062, 0.057, and 0.062 for precision, recall, f1-score, and accuracy respectively. The main contribution of this paper is to investigate how well deep learning methods perform for multi-class NFR classification.

The rest of this paper is organized as follows. Section 2 discusses the existing work related to this research. Sections 3 and 4 illustrate the proposed method and the result analysis respectively. Finally, Section 5 concludes this paper with future research directions.

2 RELATED WORK

Classifying non-functional requirements by analysing textual natural language is an emerging field in software engineering research for improving software quality. NFR may be classified using information retrieval based [2], linguistic knowledge based [4], CNN based [9], and other approaches, which are briefly discussed in this section.

An Information Retrieval (IR) technique plays a vital part in detecting and classifying NFR, as presented in [2]. In the IR approach, a set of indicator terms was identified for each NFR category to classify user requirements, and a probabilistic weight for each potential indicator term was calculated from the requirements. This strategy requires less effort for NFR classification compared to semi-automated classification methods. However, the technique still requires the correctness of candidate NFR to be evaluated manually.

Requirements were classified into functional and non-functional using linguistic knowledge in [4]. The method works at sentence level, as the characteristics of functional and non-functional requirements remain within the scope of sentences. The results were cross-checked ten times and showed that their method significantly improved classification performance. However, more training and testing data could be introduced to justify the performance.

Using a requirements frame model, a method was proposed to detect redundancy, ambiguity, inconsistency, and incompleteness of NFR in SRS documents [6]. The research focused on requirements related to response time and usability. Here, the specific requirements were retrieved using respective keywords, and multiline statements were converted to single sentences manually. However, this manual statement conversion is time consuming and error-prone.

Combining different feature extraction and machine learning techniques, an NFR classification method was presented in [5] to find the best pairing. BoW and variations of TF-IDF were used as feature extraction strategies, and eight machine learning algorithms were applied as NFR classifiers. The empirical results showed that TF-IDF (character level) combined with an SGD SVM performed best at categorising NFR into pre-defined labels. However, NFR classification efficiency could be improved by incorporating RNN and word2vec.

Functional and non-functional requirements were classified using a supervised machine learning approach in [13], where both binary and multi-class classifiers were introduced. The research focused on accurately identifying four NFR categories: usability, security, operational, and performance. The problem of an imbalanced dataset was handled using under- and over-sampling techniques. For feature extraction, BoW strategies were used, where part-of-speech tags were identified as the most informative features. The experimentation was performed using a support vector machine classifier.

A machine learning technique was presented to classify app user reviews in [7], where BoW, TF-IDF, CHI2, and AUR-BoW techniques were applied with NB, J48, and Bagging algorithms. The research focused on four types of NFR: reliability, usability, portability, and performance. The paper concluded that imbalanced and small datasets affect classification results badly in a machine learning setting. However, no direction was given for handling these dataset problems.

Winkler and Vogelsang [1] presented a convolutional neural network based approach to automatically classify content elements of a natural language requirements specification. Their approach can be used to classify content elements in documents. To train the neural network, the model used 10K content elements extracted from 89 industrial requirements specifications and reported a precision of 0.73 and a recall of 0.89. However, requirements are interconnected and one task happens after another, so an RNN may perform better than a CNN.
3 METHODOLOGY

Multi-class NFR classification using RNN has been proposed in this framework to facilitate quality software development. This automatic NFR classification method uses the word2vec algorithm to extract features from the requirement dataset, and two RNN variants, LSTM and GRU, are applied as NFR classifiers. The textual software requirements are processed to eliminate unnecessary text, symbols, etc. and are vectorized using the word2vec algorithm for RNN modeling. The whole working process of this framework is divided into the following four steps.

• Dataset Pre-processing
• Word Vectorization
• RNN Model Construction
• Model Training and Testing

The details of the above steps are elaborated in the following sub-sections.

Figure 1: Overview of the Proposed Method

3.1 Dataset Pre-processing

The first step of this framework is cleaning and pre-processing the software requirement dataset. This involves removing special characters and stop words, case-folding, lemmatization, and tokenization. Special characters and symbols are usually non-alphanumeric characters, which add extra noise to the experimental dataset. To remove this noisy data, special characters have been removed using regular expressions. Words appearing in upper or lower case have the same meaning in English; therefore, case-folding is applied so that a single case is used throughout. Stop words refer to the most common words, such as "a", "an", and "the", which do not influence the semantics of a software requirement. The widely used natural language processing stop word corpus of NLTK has been used in this implementation. Lemmatization has been applied to obtain the base form of each word; it also groups inflected words together. Numbers are also converted to words in order to enrich the dataset. Finally, tokenization splits longer text into smaller pieces or tokens. The requirements have been tokenized into sentences, which are then split into distinct words.
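As an illustration, a minimal sketch of this pre-processing step is given below, assuming NLTK (the corpus named in the text); the function name and the sample requirement are illustrative, and the number-to-word conversion mentioned above is omitted.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_requirement(text):
    # Case-folding: treat upper- and lower-case variants as the same word.
    text = text.lower()
    # Remove special (non-alphanumeric) characters with a regular expression.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenize the cleaned sentence into distinct words.
    tokens = word_tokenize(text)
    # Drop stop words and lemmatize the remaining tokens to their base form.
    return [LEMMATIZER.lemmatize(tok) for tok in tokens if tok not in STOP_WORDS]

# Example: preprocess_requirement("The system shall respond within 2 seconds.")
# returns ['system', 'shall', 'respond', 'within', '2', 'second']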
3.2 Word Vectorization

A deep learning model cannot process natural language sentences directly; therefore, a word vectorization technique is required. This allows the model to recognize patterns even if the words occurring in a pattern vary slightly. The requirement text has been converted to word vectors using a word2vec model with the skip-gram feature extraction method. Word2vec maps each word to a real-valued vector. In the word2vec technique, the vector distance between two given words is small if the two words are used in similar contexts, and large otherwise. The sentence transformation in word2vec can be written as m ∈ R^(n×l), where m, n, and l represent the sentence matrix, the embedding vector size, and the length of the sentence respectively.

This strategy uses the Google pretrained word embedding model, which is trained on over 100 billion words. At tokenization time, each unique word is turned into a unique number, which is used as the respective word index. The vector of a particular word is extracted from the pretrained word embedding and kept in an embedding matrix, which is initialized with zeros. These zero elements are then replaced by the embedding vectors of the corresponding words. Thus the embedding matrix contains all the word embedding vectors, plus zero vectors for the words that are not found in the pretrained word embeddings.

The unique word vectors were fetched into an embedding matrix whose size is determined by the vocabulary size and the embedding dimension. The features of the documents are extracted as embedded word vectors with a dimension of 300 for each word in each sentence. The embedding vectors were imported as pre-trained word vectors, and transfer learning is used to learn the word embedding. The embedding matrix is fed as weights into the embedding layer of the neural network.
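To make this step concrete, the sketch below builds such an embedding matrix from the pretrained Google News word2vec vectors using gensim and the Keras tokenizer; the use of gensim, the file name, and the placeholder corpus are assumptions for illustration and are not stated in the paper.

import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer

EMBEDDING_DIM = 300  # dimension of the pretrained word2vec vectors

# Pre-processed requirement sentences (illustrative placeholder corpus).
requirement_texts = ["system shall respond within 2 second",
                     "product shall be available 24 hour per day"]

# Load the pretrained word embedding model (path is illustrative).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Assign each unique word a unique number, used as its word index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(requirement_texts)
vocab_size = len(tokenizer.word_index) + 1

# Embedding matrix initialized with zeros; row i holds the vector of word index i.
# Words missing from the pretrained model keep a zero vector.
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, idx in tokenizer.word_index.items():
    if word in w2v:
        embedding_matrix[idx] = w2v[word]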
3.3 RNN Model Construction

Two RNN variant models have been implemented in this NFR classification framework on top of the word2vec vectorization. The RNN variants LSTM and GRU are prominent in the natural language domain, especially for text classification [11], and are therefore used in the NFR classification framework. The sequential RNN architecture mainly consists of three layers: an input layer, a hidden layer, and an output layer.

3.3.1 Input Layer: The first layer prepares the word embeddings according to the model's input requirements using the sequence length and the embedding dimension. The parameters of the embedding layer are not trainable, to ensure proper transfer learning. Transfer learning is used in this first layer to improve the contextual representation of the words while training the deep learning model, and the vector weights remain unchanged.

3.3.2 Hidden Layer: The LSTM and GRU algorithms are used in the hidden layer, which takes each word embedding in a time step and processes it to produce an output. Dropout and recurrent dropout have been applied here to reduce over-fitting of the model. The algorithms learn the context of the sentence and pass the output of the last time step to the activation function of the dense layer. The leaky ReLU activation function, Equation (1), has been used in this case, where α is a small constant.

LeakyReLU(z) = z if z > 0, and αz if z ≤ 0    (1)

3.3.3 Output Layer: This is the last layer of the model, where the final output for each requirement is produced. This dense layer uses the softmax activation function, Equation (2), to predict the target class. The softmax function calculates the probability distribution of the requirement over the pre-defined categories. It produces one probability per class, and the class with the maximum probability is used as the predicted decision class. The produced probability values range between 0 and 1, and the sum of all probabilities equals 1. Equation (2) computes the exponential (e-power) of the given input and the sum of the exponentials of the entire input set; the ratio of the exponential of the input value to the sum of exponentials gives the output of the softmax function.

p(y | x) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)    (2)

The model was built using the sequential API of Keras. This API stacks the layers one on top of another and uses one layer's output as the input of the next layer. The dimensions of the embedding, LSTM/GRU, and leaky ReLU layers were 300, while the softmax dense layer had dimension 10, equal to the total number of targeted NFR categories.
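The three layers described above could be assembled with the Keras sequential API roughly as sketched below, reusing vocab_size and embedding_matrix from the word vectorization sketch; the layer widths (300) and the 10 output classes follow the dimensions stated in the text, while the leaky ReLU slope α and the dropout rate of 0.5 (reported later in Section 4.2) are shown here only for illustration.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, LeakyReLU

NUM_CLASSES = 10  # total number of targeted NFR categories

model = Sequential([
    # Input layer: the pretrained word2vec matrix is used as frozen weights
    # (trainable=False) so that the transferred vectors remain unchanged.
    Embedding(input_dim=vocab_size, output_dim=300,
              weights=[embedding_matrix], trainable=False),
    # Hidden layer: LSTM (GRU can be substituted) processes one word embedding
    # per time step; dropout and recurrent dropout reduce over-fitting.
    LSTM(300, dropout=0.5, recurrent_dropout=0.5),
    # Dense layer with the leaky ReLU activation of Equation (1).
    Dense(300),
    LeakyReLU(alpha=0.01),
    # Output layer: softmax over the pre-defined NFR categories, Equation (2).
    Dense(NUM_CLASSES, activation="softmax"),
])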
3.4 Model Training and Testing

The implemented LSTM and GRU models have been trained to evaluate how effectively the proposed framework can identify NFR and assign them to the given categories. The training process makes a prediction based on the current model, calculates how incorrect the prediction is, and updates the network parameters to minimize this error and improve the model. This process is repeated until the model has converged and can no longer learn. The following parameters have been used to tune this process.

Metric: This is used to measure the performance of the proposed model; accuracy has been used as the metric in this framework.

Loss function: This function calculates a loss value for the model during the training process, which the model attempts to minimize by tuning the network weights. This framework uses a loss function suitable for multi-class categorical classification.

Optimizer: This function decides how the network weights will be updated based on the output of the loss function.

The actual training happens using the fit method. In each training iteration, one batch of samples is processed and the weights are updated once. The training process completes an epoch when the model has seen the entire training dataset. At the end of each epoch, the validation dataset is used to evaluate the model's learning accuracy. This process is repeated for a predetermined number of epochs.
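In Keras terms, these three settings correspond to the arguments of the compile step, as in the following sketch; the concrete optimizer and its hyper-parameters used in the experiments are given in Section 4.2.

# Sketch: wiring the metric, loss function and optimizer together.
# `model` is the sequential model sketched in Section 3.3.
model.compile(
    optimizer="adam",                  # optimizer: decides how the weights are updated
    loss="categorical_crossentropy",   # loss suitable for multi-class categorical labels
    metrics=["accuracy"],              # metric used to measure model performance
)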
4 EXPERIMENT AND RESULT ANALYSIS

The dataset and the experiments with the RNN variant algorithms are discussed in this section, together with the result analysis and an efficiency comparison of different techniques for automatic NFR classification.

4.1 Dataset

The OpenScience tera-PROMISE software requirement dataset [12] has been used for the experimental analysis; it contains both functional and non-functional (11 categories) requirements. The dataset is small, consisting of 625 requirement sentences, of which 255 are functional and 370 are non-functional requirements. The NFR are labeled with eleven categories: availability, legal, look and feel, maintainability, operational, performance, scalability, security, usability, fault tolerance, and portability. However, the dataset contains a majority of functional requirements, which indicates that the class distribution is imbalanced. Therefore, sampling strategies for dealing with data imbalance have been applied, as a balanced distribution can improve the classification accuracy. In addition, one category containing a single instance has been removed to reduce class distribution bias in the dataset.

4.2 Experiment

In this framework, the word embedding matrix is used as the weights of the embedding layer. The labels were converted to integers using a dictionary holding the unique classes, so that each number represents a specific class. The training and validation sentences were tokenized to find all the unique words. After tokenizing the sentences, zero padding was applied according to the length of the longest sentence in the dataset. The training, validation, and testing requirement sentences were converted to the same length, the length of the longest sentence in the dataset, to ensure that equal-length sentences are fed as input to the model. Then the class labels were encoded as integers in the range from zero to the total number of classes minus one.
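A small sketch of the padding and label-encoding step described above, using the Keras preprocessing utilities; the tokenizer and requirement_texts come from the word vectorization sketch, and the list of labels is an illustrative placeholder.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

labels = ["performance", "availability"]  # illustrative class labels, one per requirement

# Convert each requirement into its sequence of word indices.
sequences = tokenizer.texts_to_sequences(requirement_texts)

# Zero-pad every sequence to the length of the longest sentence in the dataset.
max_len = max(len(seq) for seq in sequences)
X = pad_sequences(sequences, maxlen=max_len, padding="post")

# Dictionary of unique classes: encode labels as integers 0 .. num_classes-1,
# then one-hot encode them for the softmax output layer.
class_to_int = {label: idx for idx, label in enumerate(sorted(set(labels)))}
y_int = np.array([class_to_int[label] for label in labels])
y = to_categorical(y_int, num_classes=len(class_to_int))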
The testing dataset was selected manually to ensure that all class instances are present. The training set is split using a k-fold cross validation process. Before training started, class weights were calculated to give the model more attention to the classes with a low number of instances. The model was fitted using the training and validation data, where the validation data size was equal to 1/k of the k-fold cross validation. Class weights were used to produce a model that gives more attention to the underrepresented classes. At training time, the model takes the labels as one-hot encoded vectors, so that the output softmax function of Equation (2) produces a separate probability for each class. The model was trained for several epochs with a fixed batch size, and the output was kept in a history object. The dropout probability was set to 0.5 in order to avoid overfitting during training. The learning algorithm selected for this problem was Adam [14] with learning rate λ = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e−08, and the loss function was categorical cross-entropy. The number of epochs to train the LSTM and GRU was fixed at 200. The trained model had a total of 1,144,810 parameters, of which 814,510 were trainable and 330,300 non-trainable.
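Putting the settings above together, the training step might be configured as sketched below; the Adam hyper-parameters, the 200 epochs, and the class weights follow the text, while the train/validation split variables and the batch size are illustrative assumptions.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.optimizers import Adam

# Class weights give the under-represented classes more attention during training.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y_int), y=y_int)
class_weight_map = dict(enumerate(weights))

# Adam optimizer with the hyper-parameters reported above; categorical cross-entropy loss.
model.compile(optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Train for 200 epochs; the validation fold is evaluated at the end of each epoch
# and the metrics are kept in the returned history object.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=200, batch_size=32,   # batch size is illustrative
                    class_weight=class_weight_map)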
… and 0.73 respectively, which is covered by the CNN experiment. On the other hand, LSTM's lowest and highest scores in Table 1 are 0.95 and 1.00, indicating that the minimum classification validity is 95% and the maximum is 100%. The reported average precision, recall, and f1-score for LSTM validation are 0.973, 0.967, and 0.966 respectively. Analysis of the presented results shows that LSTM outperformed the other models. In addition, it exhibits very good stability, reflected in the low standard deviations of 0.015, 0.017, and 0.015 for precision, recall, and f1-score respectively. The validation performance was recalculated after each epoch and the weights were updated accordingly.
The testing results of the two RNN variants, GRU and LSTM, are reported in Table 2 and depicted in Figure 3. The testing comparison in Table 2 has been carried out on GRU and LSTM only, because the accuracy of the CNN approach [9] was not reported. The precision, recall, f1-score, and accuracy were calculated from the true and predicted labels. As stated in Table 2, the average scores are always higher for LSTM than for GRU. LSTM's lowest and highest accuracy are 0.60 and 0.80 respectively, which means the model can correctly classify a minimum of 60% and a maximum of 80% of unseen requirements. The reported average precision, recall, f1-score, and accuracy are 71.7%, 71.5%, 70%, and 71.5% respectively. On the other hand, LSTM's standard deviation is always lower than GRU's: 0.081, 0.062, 0.057, and 0.062 for precision, recall, f1-score, and accuracy respectively. The low deviations indicate the LSTM model's stability.
Figure 3: RNN Test Results of GRU and LSTM

The test results of the RNN variants are graphically represented in Figure 3, where the Y-axis and X-axis represent the testing values and the techniques respectively. The box-and-whisker plots for precision, recall, f1-score, and accuracy have been grouped by the RNN variants LSTM and GRU. According to the plot, an outlier is found for GRU, while the LSTM results are tightly packed. The lower, middle, and upper quartile starting ranges are always better for LSTM compared to GRU.

This experiment covers three deep learning methods, CNN [9], LSTM, and GRU, for finding the optimal NFR classifier. After analyzing the entire experiment, LSTM performs best for categorizing NFR to ensure quality software development.

4.5 Threats to Validity

The classification results have been obtained for 10 NFR categories. However, during this work it became evident that there is a lack of publicly available large datasets for applying deep learning to software requirements classification.

5 CONCLUSION

An effective requirement identification framework has been proposed using RNN variants in order to classify NFR into pre-defined categories. This automatic NFR classification method uses the word2vec algorithm and RNN deep learning models for feature extraction and text classification respectively. The requirement text has been processed to eliminate unnecessary text, symbols, etc. from the dataset. The processed documents were vectorized using the word2vec algorithm and fed to the RNN. The standard PROMISE dataset has been used to evaluate the proposed model. The reported results showed that the model's minimum classification validity is 95% and its maximum is 100%, while the average testing precision, recall, f1-score, and accuracy are 71.7%, 71.5%, 70%, and 71.5% respectively. The LSTM model also shows very good stability, reflected in the low standard deviations of the test scores (<0.082) for precision, recall, and f-measure. According to the experimental results, it can be concluded that deep learning algorithms, especially LSTM, are useful for software non-functional requirements classification. Incorporating larger datasets, additional feature extraction techniques such as GloVe and fastText, and ensemble classifiers are the future research directions of this framework.

REFERENCES
[1] Jonas Winkler and Andreas Vogelsang. Automatic classification of requirements based on convolutional neural networks. In 2016 IEEE 24th International Requirements Engineering Conference Workshops (REW), pages 39–45. IEEE, 2016.
[2] J. Cleland-Huang, R. Settimi, X. Zou, and P. Solc. The detection and classification of non-functional requirements with application to early aspects. In 14th Int. Requirements Engineering Conference (RE'06), pages 39–48. IEEE, 2006.
[3] Mirza Rehenuma Tabassum, Md Saeed Siddik, Mohammad Shoyaib, and Shah Mostafa Khaled. Determining interdependency among non-functional requirements to reduce conflict. In 2014 International Conference on Informatics, Electronics & Vision (ICIEV), pages 1–6. IEEE, 2014.
[4] I. Hussain, L. Kosseim, and O. Ormandjieva. Using linguistic knowledge to classify non-functional requirements in SRS documents. In Int. Conf. on Application of Natural Language to Information Systems, pages 287–298. Springer, 2008.
[5] Md. Ariful Haque, Md. Abdur Rahman, and Md Saeed Siddik. Non functional requirements classification with machine learning: An empirical study. In International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pages 629–633. IEEE, 2019.
[6] Y. Matsumoto, S. Shirai, and A. Ohnishi. A method for verifying non-functional requirements. Procedia Computer Science, 112:157–166, 2017.
[7] Mengmeng Lu and Peng Liang. Automatic classification of non-functional requirements from augmented app user reviews. In 21st International Conference on Evaluation and Assessment in Software Engineering, pages 344–353. ACM, 2017.
[8] Jenq-Haur Wang, Ting-Wei Liu, Xiong Luo, and Long Wang. An LSTM approach to short text sentiment classification with word embeddings. In Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018), pages 214–223, 2018.
[9] R. Navarro-Almanza, R. Juárez-Ramírez, and Guillermo Licea. Towards supporting software engineering using deep learning: A case of software requirements classification. In 5th International Conference in Software Engineering Research and Innovation (CONISOFT), pages 116–120. IEEE, 2017.
[10] P. Zhou, Zhenyu Qi, S. Zheng, Jiaming Xu, H. Bao, and Bo Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In 26th Int. Conference on Computational Linguistics, pages 3485–3495, 2016.
[11] Jakub Nowak, Ahmet Taspinar, and Rafał Scherer. LSTM recurrent neural networks for short text and sentiment classification. In International Conference on Artificial Intelligence and Soft Computing, pages 553–562. Springer, 2017.
[12] OpenScience tera-PROMISE software requirement dataset (last accessed: July 01, 2019). URL https://terapromise.csc.ncsu.edu/!/#repo/view/head/requirements/nfr.
[13] Zijad Kurtanović and Walid Maalej. Automatically classifying functional and non-functional requirements using supervised machine learning. In 25th International Requirements Engineering Conference (RE), pages 490–495. IEEE, 2017.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.