COSMIC Functional Size Classification
I. INTRODUCTION
Requirement-level software size measurement and estimation at the early phase of software development helps organizations and agile practitioners to have a better project plan ahead of resources [1]. Source Lines of Code (SLOC) is the most widely used metric for determining the size of a software application [2]. SLOC relies on human judgment when counting code segments, is applicable only at a later stage of the software development life-cycle, and its source-instruction counts vary with programming languages, implementation methods, and the programmer's ability [3].
To alleviate the limitations of LOC, the software industry has developed a standardized, function-point-based functional size measurement approach called the Common Software Measurement International Consortium (COSMIC) method [4]. COSMIC measurement considers the size of each functional process independently. The size of a functional process is measured by counting four COSMIC elements, also called data movements. As shown in Fig. 1, these data movements are Entry (E), Exit (X), Read (R), and Write (W). One data movement is mapped to one COSMIC Function Point (CFP), so a functional process comprising three data movements has a size of 3 CFP. Consider the functional process "the user selects an exam from the list in the home screen and clicks on the button 'update' to update its data in the database. The system provides error/confirmation messages". Its total number of data movements is 6 CFP, as shown in Fig. 2.
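As a concrete illustration of the counting rule, the sketch below (our own illustration, not part of the COSMIC standard) sums data movements to obtain a CFP size. The movement breakdown chosen for the "update exam" process is one plausible reading of the example; the authoritative decomposition is the one in Fig. 2.

```python
# Minimal sketch: a functional process's COSMIC size is the count of its
# data movements, one CFP per movement of type Entry, Exit, Read, or Write.

DATA_MOVEMENT_TYPES = {"E", "X", "R", "W"}  # Entry, Exit, Read, Write

def cfp_size(movements):
    """Return the COSMIC size (in CFP) of one functional process."""
    for m in movements:
        if m not in DATA_MOVEMENT_TYPES:
            raise ValueError(f"unknown data movement type: {m}")
    return len(movements)

# Assumed breakdown of the 'update exam' process: the selection and the
# button click enter the system (E, E), the exam record is read and then
# written (R, W), and error/confirmation messages exit (X, X).
update_exam = ["E", "E", "R", "W", "X", "X"]
print(cfp_size(update_exam))  # -> 6
```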
Even though COSMIC provides an effective functional size measurement approach, modern software development environments such as agile-based software development methods are not well suited to such functional size measurement schemes [5]. To the best of our knowledge, little or no work has been done on investigating COSMIC functional size estimation in agile software environments. So, by exploiting the advantages of both sides (COSMIC and agile), effective COSMIC-based functional size measurement and estimation can be developed for agile
developments. The work described in this paper investigated a new pre-trained language model called RE-BERT, which was obtained by further pre-training the generic BERT pre-trained model on requirement engineering domain texts. Using RE-BERT and deep learning classifier models, we performed the COSMIC functional process classification downstream task. With RE-BERT as the feature extraction and embedding model [9], deep learning models such as LSTM [6], Bi-LSTM [7], and BERT Sequential Classifier [8] were used for conducting experiments on functional process classification.
The research questions intended to be answered after conducting the experiments are: -
RQ1. To what extent does the newly pre-trained model represent requirement engineering domain vocabulary compared to the generic BERT language model?
RQ2. Does the use of the newly pre-trained domain-specific model outperform the generic BERT model in classification performance?
RQ3. Which classifier model (among all classifiers used) performs best?
FIGURE 3: PROPOSED APPROACH
Data collection and analysis: - Because there was no publicly available historical data in the form required by our method, the first step was developing datasets by collecting different software project artifacts, such as agile user stories and requirement specifications, from software projects of various domains. After collecting the data, the train-test and pre-training corpora were prepared. We divided the major sources of the datasets into Functional User Requirements and User Stories. A total of 91,941 user stories and 8,345 functional user requirements were extracted from different software repositories [14] such as Kaggle, PROMISE, PURE, ZENODO, Jira, and COSMIC websites and forums [15]. Of these collected data, 6,990 were found already measured following COSMIC principles.

In order to increase the volume of train-test data, we conducted COSMIC-based functional size measurement using human experts and the ScopeMaster tool. 15,000 requirements and user stories were measured using ScopeMaster and 3,000 with human experts. A total of 21,990 measured and more than 400,000 unmeasured user requirements and user stories were collected. The unmeasured datasets were used for the further pre-training and the measured datasets for training the classifier models. After removing outliers, the final set of data ready for model training was 20,371. Due to the capacity of the running machine and the available time, we were forced to reduce the pre-training data to 204,027. Table 1 shows the summary of the extracted data (both train-test and pre-training).
TABLE 1: SUMMARY OF EXTRACTED DATA

Corpus group   Details (measured by / source)     Count      Total
Train-test     Experts                             3,000
(measured)     ScopeMaster                        12,000     21,990
               Previously measured                 6,990
Pre-train      Unmeasured                         78,296
               Stack Overflow                     57,600    356,286
               Training data used without label   21,990
               Miscellaneous                     198,400

# of outliers: 1,619          # of extracted pre-training dataset: 356,286
# of train-test: 20,371       # of used pre-training dataset: 204,027
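As a quick sanity check (our own, not from the paper), the counts reported in Table 1 are internally consistent:

```python
# Verify that the Table 1 subtotals add up to the reported totals.
train_test = {"Experts": 3_000, "ScopeMaster": 12_000, "Previously measured": 6_990}
pre_train = {"Unmeasured": 78_296, "Stack Overflow": 57_600,
             "Training data without label": 21_990, "Miscellaneous": 198_400}

assert sum(train_test.values()) == 21_990   # measured train-test total
assert sum(pre_train.values()) == 356_286   # extracted pre-training total
assert 21_990 - 1_619 == 20_371             # train-test size after outlier removal
```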
Experiment and evaluation: - In order to build an efficient model, we performed effective preprocessing on the textual requirements of both the train-test dataset and the pre-training corpus. We performed basic data cleaning tasks such as lower casing; removal of punctuation, URLs, and HTML tags; correction of spellings; stemming; and lemmatization. After the data were cleaned, we tokenized and extracted textual features using the newly pre-trained RE-BERT word embedding model.

The first phase of the proposed approach was creating a domain-specific pre-trained model in the area of requirement engineering. To do so, we further pre-trained the generic BERT-base pre-trained model over a requirement engineering corpus containing about 200,000 requirement and user story texts. The sentences contain a variety of word lengths, ranging from 7 to 250 words. The further pre-training took 6 days and 14 hours on a single Core-i9 CPU with 128 GB memory for 5 epochs. We tested and compared the results of the newly pre-trained model against the generic BERT model. The pre-trained model (RE-BERT) achieved 74.60% objective prediction accuracy and 86.72% training accuracy. RE-BERT is now available in the Hugging Face
community at https://huggingface.co/yohannesSM/re-bert for further investigation of requirement engineering problems. Using the newly pre-trained model, we built deep learning models for performing the COSMIC functional process classification and size estimation downstream tasks.
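The basic cleaning steps described above (lower casing; removal of punctuation, URLs, and HTML tags) can be sketched as follows. Spelling correction, stemming, and lemmatization are omitted because they require external libraries, so this is an illustrative simplification rather than the exact pipeline used:

```python
import re

def clean_requirement(text: str) -> str:
    """Apply the basic cleaning steps to one requirement/user-story text."""
    text = text.lower()                        # lower casing
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

print(clean_requirement("The user <b>selects</b> an exam, see https://example.com!"))
# -> "the user selects an exam see"
```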
Functional Process Classification: - This is the targeted downstream task addressed in this study. The final set of train-test data, with an 80:20 train-test split ratio, was used for the classification task. Using RE-BERT as the tokenization and word-embedding model, we applied deep learning models such as LSTM, Bi-LSTM, and BERT Sequential Classifier to conduct the functional process classification experiments. The vector representation of each textual requirement, together with its granularity level, was used as input for training. The granularity levels are small, medium, large, and complex. Requirement or user story texts falling under one of these levels were annotated following COSMIC measurement standards (i.e., during data collection). The trained model predicts the class or category of an unseen requirement text. The same set of hyperparameters, such as epochs, batch size, learning rate, and maximum sequence length, was used for each deep learning classifier model. Fig. 4b shows the learning curves of the classifier models and Fig. 4a shows the overall summary of their training and validation performances.
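A minimal sketch of the setup just described: the 80:20 split and the shared hyperparameter set. Only the split ratio, the granularity labels, the dataset size, and the 25-epoch setting come from the text; every other value below is a placeholder, not the paper's actual configuration.

```python
import random

GRANULARITY_LEVELS = ["small", "medium", "large", "complex"]  # from the text

def split_80_20(samples, seed=42):
    """Shuffle and split samples into 80% train and 20% test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# Hyperparameters shared across all classifier models.
shared_hyperparameters = {
    "epochs": 25,                 # stated in the conclusion
    "batch_size": 32,             # placeholder value
    "learning_rate": 2e-5,        # placeholder value
    "max_sequence_length": 256,   # placeholder value
}

train, test = split_80_20(list(range(20_371)))  # 20,371 labelled samples
print(len(train), len(test))  # -> 16296 4075
```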
FIGURE 4: LEARNING CURVE (a) AND SUMMARY OF EXPERIMENTAL EVALUATION RESULTS (b) OF CLASSIFIER MODELS
The experimental results show that the LSTM model using BASE BERT provided 88.79% validation accuracy and 0.48 loss, whereas using RE-BERT it achieved 91.13% validation accuracy and 0.17 loss. The Bi-LSTM model using the BASE BERT pre-trained model achieved 90.23% validation accuracy and 0.35 loss, and it achieved 92.73% validation accuracy and 0.18 loss using the RE-BERT model. The BERT classifier model using BASE BERT achieved 90.52% validation accuracy and 0.27 loss, and it achieved 95.10% validation accuracy and 0.18 loss when using RE-BERT. Fig. 4 shows the summary of the experimental evaluation of all deep learning classifier models.
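For reference, the single-run validation-accuracy gains implied by the numbers above can be recomputed directly. Note that these per-run deltas are not the same quantity as the averaged improvement rates the authors report in Table 2, which aggregate over further metrics.

```python
# Validation accuracy (percent) for each classifier when built on the
# generic BASE BERT vs. the further pre-trained RE-BERT embeddings.
val_accuracy = {
    "LSTM": (88.79, 91.13),
    "Bi-LSTM": (90.23, 92.73),
    "BERT Sequential Classifier": (90.52, 95.10),
}

# Absolute gain in percentage points from switching to RE-BERT.
gains = {m: round(re_bert - base, 2)
         for m, (base, re_bert) in val_accuracy.items()}
print(gains)
# -> {'LSTM': 2.34, 'Bi-LSTM': 2.5, 'BERT Sequential Classifier': 4.58}
```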
IV. DISCUSSIONS OF RESULTS

The newly further-pre-trained RE-BERT model achieved 74.60% objective prediction accuracy. Comparing the newly pre-trained RE-BERT with the generic BASE BERT on sample sentence pairs and masked word predictions, RE-BERT shows a 13.85% average improvement over BASE BERT. This shows that pre-training towards a particular domain can produce an efficient language vocabulary for performing context-based learning on a specific downstream task in that domain.

As shown in Table 2, the overall average improvement rate of the RE-BERT based classifiers over the BASE BERT based LSTM, Bi-LSTM, and BERT Sequential Classifier is 4.82%, 2.68%, and 1.40%, respectively. RE-BERT Bi-LSTM achieved a moderate improvement over RE-BERT LSTM; this is because RE-BERT has sufficient vocabulary to extract semantically similar features from the input domain texts, and the bidirectional nature of Bi-LSTM helps it attain better feature learning than its counterpart LSTM. Among the RE-BERT based classifier models, the RE-BERT Sequential Classifier achieved the highest accuracy; this is because the sequential classifier on top of RE-BERT can process long sequences simultaneously and efficiently.
In general, using a pre-trained model in a specific domain (in our case, the requirement engineering domain) provides domain-specific vocabulary for semantic-relationship and context understanding towards a specific downstream task in that domain.

V. CONCLUSION AND RECOMMENDATIONS

Conclusion: - In this study, we conducted COSMIC-based functional process classification using a domain-specific pre-trained model called RE-BERT together with deep learning models. RE-BERT is used for word embedding and feature extraction, as well as serving as a sequential classifier itself. All the models were trained and tested using both the BASE-BERT and RE-BERT pre-trained models. Each of the classifier models was trained for 25 epochs with the same configuration of hyperparameters. The evaluation results show that the RE-BERT based BERT Sequential Classifier achieved a 1.4-4.8% average improvement over the other classifier models. In general, the performance of deep learning and NLP models on a particular downstream task can be improved by developing and applying a domain-specific pre-trained model on that task rather than using a generic pre-trained model.

Contribution: - The contributions are twofold. The first is adding knowledge to the field. Since research investigations in the circle of software metrics, requirement engineering, and software engineering at large are not yet well-matured, future researchers can take advantage of this work and consolidate it to address further software engineering problems. The second is the deployment of the models to real-world software organizations. Early size estimation helps organizations make better project plans ahead of resources, and this in turn reduces the failure rate of software projects.

Future Work and Recommendation: - The dataset used for pre-training was not sufficient to produce a richer, more efficient domain vocabulary; its volume needs to be increased, and the vocabulary of RE-BERT should be updated periodically whenever new technology emerges or changes occur. In the future, we plan to incorporate nature-inspired metaheuristic algorithms for optimal feature selection and hyperparameter optimization so that the performance of our models can be boosted.