
Received: 3 April 2023 Revised: 18 October 2023 Accepted: 15 November 2023

DOI: 10.1002/spe.3303

RESEARCH ARTICLE

DRIP: Segmenting individual requirements from software requirement documents

Ziyan Zhao Li Zhang Xiaoli Lian Heyang Lv

Beihang University, The State Key Laboratory of Software Development Environment (SKLSDE), Beijing, China

Correspondence
Xiaoli Lian, Beihang University, Beijing, China.
Email: [email protected]

Funding information
National Natural Science Foundation of China, Grant/Award Numbers: 62102014, 62177003; State Key Laboratory of Software Development Environment, Grant/Award Number: SKLSDE-021ZX-10.

Abstract
Numerous academic research projects and industrial tasks related to software engineering require individual requirements as input. Unfortunately, according to our observation, several requirements may be packed in one paragraph without explicit boundaries in specification documents. To understand this problem’s prevalence, we performed a preliminary study on the open requirement documents widely used in the academic community over the last 10 years, and found that 26% of them include this phenomenon. Several text segmentation approaches have been reported; however, they tend to identify topically coherent units which may contain more than one requirement. What is more, they do not take the constitutions of semantic units of requirements into consideration. Here we report a two-phase learning-based approach named DRIP to segment individual requirements from paragraphs. To be specific, we first propose a Requirement Segmentation Siamese framework, which models the similarity of sentences and their conjunction relations, and then detects the initial boundaries between individual requirements. Then, we optimize the boundaries heuristically based on the semantic completeness validation of the segments. Experiments with 1132 paragraphs and 6826 sentences show that DRIP outperforms the popular unsupervised and supervised text segmentation algorithms with respect to processing different documents (with accuracy gains of 57.65%–187.53%) and processing paragraphs of different complexity (with average accuracy gains of 54.46%–158.68%). We also show the importance of each component of DRIP to the segmentation.

KEYWORDS
deep learning, requirement items, software requirements, text segmentation

1 I N T RO DU CT ION

Sound individual software requirements are the prerequisite of most requirement-related tasks such as requirement clas-
sification,1 relationship extraction, evolution,2–4 quality analysis,5,6 and knowledge extraction.7,8 In addition, they are the basis of an effective software development process.9–11 Specifically, ISO/IEC/IEEE 29148:2018(E) suggests that each system and system element requirement should possess the characteristic of “singular,” that is, “the requirement states a single capability, characteristic, constraint or quality factor.” However, during our work using public software requirement documents or private ones from industry, we found it common to describe software requirements in paragraphs that contain several related requirements without explicit separators. Although some approaches like EARS suggest the syntax of expressing high-quality single requirements,12 many groups still express requirements in this way due to their established habits. This causes at least two problems. First, because no special marks or punctuation distinguish them, the boundaries between different requirements are unclear, making it challenging to divide them. Second, because different requirements are packed into one paragraph, performing a single requirement-related activity, for example, tracing, testing, and change impact analysis, is difficult. We call this phenomenon “requirement adhesion.”

Abbreviations: ASE, International conference on automated software engineering; BERT, Bidirectional encoder representation from transformers; GELU, Gaussian error linear unit; ICSE, International conference on software engineering; MSC, Minimal semantic constitutions; NL, Natural language; RS, Requirements specifications; RE, IEEE international requirements engineering conference; REJ, Requirements engineering journal; REFSQ, Requirements engineering: Foundation for software quality; RS-Siamese, Requirement segmentation siamese; SRL, Semantic role labeling; SBERT, Sentence-BERT.

FIGURE 1 One example showing the target of requirements segmentation. The original paragraph, “Detector data must be acquired and stored in the most effective way. Effectiveness should be evaluated in terms of cost, space requirements, longevity, and speed. This shall lead to the definition of a Gemini 8m Telescopes standard, used on all instruments. In general, operational overheads must be kept as low as possible, to maximize actual observing times. Intermediate storage of raw data in memory on different nodes and in different formats should be kept to a minimum. However, there must be at least two copies - one to secure data as acquired and one to do assessment of data quality on-line (this last copy preferably on removable media).”, is segmented into four requirements: 1. General requirement; 2. Effectiveness evaluation; 3. Operational overheads; 4. Intermediate storage overheads.
For better illustration, we show one example in Figure 1. The original paragraph in the center is selected from
the publicly available Gemini requirement document.13 There are four related but different requirements in this
paragraph. The first sentence represents the overall requirement, which depends on the second and third sentences
because they explain the effectiveness mentioned in the overall requirement. The fourth sentence describes the
third requirement, which is about the operational overheads and is a refinement of the first requirement. The fourth
requirement is about data storage, represented by the last two sentences, which is also a refinement of the first
requirement.
We wanted to know whether or not the phenomenon of requirement adhesion is specific to our involved datasets.
In other words, we were curious as to how often it occurs in requirement documents in general. Thus, we per-
formed a simple survey on publicly available requirement documents widely used in the requirement engineering
(RE) community to explore to what extent it appears in diverse requirements from different groups. To be specific,
we collected 132 available software requirement sets from papers published in the main RE forums, including
IEEE International Requirements Engineering Conference (RE), Requirements Engineering Journal (REJ), (REFSQ),
International Conference on Software Engineering (ICSE), International Conference on Automated Software Engi-
neering (ASE) and ACM SIGSOFT Conference on the Foundations of Software Engineering (FSE) in the past 10
years. These requirements are represented in the form of natural language (NL) software requirement specifica-
tion in PDF or Microsoft Word documents (approximately 72.72%), sets of individual requirements organized as text
or spreadsheets (approximately 19.7%), or models (approximately 7.58%). We focused on the NL requirements in
PDF/Word document because 79% of requirement documents are written in NL in engineering practice14 and these
PDF/Word documents are the original format of most requirements according to our survey. In these documents,
the phenomenon of requirement adhesion appears in approximately 26% of them. Hence this problem is particularly
common.
Many RE-related tasks require requirements-related paragraphs to be segmented into individual requirements.
Specifically, the requirements segmentation task is to group several neighboring sentences to ensure that each group
only represents one requirement and its supporting information. However, requirements segmentation is laborious and
error-prone. First, segmentation clearly requires extra time to perform. In addition to the requirements of their project,
engineers usually need to segment the requirements of many other projects to support tasks such as evolution analysis or
knowledge learning. However, they often do not allocate their time to this kind of work due to the time pressures. Second,
the number of requirements and the exact requirements discussed in each paragraph vary. Therefore, it is not easy to
ensure that each segment is meaningful and self-sufficient (a segment is self-sufficient if it conveys a single requirement).
Last but not least, in most cases, requirement analysts are often not very familiar with the corresponding domain, which
increases the difficulties of segmentation.
To the best of our knowledge, there is no existing research on automated software requirement segmentation. The
task of requirement identification or extraction seems to be similar, but we found that most such studies focus on identify-
ing requirement-related information from early project documentation.15–17 Moreover, the process tasks on requirement
documents, such as requirement classification,18 require the descriptions of individual software requirement as input,
or consider individual sentences as single units to determine whether a requirement is relevant.19 None of them need
to determine the description boundaries of single requirements information, and hence they cannot solve our problem.
While there are some techniques for text segmentation in the field of NL processing,20–23 they do not solve our problem
because of the special characteristics of software requirements. First, the description of software requirements con-
tains some mandatory semantic elements (i.e., subject and action for functional requirements).11 However, the current
approaches do not involve domain-oriented semantic completeness checking and optimization. Second, traditional text
segmentation approaches tend to group topically coherent sentences. However, complicated dependencies exist among software requirements,24–26 and these dependencies cause the descriptions of related but distinct requirements to be topically coherent. Traditional text segmentation approaches are therefore likely to produce many false segments and are unlikely to work well for our problem.
In this study, we propose a two-phase automatic approach named DRIP, that is, Divide Requirement Items in the
Paragraphs of requirement documents. To be specific, we first design a requirement segmentation Siamese (RS-Siamese)
learning framework based on a Bidirectional Encoder Representation from Transformers (BERT)27,28 and Siamese net-
work,29 in which we integrate the semantic relevancy and sentence conjunctions30–33 to measure sentence relatedness.
Then in the second phase, we propose a set of minimal semantic constitutions (MSC) rules to heuristically optimize
the segmentation results according to the completeness checking of semantic elements in each segment based on the
semantic role labeling (SRL) annotations.34,35 Experiments comparing the proposed method with four popular text seg-
mentation approaches demonstrate the promising performance of DRIP on the segmentation of different documents and
paragraphs with different levels of complexity. DRIP obtains considerable gains in accuracy, with average accuracy gains of 57.65%–187.53% on documents and 54.46%–158.68% on paragraphs. We also show the effectiveness of the two phases
of DRIP.

1.1 Contribution

The main contributions of this paper are as follows:

• We are the first to define the phenomenon of requirement adhesion and determine that approximately 26% of the open
requirement documents used in RE academic communities in the last 10 years exhibit this phenomenon.
• We propose the automated approach DRIP, which is based on the semantic similarity, coherence, and cohesion rela-
tions of single sentences as well as the semantic constitutions of software requirements, for automatically dividing
paragraphs into individual requirements in software documents.
• The experiments on eight public requirement documents with requirement adhesion show that DRIP outperforms the
popular text segmentation approaches.

The remainder of this paper consists of the following sections. Section 2 defines the problem of requirement adhesion.
Section 3 investigates the frequency of requirement adhesion in public requirement documents. In Section 4, we briefly
introduce the techniques of Siamese networks and sentence-BERT (SBERT), which are the basis of our approach. Then in
Section 5, we describe the DRIP approach, and in Section 6 we describe experiments for the evaluation of our approach.
The discussions for this study and the related work are presented in Sections 7 and 8, respectively. The final Section 9
concludes this work and discusses future work.

FIGURE 2 Example: a sentence containing multiple requirements. The sentence “The maximum number of heating or cooling units that can run concurrently shall reside in an initialization file and the concurrently running units shall be read from an initialization file and stored in the THEMAS system” packs a reside requirement, a read requirement, and a store requirement.

2 PROBLEM STAT EMENT

2.1 Requirements adhesion

The phenomenon of requirement adhesion occurs in requirement-related paragraphs, which can be from software
requirements specifications (RS) or other requirement-related documents, if (a) the requirements are described in NL
paragraphs, instead of tables, models or other (semi-)structured formats; and (b) usually more than one requirement is
packed into one paragraph, without explicit separators.
It is well known that an RS usually contains, in addition to the requirements, a variety of other information, such as the
system scope and environment, domain properties, and concept definitions.9 Here the requirement-related paragraphs
refer to the sections that describe functional or other quality requirements. Some sentences may not be requirements
in these paragraphs, but they must be highly related with requirements by providing additional explanations or other
supporting information.
We select the term of “adhesion” because the requirements in one paragraph are usually topically coherent and there
are no clear boundaries between adjacent requirements. The number of sentences describing one single requirement may
differ. Sometimes, a single sentence states one requirement. But it is also common that multiple neighbouring sentences
describe one requirement.

2.2 Requirements segmentation

The process of dividing one requirement paragraph into the set of single requirements is referred to as require-
ments segmentation. Essentially, this task is to identify all neighboring sentences that can be grouped. Finally, one
requirement-related paragraph can be grouped into several clusters, each indicating one requirement. In this work, we
treated both descriptive and explanatory statements as part of the same requirement.
Given that a requirement paragraph Req_paragraph is composed of a sentence set S = {s1, ..., sn} (n is the number of sentences, n ≥ 1) and describes a requirement set R = {req1, ..., reqk} (k is the number of requirements described in Req_paragraph, k ≥ 1), si and si+1 (i + 1 ≤ n) are the ith and (i + 1)th sentences in Req_paragraph, and requ is a requirement in R. We define the requirements segmentation problem with respect to sentences si, si+1 and requirement requ in Req_paragraph.
Given: Sentence si ∈ S and Sentence si+1 ∈ S (i + 1 ≤ n).
Predict: Whether si and si+1 should be merged as parts of the same requirement requ.
Once the merge-or-not decision has been made for every pair of neighboring sentences in S, Req_paragraph has been segmented.
In practice, there exists another kind of adhesion, that is, multiple requirements described in one sentence. For
example, there are actually three requirements in the sentence in Figure 2. We will explore this type of segmentation in
future.

3 S U RV E Y O N T H E DIST R IB UTION OF REQUIREMENTS ADHESION

We performed a survey of the phenomenon of requirement adhesion in public requirement documents, which were
obtained from the RE-related academic work published in the past 10 years.

We focused on the mainstream RE forums, including the conferences such as RE, ICSE, REFSQ, ASE and FSE, as well
as the REJ. We crawled papers from their records on the dblp bibliography (for conference, both the main conference and
poster papers are included), and extracted the relevant URLs automatically. Then, we manually checked whether the URLs were accessible and whether they were related to requirements. If so, we downloaded the requirement
documents. Considering that one document could be used by multiple projects, we removed the duplicated data in our
dataset. Finally, we obtained 132 sets of requirements. We did not specifically collect the popular requirement repositories
such as PURE,13 Coest,36 and Promise,37 because they are commonly used in RE research and they are already included
in our dataset. Our repository is probably incomplete and we cannot ensure we have collected all requirement documents.
However, we believe that our dataset is representative.
We scanned these datasets, and we classified them into three categories: those written in NL and stored in
PDF/Microsoft Word documents (hereafter called NL-Docs), sets of individual requirements organized as text or spread-
sheets (hereafter called single-item documents), and those represented using models stored in PDF, XML, or spreadsheets
(hereafter referred to as models). Finally, we counted the number of each type of documents in each forum and the related
references, as listed in Table 1. The first row of this table reveals that, in the past 10 years, 12 NL-Docs38 and 11 single-item documents5,7,39 were used in four ICSE papers. According to the second row, two NL-Docs,40,41 six single-item documents5,42,43 and four model-based requirement sets42,44–46 were used in RE papers. The third row shows that seven NL-Docs,47,48 seven single-item documents40,48–53 and five model documents54–56 were used in REJ papers. For the REFSQ papers, 75 NL-Docs,13,57,58 one single-item document59 and one model60 are involved. Finally, only one single-item document61 was used in ASE papers. We observe that most of the requirements are in the NL-Doc class (96/132 ≈ 72.72%). Moreover, approximately 19.7% of the documents are sets of single items, indicating that these studies required individual requirements as input.
For the 96 NL-Docs, we scanned the content and selected the ones that present groups of requirements in paragraphs with no explicit separators between the requirements. The phenomenon of requirement adhesion exists in 25 of these documents, about 26% of all 96 NL-Docs.
To obtain a rough quantitative understanding of the NL-Docs, we defined the metric docComplexity to measure the
complexity of each document. This metric is the average complexity of all the paragraphs describing software require-
ments in one document. Let the number of requirement paragraphs in a document be parNum, and the complexity of
paragraph i be parComplexity_i. Then the formal definition of docComplexity is as follows.

docComplexity = ( ∑_{i ∈ parNum} parComplexity_i ) ∕ parNum. (1)

parComplexity = (wordNum ∗ conjunctionNum) ∕ (senNum ∗ avgSenLength²). (2)

Paragraph complexity (i.e., parComplexity) is measured by considering the number of words (i.e., wordNum), con-
junction words (i.e., conjunctionNum), involved sentences (i.e., senNum) and the average length of the sentences (i.e.,
avgSenLength), as shown in Equation (2). We define parComplexity following the intuition of human cognition. The more
words in a paragraph, the more complex it is. The more conjunction words, the more sophisticated the logic in it is. More-
over, when there are more sentences with a fixed number of words and conjunction words, the easier each sentence is to

T A B L E 1 The open requirement sets and their sources.


The format of requirements and their sources
NL-Docs Single items Models

ICSE 12 11 0
RE 2 6 4
REJ 7 7 5
REFSQ 75 1 1
ASE 0 1 0
Overall 96 26 10

T A B L E 2 Paragraph distribution by the sentence number.


x1 x2 x3 x4 x5 x6

Number of paragraphs (total 1132) 172 247 323 132 56 202

FIGURE 3 Paragraph distribution by the complexity level (Easy, Medium, Difficult, Difficult plus).

T A B L E 3 Document distribution by the complexity level.


Level E(C <0.1) M(0.1 < C <0.5) D(0.5 < C < 1) D+ (C > 1)

docNum 2 8 5 10

understand, on average. Thus, we design the formula wordNum ∗ conjunctionNum ∕ senNum. This can also be interpreted as the average
length of the sentences (i.e., the number of words) in the paragraph times the number of conjunction words. Moreover, we
divide the measurement by the constant avgSenLength2 , where the avgSenLength is the average length of the requirement
sentences. This work uses the average length of all sentences in the 25 NL-Docs as the value of avgSenLength.
Using the above two metrics, we can measure the complexities of all requirement paragraphs and the corresponding
documents.
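For illustration, the two metrics can be computed directly from token counts. The following Python sketch is one possible implementation, assuming a simple punctuation-based sentence splitter and a small illustrative conjunction list; the actual word lists and tokenizer used in this work are not shown here.

```python
import re

# Illustrative (incomplete) set of conjunction/linking words; an assumption for this sketch.
CONJUNCTIONS = {"and", "but", "or", "because", "therefore", "however", "moreover", "although"}

def par_complexity(paragraph: str, avg_sen_length: float) -> float:
    """Equation (2): wordNum * conjunctionNum / (senNum * avgSenLength^2)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    words = [w.strip(".,;:").lower() for s in sentences for w in s.split()]
    word_num = len(words)
    conjunction_num = sum(1 for w in words if w in CONJUNCTIONS)
    sen_num = max(len(sentences), 1)
    return (word_num * conjunction_num) / (sen_num * avg_sen_length ** 2)

def doc_complexity(paragraphs: list[str], avg_sen_length: float) -> float:
    """Equation (1): average parComplexity over all requirement paragraphs of a document."""
    scores = [par_complexity(p, avg_sen_length) for p in paragraphs]
    return sum(scores) / len(scores)
```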
We selected the paragraphs on software requirements from the 25 NL-Docs by removing the unrelated parts, such
as the introduction section, and obtained 1132 paragraphs consisting of 6826 sentences. We first counted the number of
sentences in them, because this is the most intuitive characteristic of a paragraph. For the convenience of presentation,
we defined six intervals, x1 = [1], x2 = [2 − 3], x3 = [4 − 7], x4 = [8 − 11], x5 = [12 − 15], x6 = [> 15], for the number of
sentences in a paragraph. The distribution of paragraphs in these six intervals is shown in Table 2. We can see that most
paragraphs are mapped to the interval x3 = [4 − 7] and consist of four to seven sentences, followed by [2-3] sentences and
[>15] sentences.
Using Equation (2), we calculated the complexities of these paragraphs and defined four grades: E (easy), M (Medium),
D (Difficult), and D+ (Difficult Plus). These grades correspond to the respective ranges of “parComplexity”: less than 0.1
for E, between 0.1 and 0.5 for M, between 0.5 and 1 for D, and greater than 1 for D+. The distribution of the paragraph
complexities is shown in Figure 3. Most paragraphs (i.e., 390) have a complexity of M. Next most common is a complexity
of E (300 paragraphs) and D+ (240 paragraphs), and the least number of paragraphs (approximately 110) have a complexity
of D.
On this basis, we can calculate the complexities of the documents using Equation (1). We adopted the same intervals for parComplexity, and present the document distribution in Table 3. Most documents (i.e., 10) have a document complexity of D+, followed by eight requirement documents with a complexity of M. Five documents have a complexity of D, and two documents have a complexity of E.

4 BAC KG RO U N D

This section describes two key technologies related to this research: Siamese Network and the SBERT framework.

FIGURE 4 Siamese network structure.

4.1 Siamese network

Although we collected a large number of requirement documents used in the RE academic community, our dataset
(i.e., 25 documents consisting of 1132 paragraphs and 6826 sentences) is far from the huge scale of data required by neural networks. However, Siamese networks have shown adaptability to small-scale datasets, and a previ-
ous study62 has also achieved satisfactory results using smaller datasets in the Siamese network approach. A Siamese
neural network is a type of neural network architecture consisting of two or more identical subnetworks.63 “Identical”
here means that they have the same configuration, parameters and weights. Parameter updates are mirrored in both
sub-networks. A Siamese network is used to determine the similarity of input by comparing its feature vectors. This net-
work was originally used in image similarity recognition.63 Its advantage is that only a small amount of data is needed to obtain good similarity predictions. The Siamese network has become very popular in recent years for its ability to learn from small datasets.62,64
The characteristics of the Siamese network include: (1) Two different inputs are processed by two subnetworks with the same architecture, parameters, and weights. (2) The two networks are mirrors of each other, like Siamese twins; therefore, any change to the architecture, parameters, or weights of one subnetwork also applies to the other. The two sub-networks act as input encoders that calculate the difference between the inputs. (3) A Siamese network uses similarity scores to determine whether the two inputs are similar. The common loss functions include contrastive loss, binary cross-entropy loss, and triplet loss. (4) A Siamese network is a one-shot classifier.
We show the structure of Siamese network in Figure 4. It takes pairs of instances as inputs (e.g., Input1 and
Input2 in our figure), and aims to decide whether they are similar. In particular, the two inputs are first embedded into two vectors GW(X1) and GW(X2). Then the Siamese network learns their vital characteristics and calculates
their distance. The vectors of similar instances will become closer and closer during the training process based on the
selected loss function. To determine whether the two inputs are similar or not, a threshold value of similarity needs to
be set.
Contrastive loss is widely used in Siamese network, defined as follows:

L = (1 ∕ 2N) ∑_{n=1}^{N} [ y d² + (1 − y) max(margin − d, 0)² ], (3)

where d represents the distance between the two instance vectors, and y is the label of whether the two instances are sim-
ilar or fit. y = 1 indicates that the two instances are similar; otherwise, y = 0 means that the two instances are dissimilar.
margin is the set threshold.
The popular distance metrics include Euclidean distance, Cosine distance and Exponential (EXP) distance, defined
as follows:
d_Euclidean = ‖f_n − f_n′‖₂,  d_Cosine = (F1 ⋅ F2) ∕ (|F1||F2|),  d_EXP = exp(−‖F1 − F2‖² ∕ (2σ²)). (4)
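To make the structure concrete, the following PyTorch sketch pairs a shared-weight encoder with a contrastive loss over the Euclidean distance, following Equation (3). The encoder architecture and dimensions here are placeholder assumptions for illustration, not the network used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Both inputs pass through the *same* encoder, so weights are shared by construction."""
    def __init__(self, in_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden))

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(g1, g2, y, margin: float = 1.0):
    """Equation (3): y*d^2 + (1-y)*max(margin-d, 0)^2, averaged and halved."""
    d = F.pairwise_distance(g1, g2)  # Euclidean distance between the two embeddings
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * loss.mean()

# Usage sketch: y = 1 marks similar pairs, y = 0 dissimilar pairs.
model = SiameseEncoder()
x1, x2 = torch.randn(4, 768), torch.randn(4, 768)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
g1, g2 = model(x1, x2)
print(contrastive_loss(g1, g2, y).item())
```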

FIGURE 5 Sentence-BERT network structure.28

4.2 SBERT

Fine-tuning a pretrained model for a specific domain and task is a recent trend in intelligent software engineering. The
pretrained model of BERT has been widely used in various text similarity-related tasks.64–67 However, it requires a high
number of computations (i.e., n⋅(n−1)∕2) to complete the sentence-pair regression task. Moreover, the most common config-
uration of BERT has been shown to generate poor sentence embeddings.68 To solve these two problems, Reimers et al.28
proposed SBERT, which is based on Siamese and triplet network structures to derive semantically meaningful sentence
embeddings. As shown in Figure 5, SBERT adds a pooling operation to BERT’s output to obtain a fixed-size
sentence embedding representation. There are three common pooling strategies: CLS-token, mean-strategy pooling and
max-strategy pooling. Although these strategies are different, the procedure of sentence embedding is the same. They
integrate the similarity measure of the two embeddings into the loss function. Moreover, in experiments, it has been
shown that the max-pooling strategy performs substantially worse than the mean-pooling or CLS token strategies.28
This finding informed our selection of pooling strategies. Three common loss functions are utilized in different tasks:
Mean-Square Error Loss Function for regression task, Cross-Entropy Loss Function, and Triplet Objective Function for
similarity calculation and classification task.
Essentially, SBERT is a Siamese network fine-tuned on BERT, so it can synchronously adjust the parameters of the embeddings of both sentences and ensure that the embeddings share the same parameters and weights. In this way,
the computation time is substantially reduced. Another advantage is that the deeper semantics of the sentences can be
learned.63 This structure provided crucial insights for the design of our requirement segmentation framework (seen in
Section 5.2).
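For readers unfamiliar with SBERT, the sentence-transformers library packages this Siamese/pooling design behind a simple API. A minimal usage sketch is shown below; the checkpoint name is an illustrative choice, not the model fine-tuned in this work.

```python
from sentence_transformers import SentenceTransformer, util

# Example public SBERT checkpoint; any sentence-transformers model can be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

s1 = "Detector data must be acquired and stored in the most effective way."
s2 = "Effectiveness should be evaluated in terms of cost, space requirements, longevity, and speed."

# Each sentence is encoded independently into a fixed-size embedding, so pairwise similarity
# needs only n encodings instead of n*(n-1)/2 cross-encoder passes.
emb1, emb2 = model.encode([s1, s2], convert_to_tensor=True)
print(util.cos_sim(emb1, emb2).item())
```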

5 PROPOSED A PPROACH: DRIP

In this section, we first present an overview of our approach. Then, we describe the details of each step in turn.

5.1 The framework of our approach

Our approach, DRIP, consists of two main phases, the initial segmentation and segment optimization, as shown in
Figure 6.
For the first segmentation phase, we propose a framework called requirement segmentation Siamese (RS-Siamese) network, by borrowing the idea of SBERT,28 to initially detect the requirement description boundaries from the paragraphs. To be specific, we first perform sentence embedding by transforming textual paragraphs into the vectors of single sentences, as described in Section 5.2.1. We obtained the contextual word embedding of the sentences from the pretrained BERT model, and then we designed two pooling layers (i.e., mean-pooling and CLS token-based pooling) to obtain the vector of each word and the CLS classification token of the whole sentence. We merge these vectors to acquire the embedding of the whole sentence. Then in Section 5.2.2, we propose a multitask sentence relatedness network composed of a sentence similarity model and a sentence relationship model to help determine the sentence boundaries.

FIGURE 6 The procedure of our divide requirement items in the paragraphs. Phase I (initial segmentation) comprises sentence embedding (Section 5.2.1) and the sentence relatedness models, that is, the sentence similarity model and the sentence relationship model feeding a decision layer (Section 5.2.2); Phase II (segment optimization, Section 5.3) comprises heuristic incompleteness detection and merging with the supporting neighbours.

FIGURE 7 One example showing the output of each phase of our divide requirement items in the paragraphs. The RS-Siamese network first splits the input paragraphs into initial segments (candidate requirements); the segment optimizer then checks each candidate for completeness and merges incomplete candidates with their neighbours to produce the final requirement segments.
For the subsequent optimization phase, we designed a segment optimizer to heuristically adjust the ini-
tial segmentation results according to their semantic constitution (Section 5.3). We defined seven patterns
for the minimal semantic constitution of the requirements, and we use them to check the completeness of
each initial segment. For the incomplete ones, we adjust the boundaries according to the tactics we have
defined.
To better illustrate our DRIP, we show the output of each phase in Figure 7. For the input paragraph, our RS-Siamese
in the first phase will divide the input paragraph into five candidate requirements. Then in the second phase, our segment
optimizer will decide the completeness of each candidate requirement and perform the merging operation to generate
the final three requirements.

5.2 Requirement segmentation Siamese network (RS-Siamese)

5.2.1 Sentence embedding

The goal of sentence embedding is to transform NL requirement sentences into vectors for the subsequent sentence
relatedness-based segmentation.
We first preprocess the paragraphs by splitting paragraphs into sentences and tokenizing them into single words. We
also perform word lemmatization to ensure that different forms of a word can be analyzed as a single term. Then, we
obtain the word embeddings with the pre-trained BERT model.
Because of the variable lengths of the direct word embedding of BERT, we add a pooling layer to obtain a fixed-size
output representation for the following calculation. The common pooling strategies are mean-pooling, max-pooling and
the use of the output of the first token (the [CLS] token).28 Mean-pooling calculates the average value of all output vectors.
Max-pooling selects the maximum value of all output vectors, and CLS token inserts a specific classification token [CLS]
at the beginning of each sequence designed for the downstream classification tasks. We selected mean-pooling and CLS
token, inspired by experimental results revealing that both perform better than max-pooling for the classification task.28
To illustrate the working principle of the mean-pooling layer, we consider an example. Suppose we have the sentences “I love to eat pizza and ice cream.” and “I enjoy to drink juice.”, as shown in Figure 8. These two sentences have different lengths, with eight and five words, respectively. After word embedding, their vector sizes are [3,8] and [3,5].
To handle the varying vector sizes of sentences, we employ the mean-pooling operation, which averages over the token dimension, resulting in a fixed-size vector representation, that is, [1,3] in the example.
We concatenate the vectors after these two pooling layers and obtain the embedding of each single requirement
sentence. We use the default value (768) of the BERT pretraining model as the vector size.
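The pooling and merging step can be sketched with the HuggingFace transformers library as follows. This is a simplified illustration assuming a standard bert-base-uncased checkpoint, not the exact embedding code of DRIP.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed_sentence(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # [1, num_tokens, 768]
    mean_vec = hidden.mean(dim=1).squeeze(0)        # mean-pooling over tokens -> [768]
    cls_vec = hidden[:, 0, :].squeeze(0)            # [CLS] token vector -> [768]
    return torch.cat([mean_vec, cls_vec])           # merged fixed-size sentence embedding

v1 = embed_sentence("I love to eat pizza and ice cream.")
v2 = embed_sentence("I enjoy to drink juice.")
print(v1.shape, v2.shape)  # both fixed-size despite the different sentence lengths
```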

5.2.2 Sentences relatedness network

The decision to merge two neighboring sentences as one requirement item is made using our proposed sentence
relatedness network.
We model sentence relatedness from two aspects: semantic similarity and structural relationships (e.g., causal relation-
ships), in accord with our human habits and intuition. These two aspects are treated as two parallel models (i.e., a sentence
similarity model and sentence relationship model) to support the multitask training for the boundary determination in
the decision layer.
Sentence similarity model Similarity is a common and key metric for measuring the relatedness of two sentences.69
We also adopt it. Inspired by the consistently good performance of the vector space model (VSM) for computing similarity
across multiple software-related datasets,70 we also use cosine similarity, on the vectors generated from sentence embed-
dings. Let the embeddings of two sentences be encoded into d1 and d2. Then, their similarity relationship is denoted as
[d1 ⊕ d2]. We use 1 − [d1 ⊕ d2] as the regression objective function because the loss value needs to be as small as possible.
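As a sketch, this objective can be written as follows, where d1 and d2 denote the two sentence embedding tensors; it mirrors the 1 − cosine formulation rather than reproducing DRIP’s training code.

```python
import torch
import torch.nn.functional as F

def similarity_loss(d1: torch.Tensor, d2: torch.Tensor) -> torch.Tensor:
    """Regression objective 1 - [d1 ⊕ d2], where ⊕ denotes the cosine similarity of the embeddings."""
    return 1.0 - F.cosine_similarity(d1, d2, dim=-1).mean()
```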
Sentence relationship model Sentence relationships such as the causal relationship are essential clues about sen-
tence relatedness.71 They are usually indicated by linking words between the neighboring sentences/clauses, such as the
term “therefore,” which indicates a causal relationship. Many documents can be retrieved with common search engine

Mean-pooling

Mean-pooling

Vector size = [3,8] Vector size = [3,5]

FIGURE 8 An example of using Mean-Pooing on the sentences with varied lengths.


1097024x, 0, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/spe.3303 by <Shibboleth>[email protected], Wiley Online Library on [15/03/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
ZHAO et al. 11

(e.g., Google) using the query “linking words,” and we collected additional linking words from the “Dictionary of Linking
Words: A Guide to Translation Studies” and the (gray) literature on rhetorical structure theory.30–33
According to rhetorical structure theory,71,72 we temporarily selected four common relation types for use in the present work: causal, supplementary, transitional, and interpretive. We selected them because they are more likely to indicate
that the same topic is discussed in the corresponding neighboring sentences. Further, some semantic information will
be lost if only one sentence is analyzed and the others with any of these relationships are ignored. Thus, we integrated
them into our network. For d1 and d2, we use [d1 ⊗ d2] to indicate that the two requirement statements have at least one
of the relationships we focus on. Some of these relationships are indicated with obvious keywords, but there may be no
obvious lexical features in many cases. We annotate and classify the textual relationship between the adjacent requirement
sentences and train the network using the softmax classification formula as the loss. In this study, we primarily focused
on four common relation types. However, we acknowledge the potential impact of other relationship types and plan to
investigate them in future research to further enhance our requirement segmentation approach.
Decision layer The tuples of the sentence embeddings and loss values, that is, <d1, d2, 1 − [d1 ⊕ d2]> and <d1, d2, [d1 ⊗ d2]> from the sentence similarity model and sentence relationship model, respectively, are the input of the decision layer. We designed a merging decision mechanism based on the cosine distance of the two sentence vectors.
It controls the model training through a gradient descent algorithm.
Hyperparameter selection: We retained the original architecture of BERT, containing 12 Transformer layers. Each
layer integrates multihead self-attention mechanisms as well as feed-forward neural networks. The Gaussian Error Linear
Unit (GELU) served as the activation function, while the optimizer in use was Adam, and the embedding size was set at
768.
The learning rate was chosen from a range of options: 5e − 5, 4e − 5, 3e − 5, and 2e − 5. Ultimately, we settled on
the lower learning rate of 2e − 5 within this range due to its proven stability and effectiveness,28 especially during the
fine-tuning of pretrained models like BERT. The number of epochs was set to 30 after numerous experiments, ensuring
the model comprehensively learns data features while preventing overfitting of the training data.

5.3 Segment optimization

Sentences are represented within their context. Sometimes, it is impossible to interpret one sentence out of its context because some syntax elements are missing (e.g., in an imperative sentence) or anaphora are used. Because we expect the
extracted requirement items to be self-sufficient, we assume that segments that are hard to understand without their supporting neighbors, which contain their missing or referenced semantic elements, should be merged with these neighbors. This
led to the design of our segment optimizer.
To be specific, our optimizer is composed of two steps: heuristic incompleteness detection and supporting neighbor
merging.

5.3.1 Heuristic incompleteness detection

To check for the completeness of individual requirement statements, we define seven patterns for the minimal semantic
constitution of requirements, according to three typical sentence structures. If one requirement statement matches with
a pattern, the lack of any element in this pattern indicates the incompleteness of this requirement.
To determine the completeness of all candidate requirement items automatically, we must first annotate their semantic
elements. In this work, we perform SRL34,35 using the AllenNLP tool.73 SRL is the task of recognizing arguments for a
given predicate or verb of a sentence and assigning semantic role labels. It concentrates on the predicates of sentences
and generates the semantic relationships between the components and predicates, that is, the predicate and argument
structure. It is an important intermediate step in many NL processing tasks, such as summary generation74 and special
information identification.75
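The SRL tags can be obtained programmatically; a minimal sketch with AllenNLP is shown below. The public model archive URL is an assumption of an available SRL checkpoint, not necessarily the one used in this work.

```python
# Requires the allennlp and allennlp-models packages to be installed.
from allennlp.predictors.predictor import Predictor

# Assumed public BERT-based SRL model archive; any AllenNLP SRL archive works the same way.
SRL_MODEL = ("https://storage.googleapis.com/allennlp-public-models/"
             "structured-prediction-srl-bert.2020.12.15.tar.gz")

predictor = Predictor.from_path(SRL_MODEL)
result = predictor.predict(
    sentence="Each thermostat shall have a unique identifier by which that thermostat is identified."
)

# Each detected verb comes with BIO-style tags per token, e.g. B-ARG0, I-ARG0, B-V, B-ARG1, ...
for verb in result["verbs"]:
    print(verb["verb"], verb["description"])
```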
Considering that there are multiple sentence structures in English expression, we hope to define a set of patterns
to support the automatic incompleteness checking. We randomly selected 940 requirement sentences and determined
that there are three common sentence structures: subject-verb-object active voice structure, passive voice structure, and
subject-link verb-predicative structure. These structures account for approximately 94.9% of structures of the considered
sentences.

We define the MSC rules for each sentence structure; each rule specifies the necessary semantic roles, such as an agent or an action, that a sentence with this structure must contain. While checking for incompleteness, we retrieve the set of MSCs and check whether a sentence contains an MSC. If so, this sentence is semantically complete; otherwise, it is incomplete.
The MSCs are defined based on the semantic role labels and sentence patterns.
Subject-verb-object structure This is the most common structure.
MSC I: Arg0 + V + Arg1 The requirement represented using the subject-verb-object structure should contain one
agent (labeled as Arg0 in SRL), one operation (the predicate labeled as V), and at least one object (labeled as Arg1).
Example 1. [ARG0: Each thermostat] [ARGM-MOD: shall] [V: have] [ARG1: a unique identifier by which
that thermostat is identified in the THEMAS system].
In this example, all of the ingredients annotated as ARG0, V and ARG1 are required, and this requirement would be
incomplete if any of them is missing. Note that we claim that all three ingredients are required, but the string of each ingredient
may be simpler. For example, the object can be a unique identifier without the attributive clause.
MSC II: Arg0 + V + ArgM-MNR ArgM-MNR (i.e., the manner argument) specifies how an operation is performed. In
this pattern, the object can be the default, but the way of operation execution is critical, as seen in the following example.
Example 2. [ARG0: The data request method] [V: communicates] [ARGM-MNR: through an asynchronous
data access interface].
Passive voice structure A main action-verb or state-verb should be contained in the requirement specified in the
passive voice structure.
MSC III: Arg1 + main action-verb + by + Arg0 If the main verb is an action-verb (i.e., the verb expressing system’s
action), there should be an Arg0 indicating the agent of the requirement followed by the word “by.”
Example 3. [Arg1: The user report’s location and name] [ARGM-MOD: shall] be [V: selected] [ARG0: by
the operator].
MSC IV: Arg1 + be + state-verb If the main verb is a state-verb (e.g., “existing”, “required”, “done”), there should be
an Arg1 indicating the object of the state.
Example 4. [ARG1: Changes in the configuration] [ARGM-MOD: could] be [V: done] [ARGM-MNR: with-
out on server running.].
Subject-link verb-predicative structure There are two primary sentence patterns in this structure, that is, the subject represented with a pronoun (e.g., “it” or “there”) or not.
“It/There is … that”: The MSC are defined on the subordinate clause, which can be detected by AllenNLP.
MSC V: MSCs I and II for the subordinate clause in active voice If the subordinate clause uses a subject-verb-object
structure, we use the subject-verb-object structure rules (i.e., MSCs I and II) to analyze the completeness.
MSC VI: MSC III and IV for the subordinate clause in passive voice If the subordinate clause is the passive voice
structure, we use the passive voice structure rules (i.e., MSC III and IV) to analyze the completeness.
Example 5. It [V: is] [ARG2: essential] [ARG1: that the LAN shall support the majority of the Gemini 8m
Telescopes system internal communication needs].
In this example, the main semantic part is in the subordinate clause. Thus, we perform SRL on this clause.
The subordinate clause of Example 5: [ARG0: the LAN] [ARGM-MOD: shall] [V: support] [ARG1: the majority of
the Gemini 8-m Telescopes system internal communication needs].
We can see that the subordinate clause uses a subject-verb-object structure, and it has been annotated as Arg0 + V +
Arg1. Thus, we use MSC I and decide that this requirement is complete.
“NN is … ”: Sometimes the object can be a noun (phrase) for the subject-link verb-predicative structure.
MSC VII: ARG1 is … [ARGM-LOC or ARGM-MNR or ARGM-TEM]. We believe that the [LOC or MNR or TEM]
is a property of the agent. Moreover, we regard a structure like this as a complete requirement.
Example 6. [ARG1: Users information] is [V: recorded] [ARGM-LOC: in a Microsoft Access database that
shall reside on the supervisor’s computer].
The Usage of MSC Rules: When using MSCs for checking the completeness of candidate requirements, we firstly
employ the AllenNLP tool to obtain the SRL tags for each unit in the requirement sentences. Then we match the tags of
each sentence against the seven pre-defined MSCs. We sequentially compare all MSC rules with the SRL tag sequences
of each annotated requirement sentence. A match occurs when the SRL results of a sentence include all roles specified
in a particular MSC rule, with their order preserved, although these roles do not necessarily have to appear consecutively
within the sentence. If the sentence’s semantic structure can be matched with any of the MSC rules, it is marked as
“Complete.” Conversely, it is marked as “Incomplete.”
Consider the Example 1 under MSC I, where the SRL results consist of four components: ARG0, ARGM-MOD, V, and
ARG1. Obviously, the three components of MSC I can be matched sequentially, that is, ARG0, V, and ARG1, so we deem this sentence to be complete.
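The matching procedure described above amounts to checking whether an MSC’s role sequence occurs as an ordered, not necessarily contiguous, subsequence of a sentence’s SRL roles. A minimal sketch, with three of the MSCs reduced to illustrative role sequences, is as follows (MSC IV–VII additionally check the verb type and subordinate clauses, which is omitted here):

```python
# Illustrative role sequences for three of the MSC rules (simplified assumption for this sketch).
MSC_RULES = {
    "MSC I":   ["ARG0", "V", "ARG1"],      # active voice: agent + operation + object
    "MSC II":  ["ARG0", "V", "ARGM-MNR"],  # active voice: agent + operation + manner
    "MSC III": ["ARG1", "V", "ARG0"],      # passive voice: object + action-verb + "by" + agent
}

def matches(srl_roles, rule):
    """True if `rule` occurs as an ordered (not necessarily consecutive) subsequence of `srl_roles`."""
    it = iter(srl_roles)
    return all(role in it for role in rule)

def is_complete(srl_roles):
    return any(matches(srl_roles, rule) for rule in MSC_RULES.values())

# Example 1: [ARG0][ARGM-MOD][V][ARG1] matches MSC I -> complete.
print(is_complete(["ARG0", "ARGM-MOD", "V", "ARG1"]))  # True
# "The ... shall be selected." yields [ARG1][ARGM-MOD][V]: no ARG0 -> incomplete.
print(is_complete(["ARG1", "ARGM-MOD", "V"]))          # False
```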

5.3.2 Merging with supporting segments

We need to merge incomplete candidate requirement descriptions with their supporting neighbors, which work as the
necessary context for better understanding.
For each incomplete sentence, we check whether it has been merged with other sentences. If not, we perform the
merging activities simply according to the position of the sentence in the paragraph. If the incomplete item is the first
sentence in the paragraph and there are other segments in this paragraph, we merge it with the following segment. If this
following segment is also incomplete, we will continue merging with its next one until a complete or the last candidate in
this paragraph is reached. If the incomplete item is not the first segment and there are others before it, we merge it with
its previous segment. Similarly, we perform an iterative merging with the previous segment until a complete or the first
candidate is reached.
To better illustrate our merging scenarios, we show one example here. We select the first three candidate requirements
from one paragraph labeled Cand-Req-1, Cand-Req-2, and Cand-Req-3. Their SRL annotations are as follows.
Cand-Req-1: [ARG1: The location and storage format of report] [ARGM-MOD: shall] be [V: selected]. (Match MSC
III: missing the Arg0. It is incomplete.).
Cand-Req-2: [ARG1: The password of a report] [ARGM-MOD: shall] be [V: set] [ARG0: by the operator]. (Match MSC III. It is complete.)
Cand-Req-3: [ARG1: The permission to modify the report] [ARGM-MOD: should] be [V: set]. (Match MSC III:
missing the Arg0. It is incomplete.).
According to the MSCs we defined, the three candidate requirements are incomplete, complete, and incomplete, respectively, and we merge the three into one segment. We merge the first two segments because the incomplete Cand-Req-1 is the first one in its paragraph and the following segment is complete. We then merge Cand-Req-3 into this segment because the incomplete Cand-Req-3 is a middle segment in the paragraph, and its previous one is complete.
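A minimal Python sketch of this merging heuristic, operating on a list of initial segments that are each flagged as complete or incomplete, is shown below; it is a simplified view of the optimizer, not its full implementation.

```python
def optimize_segments(segments, complete):
    """segments: candidate requirement strings; complete: parallel list of booleans.
    Returns the merged requirement segments following the heuristic described above."""
    groups = []        # each group is a list of segment indices forming one requirement
    absorbing = False  # True while an incomplete leading group is still waiting for a complete segment
    for i, ok in enumerate(complete):
        if i == 0:
            groups.append([i])
            absorbing = not ok          # an incomplete first segment absorbs its followers
        elif absorbing:
            groups[-1].append(i)        # keep merging forward until a complete segment is reached
            absorbing = not ok
        elif not ok:
            groups[-1].append(i)        # an incomplete middle/last segment merges with its predecessor
        else:
            groups.append([i])          # a complete segment starts a new requirement
    return [" ".join(segments[j] for j in g) for g in groups]

# Worked example from above: Cand-Req-1 (incomplete), Cand-Req-2 (complete), Cand-Req-3 (incomplete).
reqs = optimize_segments(
    ["The location and storage format of the report shall be selected.",
     "The password of the report shall be set by the operator.",
     "The permission to modify the report should be set."],
    [False, True, False])
print(len(reqs))  # 1 -> all three candidates are merged into a single requirement
```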

6 EXPERIMENTAL EVALUATION

6.1 Research questions

• RQ1: Can our trained DRIP effectively segment the paragraphs from different requirement documents
automatically? The documents of different systems in diverse domains are written using different writing styles (e.g.,
active or passive voice) and with various levels of complexity by different groups. This may impact the effectiveness of
our approach. Thus, we would like to explore the performance of our trained model on these documents in comparison
with popular text segmentation approaches.
• RQ2: Can the trained DRIP effectively segment paragraphs of different levels of complexity automatically?
Paragraphs may vary in complexity; for instance, the number of sentences, sentence length, and the clues of the sen-
tence relationships can differ. Intuitively, this could impact the effectiveness of the trained DRIP. Therefore, we explore
the effectiveness of our approach by processing paragraphs with different levels of complexity.
• RQ3: To what extent does each facet of the design in DRIP impact the requirement segmentation? DRIP
includes two parts: initial segmentation and segment optimization. To determine whether both of them are necessary for
the segmentation and if so, to what extent they impact the results, we implemented a partial DRIP system by excluding
the segmentation optimization (DRIP-NonOpt). The parameter settings of the segmentation networks in DRIP and
DRIP-NonOpt are the same.

6.2 Experimental design

6.2.1 Data annotation

We recruited eight people to annotate the 25 software requirement documents that included the phenomenon of require-
ment adhesion (as described in Section 3). These annotators consisted of one lecturer, one PhD student and six computer
science master’s students. All of them could read and understand English technical documents smoothly. We divided
the 25 documents into four sets consisting of 11, 4, 4, and 6 documents according to annotator availability. Each set was
assigned to two annotators, and they annotated the documents independently. To train our framework, they were required
to provide the results of sentence segmentation and the relationships between the neighboring sentences in each seg-
ment. For the convenience of illustration, we show the annotations on the paragraph of Figure 1 in Figure 9. There are
six sentences with an ID of 1-* where the 1 is the ID of the paragraph and * indicates the sentence sequence. Moreover,
four requirements are described, separated by carriage returns. For the requirement composed of multiple sentences, we
illustrate the relationship between the neighboring sentences using the notation #relationship:flag# where the flag can
be “0,” “A,” “B,” and “C,” indicating no relationship, causality, adverse relationship, and a complementary relationship,
respectively.
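For illustration, a small Python helper (hypothetical, not part of DRIP, and assuming each annotated requirement occupies one line of the annotation file) can recover the segments and relationship flags from this notation:

```python
import re

def parse_annotation(paragraph_text):
    """Parse one annotated paragraph (Figure 9 style) into requirements.

    Each line is one requirement; sentences inside a requirement are joined
    by markers of the form #relationship:flag# with flag in {0, A, B, C}
    (no relationship, causality, adverse, complementary).
    Returns a list of (sentences, flags) tuples.
    """
    requirements = []
    for line in paragraph_text.strip().splitlines():
        if not line.strip():
            continue
        # flags between neighboring sentences, e.g. "B" from "#relationship:B#"
        flags = re.findall(r"#relationship:([0ABC])#", line)
        # split on the markers to recover the individual sentences
        sentences = [s.strip() for s in re.split(r"#relationship:[0ABC]#", line) if s.strip()]
        # drop the "1-2." style sentence identifiers
        sentences = [re.sub(r"^\d+-\d+\.\s*", "", s) for s in sentences]
        requirements.append((sentences, flags))
    return requirements
```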
Before the annotators began their annotations, we gave them a 30-min training session about the task, the four kinds
of sentence relationships used in this study, and one example for illustration. We manually extracted requirement sections
with the titles of * requirements (such as functional requirements and general requirements) and stored them in Microsoft
Word format (.docx). We compared the original PDFs and Word documents to ensure consistency in both format and
content. We sent the .docx documents to our annotators. Once we received both annotations for one set, the first author
and the two annotators for this set discussed in person any discrepancies in their annotations until final agreement was
achieved. All annotations lasted approximately nine days and took approximately 105 person-hours in total. To evaluate
the inter-annotator agreement of the manual annotations, we calculated Cohen's Kappa coefficient76 for each pair of
annotations on each document and found substantial to almost perfect agreement according to the scale of Viera et al.77
(the Kappa ranges from 76.6% to 95.2%, with an average of 86.45%).
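For reference, the agreement on one document can be computed with scikit-learn's cohen_kappa_score over the two annotators' per-pair decisions; the labels below are illustrative, not real annotation data:

```python
from sklearn.metrics import cohen_kappa_score

# One binary label per neighboring-sentence pair in a document:
# 1 = the pair belongs to the same requirement, 0 = a boundary lies between them.
annotator_a = [1, 0, 0, 1, 0, 1, 1, 0]   # illustrative labels only
annotator_b = [1, 0, 0, 1, 0, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```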

6.2.2 Experimental settings

RQ1: In addition to the three documents used for the analysis of requirement-document sentence structures for the design
of the segment optimizer in Section 5.3.1, we randomly selected 17 additional documents consisting of 712 paragraphs and
4813 sentences as the training set. The eight remaining documents (420 paragraphs and 2013 sentences) were used as the
test set. The ratio of these two sets was 7:3. For each document, we extracted all the text from the relevant paragraphs by
removing the paragraph marks and formed one large paragraph for the input of the trained model.
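A minimal sketch of this preparation step (a hypothetical .txt layout with blank-line-separated paragraphs is assumed; the exact preprocessing in our pipeline may differ) is:

```python
from pathlib import Path

def build_model_input(doc_path):
    """Concatenate all paragraphs of one requirement document into a single
    large paragraph, as done for the test documents."""
    text = Path(doc_path).read_text(encoding="utf-8")
    # one paragraph per blank-line-separated block; drop the paragraph marks
    paragraphs = [p.strip().replace("\n", " ") for p in text.split("\n\n") if p.strip()]
    return " ".join(paragraphs)
```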
RQ2: We extracted the paragraphs in the eight test documents used to answer RQ1, and grouped them according to
the level of their parComplexity. Then we compared the performance of the baselines and DRIP on these groups. We built
similar single, large paragraphs for each level by removing the paragraph marks.

1-1. Detector data must be acquired and stored in the most effective way technology.
1-2. Effectiveness should be evaluated in terms of cost, space requirements, longevity, and speed. #relationship:0# 1-3.
This shall lead to the definition of a Gemini 8m Telescopes standard, used on all instruments.
1-4. In general, operational overheads must be kept as low as possible, to maximize actual observing times.
1-5. Intermediate storage of raw data in memory on different nodes and in different formats should be kept to a
minimum. #relationship:B# 1-6. However, there must be at least two copies - one to secure data as acquired and one
to do assessment of data quality on-line (this last copy preferably on removable media).

FIGURE 9 Example: the annotation on the paragraph in Figure 1.



RQ3: We implemented the DRIP-NonOpt algorithm by removing the segment optimizer from DRIP. We ran
DRIP-NonOpt on the eight test documents for RQ1 and the paragraph groups for RQ2 and evaluated the impact of the
two stages on the segmentation.

6.2.3 Baselines

We selected four popular text segmentation approaches as the baselines, including two unsupervised and two supervised
ones. For each of the two supervised approaches, we implement two models, the first trained with the dataset in the
original paper and the second trained with our requirement documents.

• TextTiling78 determines the boundaries according to the subtopic shift. It calculates the similarity based on the
distribution of co-occurring words.
• GraphSeg23 constructs a semantic relatedness graph in which semantically similar sentences are connected. The sim-
ilarity integrates the lexical similarity and information content of two sentences. Coherent segments are obtained by
finding the maximal cliques in the graph.
• TopicTiling20 detects sentence segments based on the degree of cosine similarity between the topics obtained using
latent dirichlet allocation (LDA).79 Originally, it was trained with 71,986 Wikipedia documents. We refer to this model
as TopicTiling-Ori. Considering the domain gap between the general Wikipedia documents and software requirement
specifications, we retrained this LDA model on our dataset, that is, the 17 open source requirement documents. We
refer to this model as TopicTiling-Req.
• Koshorek et al.80 proposed a hierarchical neural network model based on bidirectional long short-term memory to label
the boundaries between sentences. They trained the model on the large corpus WIKI-727K, that is, 727,746 English
Wikipedia documents and their hierarchical segmentations. Similar to TopicTiling, we implement two baselines,
named HBiLSTM-Ori and HBiLSTM-Req.

6.2.4 Metrics

The purpose of this experiment was to evaluate the ability of the automated approaches to identify the requirement
boundaries from a given text correctly.
Since requirements segmentation essentially aims to group the sentences that describe the same requirement,
we designed our metrics based on counting all correct and wrong two-neighboring-sentence groups. Therefore, we first
define true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
TP: two neighboring sentences si and sj should be grouped, and they are grouped automatically.
TN: two neighboring sentences si and sj should not be grouped, and they are not grouped automatically.
FP: two neighboring sentences si and sj should not be grouped, but they are grouped automatically.
FN: two neighboring sentences si and sj should be grouped, but they are not grouped automatically.
Based on these four atomic measures, we defined four metrics. We selected accuracy as the first metric. It measures
the ratio of correct grouping decisions, that is, neighboring-sentence pairs that are correctly grouped or correctly
separated, to all neighboring-sentence pairs.

Accuracy = (TP + TN) / (TP + TN + FP + FN). (5)

Because some requirements can be described by a single sentence, it is necessary to evaluate the extent to which the auto-
mated approaches can correctly segment the requirements that are described with more than one adjacent sentence
(called complex requirements for clarity). Thus, we designed three additional metrics: C-Recall (C-Rec), C-Precision
(C-Pre), and C-F-Measure (C-F). C-Rec calculates the ratio of the correctly identified two-neighboring-sentence groups
to all annotated groups in a given text (i.e., one or more paragraphs). C-Pre calculates the ratio of the number of correctly
identified two-neighboring-sentence groups to the number of all automatically identified groups. C-F is the harmonic
mean of C-Rec and C-Pre.

C-Rec = TP / (TP + FN). (6)
C-Pre = TP / (TP + FP). (7)
C-F = (2 × C-Rec × C-Pre) / (C-Rec + C-Pre). (8)

For better illustration, we give one example here. Suppose a requirements paragraph contains four sentences
labeled as a set R = {s1 , s2 , s3 , s4 }. If s1 and s2 should be merged into one requirement, and s3 and s4 are two
different requirements, then the true annotation of this paragraph is {s1 , s2 }, {s3 }, {s4 }. If the predicted result
is that all of the four sentences constitute one single requirement, it would be expressed as {s1 , s2 }, {s2 , s3 },
{s3 , s4 } (i.e., every two neighboring sentences should be grouped). Then the C-Rec is 1 (i.e., the only group
{s1 , s2 } is identified correctly), the C-Pre is 0.33 (i.e., in the identified three groups, only one is correct), and the
Accuracy is 0.33.
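These metrics follow directly from the group counts; the short Python sketch below reproduces the worked example above:

```python
def grouping_metrics(true_groups, pred_groups, num_sentences):
    """Compute Accuracy, C-Rec, C-Pre and C-F from neighboring-sentence groups.

    A group (i, j) means sentences i and j are adjacent and belong to the same
    requirement; every adjacent pair not in a group set is a boundary.
    """
    all_pairs = {(i, i + 1) for i in range(1, num_sentences)}
    true_groups, pred_groups = set(true_groups), set(pred_groups)
    tp = len(true_groups & pred_groups)              # grouped in both
    fp = len(pred_groups - true_groups)              # grouped only by the model
    fn = len(true_groups - pred_groups)              # missed groups
    tn = len(all_pairs - true_groups - pred_groups)  # boundaries found correctly
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    c_rec = tp / (tp + fn) if tp + fn else 0.0
    c_pre = tp / (tp + fp) if tp + fp else 0.0
    c_f = 2 * c_rec * c_pre / (c_rec + c_pre) if c_rec + c_pre else 0.0
    return accuracy, c_rec, c_pre, c_f

# Worked example: R = {s1, s2, s3, s4}, truth {s1,s2},{s3},{s4},
# and a prediction that groups every adjacent pair.
print(grouping_metrics({(1, 2)}, {(1, 2), (2, 3), (3, 4)}, num_sentences=4))
# -> accuracy 0.33, C-Rec 1.0, C-Pre 0.33, C-F 0.5
```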

6.3 Experimental results and analysis

We designed the experiments to evaluate the effectiveness of DRIP on different documents (RQ1), on paragraphs with
different complexity levels (RQ2), and the contribution of each of its stages (RQ3). We address these three questions in
turn.

6.3.1 RQ1: effectiveness on different requirement documents

The purpose of this experiment was to evaluate the effectiveness of DRIP in segmenting the text from different require-
ment documents. We ran the six baselines and DRIP on each of the eight test documents, and obtained the results for
the four metrics (i.e., accuracy, C-Rec, C-Pre, and C-F). The results are listed in Table 4. For the convenience of compar-
ison, we list the docComplexity (dC) of each document. We also calculated the average values of these metrics for each
approach. The best results are highlighted in bold.
We observe that the best accuracy is achieved by DRIP on all eight documents with an average of 85.05%. This indicates
that our approach works better than all baselines on the boundary identification of individual requirements in these
different documents. In addition, for C-F, DRIP yields the best values for five of the eight runs. For the remaining three
documents, it obtains second-best results that are close to the best values; for one of these documents, DRIP-NonOpt obtains
the best C-F. We discuss the improvement of DRIP over DRIP-NonOpt in Section 6.3.3. For the convenience of comparison,
we calculate the gains of DRIP beyond the baselines and DRIP-NonOpt for each of the four metrics, and present the
results in Table 5.
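For clarity, each gain value in Tables 5 and 7 is consistent with the standard relative-improvement definition, as the following snippet illustrates for the average accuracy of DRIP versus TextTiling:

```python
def gain(drip_value, baseline_value):
    """Relative improvement of DRIP over a baseline, in percent."""
    return (drip_value - baseline_value) / baseline_value * 100

print(f"{gain(85.05, 40.79):.2f}%")   # accuracy gain over TextTiling -> 108.51%
```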
Of the two unsupervised approaches, TextTiling works better than GraphSeg, despite being the simpler, more intuitive method.
It obtains better accuracy on six documents and achieves better average results for all metrics. This indicates the efficacy
of lexical similarity between sentences for text segmentation. The design of our sentence similarity model conforms to
this idea (Section 5.2.2).
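As a simplified stand-in for this lexical-cohesion idea (it is not DRIP's learned similarity model nor TextTiling's exact block comparison), the cosine similarity of word-frequency vectors of two neighboring sentences can be computed as follows:

```python
import re
from collections import Counter
from math import sqrt

def lexical_similarity(sentence_a, sentence_b):
    """Cosine similarity of the word-frequency vectors of two sentences.
    A simplified illustration of lexical cohesion between neighboring sentences."""
    bag_a = Counter(re.findall(r"[a-z]+", sentence_a.lower()))
    bag_b = Counter(re.findall(r"[a-z]+", sentence_b.lower()))
    overlap = sum(bag_a[w] * bag_b[w] for w in bag_a.keys() & bag_b.keys())
    norm = sqrt(sum(c * c for c in bag_a.values())) * sqrt(sum(c * c for c in bag_b.values()))
    return overlap / norm if norm else 0.0

print(lexical_similarity(
    "Detector data must be acquired and stored in the most effective way.",
    "Effectiveness should be evaluated in terms of cost, space requirements, longevity, and speed."))
```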
Generally, the supervised approaches trained with their original big dataset (i.e., TopicTiling-Ori and HBiLSTM-Ori)
work better than those trained with our requirement documents (i.e., TopicTiling-Req and HBiLSTM-Req) accord-
ing to the values of C-F, with the single exception of the Gemini document. This illustrates that although our dataset
is specific to the requirement segmentation task, a dataset for more general text segmentation training with a sub-
stantially large scale has an advantage. This is a reminder that DRIP can be improved in future by collecting more
training documents. Moreover, one observation is of particular interest: TopicTiling-Req obtains better accuracy and C-Rec than
TopicTiling-Ori when processing most of the documents. A deeper analysis shows that this reflects an increase
in the ability of TopicTiling to identify single-sentence requirements and a reduction in its ability to identify
multi-sentence requirements. This is due to the unbalanced distribution of these two classes in the requirement documents.
This also explains why TopicTiling-Req performs better with respect to all four metrics when processing the Gemini
document, which has the lowest docComplexity, indicating that most of the requirements are described using single
sentences.

T A B L E 4 Effectiveness comparison on the eight documents.


Documents (#par, #sens)
(dC, level) Approaches Accuracy C-Rec C-Pre C-F

gemini13 (118,235) (dC = 0.04,E) TextTiling 15.58 83.78 16.15 27.07


GraphSeg 55.27 59.45 33.17 42.58
TopicTiling-Ori 16.08 86.49 16.49 27.70
TopicTiling-Req 41.44 93.75 19.53 32.33
HBiLSTM-Ori 18.59 97.3 15.81 27.20
HBiLSTM-Req 41.99 92.5 20.67 33.79
DRIP-NonOpt 86.93 56.76 55.14 55.45
DRIP 91.96 70.27 32.91 44.85

themas13 (50,109) (dC = 0.07,E) TextTiling 54.54 77.78 47.73 59.16


GraphSeg 28.57 81.48 24.18 37.29
TopicTiling-Ori 60.04 87.04 50 63.51
TopicTiling-Req 68.83 98.15 31.55 47.75
HBiLSTM-Ori 70.13 90.0 50.47 64.67
HBiLSTM-Req 61.04 83.33 30.41 44.55
DRIP-NonOpt 81.82 74.07 52.63 61.54
DRIP 84.41 77.78 55.26 64.61
CM17 (18,55) (dC = 0.11,M) TextTiling 31.25 74.43 27.03 39.66
GraphSeg 62.5 57.14 48.78 59.70
TopicTiling-Ori 37.5 85.71 23.53 36.92
TopicTiling-Req 40.63 92.86 17.11 28.89
HBiLSTM-Ori 43.75 92.86 32.55 48.20
HBiLSTM-Req 28.13 50.0 13.21 20.9
DRIP-NonOpt 87.5 71.43 41.47 52.42
DRIP 93.75 85.71 50.0 63.16

Water use13 (26,55) (dC = 0.13,M) TextTiling 37.5 85.71 20.0 32.43
GraphSeg 14.0 82.34 22.22 35.0
TopicTiling-Ori 60.0 88.23 42.25 57.14
TopicTiling-Req 52.15 88.23 26.55 40.82
HBiLSTM-Ori 68.0 97.06 43.33 59.90
HBiLSTM-Req 64.0 88.24 31.58 46.51
DRIP-NonOpt 72.0 58.82 47.62 52.63
DRIP 78.0 67.64 54.76 60.52

Harmon38 (70,435) (dC = 0.33,M) TextTiling 49.79 83.57 41.94 55.85


GraphSeg 18.52 51.72 15.57 23.94
TopicTiling-Ori 51.91 87.14 35.26 50.21
TopicTiling-Req 56.60 95.00 24.27 38.66
HBiLSTM-Ori 59.57 99.28 42.55 59.57
HBiLSTM-Req 46.81 77.14 24.88 37.63
DRIP-NonOpt 84.25 73.57 43.67 54.81
DRIP 88.93 81.43 70.06 75.31
Buyukozkan38 (51,413) (dC = 0.87,D) TextTiling 39.09 84.82 36.12 50.66
GraphSeg 17.69 76.78 15.47 25.75
TopicTiling-Ori 43.21 93.75 37.37 53.47
TopicTiling-Req 42.80 92.86 19.05 31.61
HBiLSTM-Ori 46.09 99.11 35.44 52.21
HBiLSTM-Req 39.51 83.93 20.09 32.41
DRIP-NonOpt 80.65 70.54 34.20 46.07
DRIP 82.30 75.0 60.06 66.70

Penzenstadler38 (36, 276) (dC = 1.13, D+ ) TextTiling 44.62 82.85 40.85 54.72
GraphSeg 20.77 77.14 17.64 28.72
TopicTiling-Ori 47.69 88.57 34.63 49.80
TopicTiling-Req 49.23 91.43 21.48 34.78
HBiLSTM-Ori 53.84 98.57 40.46 57.37
HBiLSTM-Req 46.92 84.29 23.60 36.99
DRIP-NonOpt 75.38 64.29 46.88 54.22
DRIP 80.77 75.71 50.96 60.92
Makropoulos38 (50,435) (dC = 1.64,D+) TextTiling 53.91 75.28 52.19 61.64
GraphSeg 19.34 54.02 15.82 24.48
TopicTiling-Ori 66.67 93.10 39.71 55.67
TopicTiling-Req 69.55 97.13 25.07 39.86
HBiLSTM-Ori 71.60 99.42 54.54 70.44
HBiLSTM-Req 62.55 86.21 37.04 51.81
DRIP-NonOpt 73.66 63.79 46.44 53.75
DRIP 80.24 77.01 53.72 63.29

Average TextTiling 40.79 81.03 35.25 47.65


GraphSeg 29.58 67.51 24.11 34.68
TopicTiling-Ori 47.89 88.75 34.91 49.30
TopicTiling-Req 52.65 93.68 23.08 37.03
HBiLSTM-Ori 53.95 96.7 39.39 54.95
HBiLSTM-Req 48.87 80.70 25.18 38.39
DRIP-NonOpt 80.27 66.66 46.01 53.86
DRIP 85.05 76.32 53.47 62.42

T A B L E 5 Gains of our DRIP on the documents processing (%).


Approaches Accuracy C-Rec C-Pre C-F

TextTiling 108.51 −5.81 51.69 31.0


GraphSeg 187.53 130.50 121.78 79.99
TopicTiling-Ori 77.59 −14.0 53.17 26.61
TopicTiling-Req 61.54 −18.53 131.67 68.57
HBiLSTM-Ori 57.65 −21.08 35.75 13.59
HBiLSTM-Req 74.03 −5.42 112.35 62.59
DRIP-NonOpt 5.95 14.49 16.21 15.89

The supervised approaches trained with their original big datasets perform better than the unsupervised ones. Of
the four supervised approaches (i.e., TopicTiling-Ori, HBiLSTM-Ori, DRIP-NonOpt, and DRIP), DRIP performs best
and DRIP-NonOpt is the second best according to the accuracy. This demonstrates the advantages of the RS-Siamese
network for requirement segmentation. However, if we focus on the values of C-F, HBiLSTM-Ori slightly outperforms
DRIP-NonOpt. This illustrates that HBiLSTM-Ori has advantages in identifying the sentences to be merged. One primary
reason is the huge gap in the size of the training corpus, that is, 727,746 documents for HBiLSTM-Ori versus 17 for DRIP.
We would like to obtain more corpora to improve the performance of DRIP in the future.
Looking at the gains achieved by DRIP over the baselines in Table 5, we note another interesting fact, which is that
the accuracy gains are the largest, the C-Pre gains are the second largest, and the C-Rec gains are the smallest. This
indicates two things. (1) DRIP outperforms the other methods with respect to the identification of the boundaries between
individual requirements that are described using either single sentences or a few adjacent ones. (2) The popular text
segmentation algorithms, unsupervised TextTiling, supervised TopicTiling-Ori, and HBiLSTM-Ori, identify too many
multisentence requirements. That is the reason why their C-Rec values are consistently high (hence the low C-Rec gains) and their C-Pre
values are low (hence the high C-Pre gains). This is a result of the principle of text segmentation, which is to divide a given text into
topically coherent segments. Moreover, the sentences in the same paragraph tend to describe a similar topic with various
dependencies. Furthermore, these general approaches are not aware of the semantic elements of requirements and have
no specific optimization, which is another reason for the low C-Pre results.
Summary: DRIP outperforms six popular text segmentation algorithms, including two unsupervised and four super-
vised ones, with accuracy gains of 57.65%–187.53%. Because traditional text segmentation approaches aim to group
topically coherent sentences, they achieve better C-Rec results, but lower C-Pre and accuracy results.

6.3.2 RQ2: effectiveness on paragraphs with different complexities

To evaluate the effectiveness of DRIP on the processing of paragraphs with different levels of complexity, we adopted
the parComplexity metric, defined in Section 3, and grouped the paragraphs in the eight test documents according to
their complexity level. We present the number of paragraphs and the related sentences at each complexity level as well
as the evaluation results of the four metrics in Table 6. In addition, we calculated the average metrics of the automated
approaches for each of the four levels of paragraph complexity and the gains81 of DRIP with respect to the baselines, and
present the results in Table 7.
The results reveal that DRIP yields the best accuracy for the four levels of parComplexity, and the average accuracy is
approximately 92.04%, with gains of 54.46%–158.68% over the baselines. DRIP-NonOpt yields the second-best results.
Unsurprisingly, for the two unsupervised baselines, TextTiling outperforms GraphSeg. Especially on the 26 paragraphs
with D+ parComplexity, TextTiling achieves the best C-Pre and C-F results, as well as good C-Rec results. This indicates
the advantage of sentence similarity based on word frequency and distribution for dividing the complex paragraphs.
DRIP-NonOpt performs second best.
Moreover, for the two supervised baselines, HBiLSTM-Ori does not obtain substantially better performance than
TopicTiling-Ori for either accuracy or C-Rec, unlike in the processing of different documents. Especially for the paragraphs
with D and D+ level complexities, TopicTiling-Ori obtains a better C-Rec score, indicating a better ability to detect the
sentences to be merged. This illustrates the positive impact of the similarity computation on detecting topically coherent

T A B L E 6 Effectiveness comparison on the paragraphs.


parComplexity level (#par, #sens) Approaches Accuracy C-Rec C-Pre C-F

E(113,622) TextTiling 44.97 80.0 40.0 53.33


GraphSeg 26.34 83.16 25.48 39.12
TopicTiling-Ori 51.28 91.23 34.85 50.43
TopicTiling-Req 52.86 94.04 22.58 36.41
HBiLSTM-Ori 56.21 99.65 41.54 58.64
HBiLSTM-Req 35.09 94.65 17.69 29.81
DRIP-NonOpt 84.81 73.69 44.78 55.71
DRIP 87.17 77.89 47.33 58.88

M(261,1046) TextTiling 51.66 82.84 47.78 60.60


GraphSeg 21.48 82.37 34.52 48.65
TopicTiling-Ori 57.56 92.3 36.36 52.17
TopicTiling-Req 58.06 94.74 2.57 5.01
HBiLSTM-Ori 62.36 97.04 44.93 61.42
HBiLSTM-Req 36.06 93.17 18.40 30.74
DRIP-NonOpt 90.32 89.47 45.0 59.88
DRIP 96.77 94.74 50.0 65.46
D(20,81) TextTiling 48.39 78.95 45.45 57.69
GraphSeg 38.4 78.95 15.79 26.28
TopicTiling-Ori 58.06 94.73 30.5 46.14
TopicTiling-Req 55.00 100.00 11.22 20.18
HBiLSTM-Ori 61.29 89.47 42.5 59.88
HBiLSTM-Req 38.40 86.22 29.22 43.65
DRIP-NonOpt 90.32 89.47 35.79 51.13
DRIP 94.20 94.73 45.0 61.02

D+ (26,264) TextTiling 49.59 90.9 50.0 64.51


GraphSeg 56.09 81.82 22.2 34.92
TopicTiling-Ori 47.96 96.14 35.48 51.83
TopicTiling-Req 55.00 100.00 22.45 36.67
HBiLSTM-Ori 58.53 81.82 39.13 52.94
HBiLSTM-Req 40.15 82.78 32.75 46.93
DRIP-NonOpt 85.0 72.72 38.1 50.0
DRIP 90.0 81.82 42.86 56.26

Average TextTiling 48.65 83.17 45.81 59.08


GraphSeg 35.58 81.57 24.5 37.24
TopicTiling-Ori 53.72 93.6 34.3 50.14
TopicTiling-Req 55.23 97.19 14.71 25.55
HBiLSTM-Ori 59.59 92.25 42.03 58.27
HBiLSTM-Req 37.18 89.21 24.52 38.46
DRIP-NonOpt 87.61 81.34 40.92 54.18
DRIP 92.04 87.26 46.3 60.41

Note: Bold values indicate the best result among all methods.

T A B L E 7 Gains of our DRIP on the paragraphs processing(%).


Approaches Accuracy C-Rec C-Pre C-F

TextTiling 89.19 4.92 1.07 2.25


GraphSeg 158.68 6.98 88.98 62.22
TopicTiling-Ori 71.33 −6.77 34.99 20.48
TopicTiling-Req 66.65 −10.22 214.75 136.44
HBiLSTM-Ori 54.46 −5.41 10.16 3.67
HBiLSTM-Req 147.55 −2.19 88.82 57.07
DRIP-NonOpt 5.06 7.28 13.15 11.50

sentences, as in TextTiling, and indicates a potential improvement to our approach in the future. However, in general,
HBiLSTM-Ori performs better than TopicTiling-Ori.
Just like the results in Table 4, in general, the TopicTiling-Ori and HBiLSTM-Ori approaches outperform the
TopicTiling-Req and HBiLSTM-Req approaches according to the metric of C-F. Moreover, TopicTiling-Req yields better
accuracy and C-Rec results for most kinds of paragraphs but obtains a substantially lower C-Pre.
A comparison of the results of Tables 6 and 4 leads to an interesting observation. In general, all the approaches obtain
better results when processing the paragraphs of different complexities. This may be caused by the more obvious topic
shifts in the big paragraph composed of paragraphs from different documents.
In addition, we observe that the best performance of DRIP is achieved on processing paragraphs with M (medium)
parComplexity, followed by the D (difficult), E (easy), and finally D+ (difficult plus) levels. This is interesting because
the worst performance on the D+ complexity is expected, whereas the second-worst results on E complexity are surprising. After
further analysis on the data and our approach, we find that this simply reflects the importance of text description and
conjunction words. Short sentences with fewer words weaken the performance of our sentence similarity model, and
fewer conjunction words weaken the performance of our sentence relationship model. That is why only moderate results
are obtained on easy paragraphs. Further, the most complex paragraphs are usually hard to understand, even for our
annotators.
Summary: For processing the paragraphs of different complexity levels, DRIP outperforms the baselines with accu-
racy gains of 54.46%–158.68%. Moreover, our approach yields the best performance on processing the paragraphs with M
complexity, followed by those with D complexity.

6.3.3 Addressing RQ3: the stage evaluation of DRIP

The purpose of this experiment was to evaluate the effectiveness of each of the proposed components of DRIP: the initial
segmentation based on the RS-Siamese network and the segment optimization. We evaluated the performance of the
RS-Siamese network by comparing it with the four popular text segmentation approaches. Moreover, the optimizer was
evaluated by comparing the results of DRIP with those of DRIP-NonOpt. The results can be found in Tables 4 and 6.
Looking at the values of accuracy, we observe that DRIP-NonOpt performs better than the four baselines when
compared on both different documents and paragraphs with different parComplexity. This indicates the ability of the
RS-Siamese network to detect the boundaries of individual requirements that are described using one or more sentences.
However, the results for the other metrics for the identification of complex requirements (i.e., those described using mul-
tiple sentences) reveal that its performance is worse than that of HBiLSTM. The primary reason is the huge gap in training
data size, and we expect the RS-Siamese network to perform better with more training data.
We evaluated the effectiveness of the segment optimizer by comparing DRIP-NonOpt and DRIP. From Tables 5 and
7, we can see that DRIP performs better, with an average accuracy gain of 5.95% on different documents and 4.43% on
420 paragraphs with four complexity levels. Moreover, the average C-F gains are larger (15.89% and 6.23%, respectively).

This indicates the good ability of the optimizer to adjust the boundaries of complex requirements. While the optimizer
has demonstrated good performance, it does not guarantee that all generated segments are complete.
Summary: DRIP-NonOpt achieves better accuracy than the four text segmentation approaches, which indicates the gen-
eral effectiveness of the RS-Siamese network on requirement segmentation in documents from different domains and
paragraphs with different levels of complexity. Moreover, the gap between the metrics of DRIP and DRIP-NonOpt shows the
effectiveness of our optimizer when processing documents of different complexity (with an average accuracy gain of 5.95%
and C-F gain of 15.89%) and when processing paragraphs with different levels of complexity (with an average accuracy gain
of 4.43% and C-F gain of 6.23%). Although we defined seven MSCs based on the analysis of sentence structures in three
random documents and observed the effectiveness of the segment optimizer in experiments, the limited number of test
documents highlights the need for future evaluation using larger corpora to ensure broader generalization.

6.3.4 Error analysis

We aim to scrutinize the errors produced by our DRIP, thereby acknowledging its limitations while also shedding light on
future work. Our analysis chiefly targets two types of error: instances where a merge was necessary but did not transpire
(termed negative unmerged instances), and situations where merges were incorrectly performed (referred to as negative
merged). Given the construct of DRIP, we approached our analysis from three perspectives: the impact of sentence length
on DRIP’s performance, the influence of conjunction words, and the effect of neighboring sentence similarity.
We made three interesting observations:

• The length of sentences does not substantially affect the accuracy of DRIP's merging function. By contrasting the
average length (in words) of successfully merged sentences with that of the negatively unmerged or merged ones,
we found no significant differences, as illustrated in Figure 10.
• The presence of conjunctions is a pivotal factor in sentence merging. For instance, among the negative unmerged
pairs, 99% did not include any conjunctions related to the four relationships addressed by our DRIP. However, 22%
of the accurately merged pairs did contain conjunctions, highlighting their importance for merging sentences, as
depicted in Figure 11.
Moreover, we were interested in understanding whether conjunction words associated with other relationships not
addressed by our DRIP could serve as key indicators. Analyzing the ratio of these conjunctions within the negative
unmerged pairs revealed only 4% contained such words. This suggests these words could potentially aid in the merging
process. Yet, we found that 10% of correctly merged sentences included these types of conjunctions. Although we
did not explicitly employ these words, their presence indicates a potential area of improvement for enhancing DRIP’s

F I G U R E 10 Sentence lengths: a comparison between correctly merged sentences and negatively unmerged/merged sentences.


F I G U R E 11 Presence of the conjunctions covered by DRIP in the paragraphs: a comparison between correctly merged
sentences and negatively unmerged/merged sentences.

F I G U R E 12 One example showing that overly similar sentences tend to be merged.

merging accuracy. Therefore, future iterations of our model could benefit from incorporating these conjunctions to
refine the process further (a simple version of this conjunction check is sketched after this list).
• High similarity between neighboring sentences tends to confuse DRIP, leading it to misjudge and erroneously merge
them. For example, some pairs exhibited strikingly similar sentence structures with quite high word overlap, as
demonstrated in the two sentences of Figure 12. This resemblance posed a challenge for the model, complicating
accurate differentiation and ultimately resulting in incorrect merging.
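The conjunction check referred to above can be sketched as follows; the cue-word list is hypothetical and only approximates the conjunctions associated with the four relationships used by DRIP:

```python
# Illustrative cue words; the actual conjunction lists used in the error analysis
# are tied to DRIP's four relationships and may differ.
RELATION_CUES = {"because", "since", "therefore", "thus", "however", "but",
                 "moreover", "furthermore", "also", "that is", "for example"}

def has_relation_cue(sentence):
    """True if the sentence contains any of the (hypothetical) cue words."""
    lowered = f" {sentence.lower()} "
    return any(f" {cue} " in lowered for cue in RELATION_CUES)

def cue_ratio(sentence_pairs):
    """Fraction of neighboring-sentence pairs whose second sentence carries a cue word."""
    hits = sum(1 for _, second in sentence_pairs if has_relation_cue(second))
    return hits / len(sentence_pairs) if sentence_pairs else 0.0
```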

7 DISCUSSIONS

7.1 Discussion of practical significance with the task of requirements traceability

Good requirements should be singular,11 serving as the foundation for most requirement-related tasks such as require-
ments tracing, requirement-based development/testing, and change impact analysis. Practitioners are also advised to "write the requirements in
a fine-grained fashion, rather than creating large paragraphs containing many individual functional requirements that
readers have to parse out."82 The target of our DRIP is to segment these paragraphs into distinct requirements. Here, we
discuss the practical relevance of our DRIP in relation to the task of requirements tracing.
According to Reference 82, “Trace links allow you to follow the life of a requirement both forward and backward,
from origin through implementation.” Throughout the life-cycle of any software product, each requirement is crucial,
warranting its implementation and testing. However, multiple requirements are often bundled into single paragraphs
without explicit dividers in real-world scenarios, posing an obstacle for traceability tasks.
For example, we counted the requirements in the 25 requirement documents collected from the published RE papers
and found that, on average, a single paragraph contains 2.77 requirements. The largest paragraph, in the Ancillotti
document, even covers 29 requirements described in 52 sentences, and six documents contain paragraphs with more
than 10 requirements.

F I G U R E 13 Number of requirements per paragraph in the 25 selected documents.

The result is visually represented in Figure 13 using a box plot. Undoubtedly, one can
imagine how tedious and time-consuming manual segmentation of these paragraphs could be.
In practice, manual efforts are required to segment these paragraphs. During our annotation process, approximately
105 person-hours were consumed for annotating and validating the 25 documents, averaging 4.2 h per document (see
Section 6.2.1). On average, annotating a single paragraph took about 2.78 min (105 person-hours spread over the 1132
paragraphs, each annotated independently by two annotators). In stark contrast, our DRIP processes
one document in just 10 s on a computer equipped with an Intel(R) Core(TM) i7-10700K CPU operating at 3.8 GHz.
Considering these three aspects, our work stands validated in practical tasks. The adherence to established software
standards, the detailed analysis of sentence numbers within paragraphs, and the time required for the annotation process
collectively affirm the significance and applicability of our approach in real-world scenarios.

7.2 Threats to validity

Conclusion validity concerns the statistical analysis of the results and the composition of the subject.83 Our data consist
of only public requirements documents used in RE academic papers published in the four mainstream RE forums in
the last 10 years. Although these data are limited, they are widely used in the academic community and created in real
industry practice by different groups. Thus, we believe that our corpus is a representative one for observing requirement
representation and evaluating our approach.
We recruited people to manually annotate the segmented requirements rather than merging existing isolated, single
requirements to build the reference answer, because we wanted to test in a realistic scenario, which the latter would not
provide. To be specific, we aim to segment the requirements (information) packed into one paragraph; the sentences in one
paragraph usually express coherent topics, and there are also adhesive relationships between them. Simply putting
isolated requirements together would lose these characteristics. However, in future we plan to imitate the process of
specifying requirements by organizing related existing single requirements into paragraphs with appropriate connection
relationships to build the reference answer.
Internal validity concerns the validity of the causal relations between the results and our approaches, that is, the
existence of experimental errors and biases. The biggest threat to internal validity is from the manual annotations of the
individual requirements. To mitigate the possible subjectivity, we recruited eight annotators. Two of them were assigned
the same documents, and they performed the annotations independently. Our author organized joint discussions to
eliminate any discrepancies (Section 6.2.1). The Kappa values showed substantial to almost perfect agreement on the annotations.
External validity concerns the generalization of the experimental results to other corpora. Although we systemati-
cally collected the public requirement documents used in RE research, our dataset is small (i.e., 25 documents consisting

of 1132 paragraphs and 6826 sentences), in which 17 documents (712 paragraphs and 4813 sentences) were used for train-
ing and eight were used for testing. Although these eight documents are from different groups with different writing
styles, we will need to test the generality of DRIP with more diverse documents.
Construct validity concerns the suitability of the metrics used. The biggest challenge of requirement segmenta-
tion is that the number of sentences used for a requirement may vary. To evaluate the automated approaches, we used
the metric of accuracy to evaluate their overall ability to detect individual requirements, which may be described by one
or more sentences. Moreover, another challenge is to detect the multiple neighboring sentences that can be merged into
a self-sufficient unit. Thus, we designed three additional metrics, C-Rec, C-Pre, and C-F, to evaluate the effectiveness of
the identification of these blocks.

7.3 Implications

7.3.1 Advantage over the existing text segmentation approaches

Previous text segmentation approaches, whether unsupervised or supervised, aim to detect topically coherent
segments mainly based on the syntactic (e.g., related or similar words78,84) or semantic relationships (e.g., similarity
calculations20 or topic distributions20,85,86) between neighboring blocks. In this context, one primary implication
suggested by this work is that both the correlations between neighboring sentences and the semantic constitutions of the
individual segmented units matter for domain-specific text segmentation. Besides, rhetorical structure relationships also help:
although we only used four common relation types in this work, the experiments show that DRIP performs
better on paragraphs with more conjunction words.

7.3.2 Practical impact

To the best of our knowledge, DRIP is the first public automated approach for requirements information segmentation.
It can effectively produce the input for many tasks, such as requirements change impact analysis, requirements tracing,
and cost analysis. Besides, DRIP can potentially segment more general requirements-related paragraphs, such as
user reviews and similar product descriptions.

7.4 Limitations

There are four main limitations in our study.

7.4.1 The use of the relation types between sentences is limited

In the design of the RS-Siamese network, we only focused on four sentence relationships (i.e., causal, supplementary, transitional,
and interpretive relationships), as described in Section 5.2.2 in this initial study, and we demonstrated the effectiveness
of this framework with experiments. However, there are 22 relations defined in basic rhetorical structure theory,71 and
the relations are an open set that continues to evolve. Other relations between sentences may also contribute to segment
detection. We would like to explore their impact using regression analysis in future.

7.4.2 The collection of MSCs may be limited

In this work, we analyzed the sentence structures of requirement description in three random documents and defined
seven MSCs. We assume that the sentence structures in the 940 related sentences are representative. Moreover, the
experiments revealed the effectiveness of the segment optimizer based on these MSCs (Section 6.3.3). However, eight
test documents are insufficient for generalization testing. We need to evaluate the proposed segment optimizer with larger
corpora in future.

7.4.3 We did not distinguish the statements representing requirements and those
illustrating/explaining requirements

At present, we only divide the requirement-related paragraphs into several semantic groups, each of which represents
and explains one requirement. In other words, we did not distinguish requirements from non-requirements. Taking the
example in one work on demarcating requirements,19 we would put the sentence “When in remotely controlled mode,
the speed must be limited by the rover navigation controller to 3 km/h with a target speed of 5 km/h” and its following one
“This is because in the remote controlled mode, there is a likely delay ⩾20 ms in information transfer from the controller to
the operations centre" together because they actually describe the same requirement, whereas the work19 discards the latter
one since it does not represent but only explains a requirement.

7.4.4 The utilization of modest dataset in Siamese network

Although our dataset is of a relatively small size (1132 paragraphs), our findings demonstrate satisfactory per-
formance of our model under these limited data conditions. This aligns with the positive outcomes reported by
another researcher who employed the Siamese network approach with smaller-scale datasets (1035 dialogues for
training purposes).62 Thus, despite the limited dataset size, our approach exhibits potential applicability and effec-
tiveness. However, it is essential to consider the impact of dataset size on model performance when applying this
method to larger-scale datasets or different domains, requiring further experimentation and validation to assess
its efficacy.

7.4.5 The segment optimizer cannot guarantee the completeness of each segment

Its aim is only to optimize the generated segments and its effectiveness has been shown in Section 6.3.3. However, it does
not guarantee that all generated segments are complete.

7.4.6 Handling the case of one sentence containing multiple requirements

We are considering the utilization of advanced natural language generation models in future. Specifically, these models
could be deployed to extract vital requirement information (such as operations and input/output data), deduce absent
semantics, and ultimately assemble coherent and logically consistent individual requirement statements.

7.4.7 Our method has not undergone real-world validation and impact analysis

While we have demonstrated the effectiveness of our approach using publicly available datasets, further validation in
real-world scenarios is needed to assess its actual applicability in practical settings. In future work, we plan to conduct
thorough validations and enhance the practical relevance of our findings. Additionally, we will analyze the impact of item
segmentation in real-world projects (like requirement tracing and system testing tasks) to identify potential challenges
and benefits associated with its implementation.

8 RELATED WORK

To the best of our knowledge, there is no related work on the topic of software requirement segmentation. There-
fore, we focus on the research on text segmentation and software requirement identification. Text segmentation aims
to divide a given text into topically coherent segments, which is very close to our aim. Technically, the task of require-
ment identification is to identify sets of single requirements from different documents. This also seems close to
our aim.

8.1 Text segmentation

The text segmentation methods can be divided into two categories: unsupervised and supervised methods.
Most unsupervised approaches are based on the use of lexical chain information.87 They leverage the fact that related
or similar words tend to occur in topically coherent segments and a change in vocabulary indicates segment boundaries.
Early unsupervised text segmentation methods are mainly based on lexical similarity, which assumes that similar
words converge in a fragment. The related work includes TextTiling by Hearst et al.,78 C99 by Reference 21, and U00 by
Reference 88. TextTiling was designed by following the principle that high vocabulary intersection between two adjacent
blocks indicates high coherence and vice versa. C99 is designed based on divisive clustering with a matrix-ranking schema.
U00 is a probabilistic approach using dynamic programming to determine a segmentation with minimum cost. In recent
years, most unsupervised text segmentation methods are topic model-based or graph-based segmentation methods. The
topic model-based approaches include TopicTiling.20,87,89,90 The graph-based approaches include Reference 23. TopicTil-
ing is based on TextTiling,20 and it calculates the similarity between two adjacent blocks based on topic IDs generated by
LDA-based topic modeling instead of words. The methods of References 87 and 89 are also based on the model of LDA to
compute the semantic information similarity. Reference 90 proposes an ordering-based topic model that constrains latent
topic assignments by topic orderings to obtain the semantic information needed for segmentation. The graph-based seg-
mentation methods construct a graph based on the similarity between statements and extract the maximum clique in the
graph as the segmentation result.23
Supervised approaches learn from a set of annotations. Reference 80 presented a dataset for text segmentation and
developed a hierarchical neural network model based on BiLSTM to demonstrate the efficacy of this dataset. Reference
22 aimed to predict segment boundaries in spoken multiparty dialogue, and they used CRF to train a discourse segmenter
using lexical and syntactic features. Reference 85 proposed to predict the topic coherence of consecutive text segments
spanning multiple paragraphs in legal documents, based on Transformer. Reference 84 proposed to divide legal docu-
ments into semantically coherent units. They developed a multitask learning-based deep learning model with document
rhetorical role label shift as an auxiliary task to predict rhetorical roles in legal documents. Reference 86 proposed a
topic-based text segmentation method in relation extraction at the sentence level to segment documents into thematic
coherent segments. Reference 91 proposed to divide pages into homogeneous areas, grouping text lines to complete the
text segmentation based on the idea of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
algorithm.
The advantages of these general text segmentation approaches have been shown with various datasets. However,
because of the unique characteristics of software requirement statements, they cannot solve our problem effectively, as
shown in the experiments (Section 6). First, a requirement statement should contain some essential and mandatory
semantic elements. However, the common approaches do not check the semantic constitutions of these segments. Fur-
thermore, complicated dependencies exist among requirements, which leads to topically coherent segments. This makes
traditional text segmentation approaches perform poorly on our task.

8.2 Requirement identification

Requirement identification research can be classified into two groups: identification of requirement-related information
from the early documents beyond requirement specifications, and demarcating requirements in free-form requirement
specifications.
Requirement information is primarily extracted from user manuals and project reports,16 various domain docu-
ments,15,92–94 requests for proposals,95 and early notes.17 Researchers have used techniques such as topic modeling (e.g.,
LDA),16 information retrieval,15,92,93 machine learning-based classification,17 or deep learning networks18 to identify the
information on requirements from early system documents, with the aim of assisting the generation of software require-
ments (specifications). Usually, it is necessary to decide whether each single sentence is relevant to the
requirements. They do not need to check whether the neighboring sentences describe the same requirement or not. Thus,
this work is essentially different from ours.
Another task is processing requirement specifications, such as requirement demarcation19 and requirement classi-
fication.18 The task of requirement demarcation is to determine whether each of the single sentences in a requirement
specification is requirement-relevant or not. However, the researchers did not determine if these sentences describe the
same requirement. The task of requirement classification aims to distinguish the requirement types, such as functional

[Figure 14 panels: Original requirements; Requirement extraction result (each sentence labeled R for requirement or N for non-requirement); Traditional text segmentation result (three topical clusters); Target of this study (the individual requirements R-1 to R-4).]
F I G U R E 14 One example, selected from the gemini requirement,13 showing the difference between the tasks of requirement
extraction,19 traditional text segmentation,20 and our requirements segmentation.

There is also some work aiming to cluster the same or similar requirements.96–98 They
all require the description of individual software requirements as input, either ready-made requirement items or single
sentences extracted from requirement documents. This is, in fact, the output of our present work.

8.3 Discuss on the related work

To better illustrate the distinctions among requirement extraction, traditional text segmentation, and our approach to
requirements segmentation, we use the paragraph example from the Introduction and display the results of these three
tasks in Figure 14. For simplicity, we have chosen one representative study for each of the first two tasks: work19 for
requirement extraction and work20 for traditional text segmentation. The results for these two works were manually
derived following their respective principles.
The purpose of requirement extraction is to check each sentence in this paragraph and decide whether it is a require-
ment. Thus, the third sentence will be discarded. Whether the neighboring sentences belong to the same requirement is
not a concern in this task. Traditional text segmentation usually tends to group topically coherent sentences, so it divides
the original paragraph into three clusters (i.e., topic 1: effectiveness, topic 2: data acquisition, and topic 3: data
storage). In contrast, we aim to identify every single requirement and its supporting sentences, so we segment the original
paragraph into four requirements (i.e., R-1 to R-4). The essential difference between these three tasks means that the
existing approaches to the first two tasks cannot work in our case.

8.4 Other work

Dialogue disentanglement of developer live chat messages is a hot topic in the software engineering community and is in a
way similar to our problem: both tasks divide a given text into a few segments. However, dialogue disentanglement
divides a dialogue into distinct conversations that are entangled,62,99 whereas we segment paragraphs into individual
software requirements composed of adjacent sentence blocks.

9 CO N C LU S I O N

This study was motivated by our observation, during collaboration with several industrial partners, of the phenomenon of requirement adhesion in requirement documentation. To evaluate its frequency, we extensively collected open-source requirement sets used in the RE academic community and built a repository of 132 documents. We presented a preliminary study on these documents and found that 96 (≈ 72.72%) represent requirements in NL stored as PDF/Microsoft Word files. Of these 96 documents, approximately 26% exhibit the phenomenon of requirement adhesion. Considering the tedious and laborious work of human annotation and the current lack of an automated requirement segmentation approach, we proposed an automated approach called DRIP, which includes the novel multi-task RS-Siamese network and a segment optimizer. To be specific, the RS-Siamese network integrates the similarity of neighboring sentences and their diverse relationships. Our segment optimizer adjusts the sentence merging according to the semantic completeness of the candidate requirements. We demonstrated its promising performance by comparing it with four popular text segmentation approaches on eight public requirement documents randomly selected from our corpus. The requirement corpus, requirement segmentations,* and the complete implementation of DRIP are publicly available via GitHub.†
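As a compact illustration of this two-phase design, the sketch below outlines the control flow: phase one places initial boundaries from pairwise scores of adjacent sentences, and phase two merges segments that are not semantically complete. The two helper functions are hypothetical placeholders for the RS-Siamese network and the semantic-completeness check, not the implementation released in our repository.

# Conceptual sketch of the two-phase pipeline; helper functions are placeholders.
def same_requirement_score(prev_sentence, next_sentence):
    # Placeholder: the real score comes from the RS-Siamese network, which models
    # sentence similarity and conjunction relations; here we only look at whether
    # the next sentence opens with a linking word.
    return 1.0 if next_sentence.lower().startswith(("and", "also", "however")) else 0.0

def is_semantically_complete(segment):
    # Placeholder: the real check validates the semantic constitutions of the
    # candidate requirement; here we use a crude length heuristic.
    return len(" ".join(segment).split()) >= 5

def segment_requirements(sentences, threshold=0.5):
    if not sentences:
        return []
    # Phase 1: place initial boundaries between adjacent sentences.
    segments, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if same_requirement_score(prev, nxt) >= threshold:
            current.append(nxt)
        else:
            segments.append(current)
            current = [nxt]
    segments.append(current)
    # Phase 2: merge a segment backwards when it is not a complete requirement.
    optimized = [segments[0]]
    for seg in segments[1:]:
        if is_semantically_complete(seg):
            optimized.append(seg)
        else:
            optimized[-1].extend(seg)
    return optimized

example = ["The system shall log every transaction.",
           "Also, the log shall be kept for ninety days.",
           "Operators shall be able to export logs."]
print(segment_requirements(example))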
In future work, we will evaluate the usefulness of our approach in real project practice and explore its impact on downstream activities, such as requirement tracing and system testing. We will also try to improve our RS-Siamese network by integrating more types of sentence relationships and training it with a bigger corpus.

AUTHOR CONTRIBUTIONS
Conceptualization: Z.Z., L.Z., and X.L.; methodology and software: Z.Z. and H.L.; validation: Z.Z., X.L., and H.L.; formal analysis: Z.Z. and X.L.; resources: Z.Z.; data curation: Z.Z.; writing - original draft: Z.Z.; writing - review and editing: X.L.; visualization: Z.Z.; supervision: L.Z. and X.L.; project administration: X.L.; funding acquisition: X.L. and L.Z. All authors have read and agreed to the published version of the manuscript.

ACKNOWLEDGMENTS
Funding for this work has been provided by the National Natural Science Foundation of China under grant numbers 62102014 and 62177003. It is also partially supported by the State Key Laboratory of Software Development Environment under grant number SKLSDE-2021ZX-10.

DATA AVAILABILITY STATEMENT


The data that support the findings of this study are openly available in DRIP: Segmenting Individual Requirements from the Paragraphs at https://zenodo.org/record/5687413#.YY8pnWBByM8, reference number 10.5281/zenodo.5687413.

ORCID
Ziyan Zhao https://orcid.org/0000-0002-2728-669X

REFERENCES
1. Hey T, Keim J, Koziolek A, Tichy WF. NoRBERT: transfer learning for requirements classification. In: Breaux TD, Zisman A, Fricker S,
Glinz M, eds. 28th IEEE International Requirements Engineering Conference, RE 2020, Zurich, Switzerland, August 31 - September 4, 2020.
IEEE; 2020:169-179.
2. Deshpande G, Motger Q, Palomares C, et al. Requirements dependency extraction by integrating active learning with ontology-based
retrieval. In: Breaux TD, Zisman A, Fricker S, Glinz M, eds. 28th IEEE International Requirements Engineering Conference, RE 2020, Zurich,
Switzerland, August 31 - September 4, 2020. IEEE; 2020:78-89.
3. Rahimi M, Cleland-Huang J. Evolving software trace links between requirements and source code. Empir Softw Eng. 2018;23(4):2198-2231.
doi:10.1007/s10664-017-9561-x
4. Rahimi M, Goss W, Cleland-Huang J. Evolving requirements-to-code trace links across versions of a software system. Paper presented at:
2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). 2016; Raleigh, NC:99-109.

* https://zenodo.org/record/5687413#.YY8pnWBByM8
† https://github.com/ZacharyZhao55/DRIP

5. Ezzini S, Abualhaija S, Arora C, Sabetzadeh M, Briand LC. Using domain-specific corpora for improved handling of ambiguity in
requirements. Paper presented at: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 2021; Madrid,
ES:1485-1497.
6. Arora C, Sabetzadeh M, Briand LC. An empirical study on the potential usefulness of domain models for completeness checking of
requirements. Empir Softw Eng. 2019;24(4):2509-2539. doi:10.1007/s10664-019-09693-x
7. Schlutter A, Vogelsang A. Knowledge extraction from natural language requirements into a semantic relation graph. Paper presented at:
ICSEW’20: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, Seoul, Republic of Korea.
2020:373-379.
8. Arora C, Sabetzadeh M, Nejati S, Briand LC. An active learning approach for improving the accuracy of automated domain model
extraction. ACM Trans Softw Eng Methodol. 2019;28(1):4:1-4:34. doi:10.1145/3293454
9. Pohl K. Requirements Engineering. Springer-Verlag; 2010.
10. RTCA I. Software considerations in airborne systems and equipment certification. RTCA DO-178C. 2011:1-118.
11. ISO/IEC/IEEE. ISO/IEC/IEEE International Standard - Systems and software engineering – Life cycle processes – Requirements
engineering. ISO/IEC/IEEE 29148:2018(E). 2018:1-104. doi:10.1109/IEEESTD.2018.8559686
12. Mavin A, Wilkinson P, Harwood ARG, Novak M. Easy Approach to Requirements Syntax (EARS). IEEE Computer Society; 2009:317-322.
13. Ferrari A, Spagnolo GO, Gnesi S. PURE: a dataset of public requirements documents. Paper presented at: 2017 IEEE 25th International
Requirements Engineering Conference (RE). 2017; Lisbon, Portugal:502-505.
14. Mich L, Franch M, Inverardi PLN. Market research for requirements analysis using linguistic tools. Requir Eng. 2004;9:40-56.
15. Lian X, Liu W, Zhang L. Assisting engineers extracting requirements on components from domain documents. Inf Softw Technol.
2020;118:106196. doi:10.1016/j.infsof.2019.106196
16. Li Y, Guzman E, Tsiamoura K, Schneider F, Bruegge B. Automated requirements extraction for scientific software. Proc Comput Sci.
2015;51:582-591. doi:10.1016/j.procs.2015.05.326
17. Cleland-Huang J, Settimi R, Zou X, Solc P. The Detection and Classification of Non-Functional Requirements with Application to Early
Aspects. IEEE Computer Society; 2006:36-45.
18. Winkler J, Vogelsang A. Automatic classification of requirements based on convolutional neural networks. Paper presented at: 2016 IEEE
24th International Requirements Engineering Conference Workshops (REW). 2016; Beijing, China;39-45.
19. Abualhaija S, Arora C, Sabetzadeh M, Briand LC, Vaz E. A machine learning-based approach for demarcating requirements in textual
specifications. In: Damian DE, Perini A, Lee S, eds. 27th IEEE International Requirements Engineering Conference. RE 2019, Jeju Island,
Korea (South), September 23-27, 2019. IEEE; 2019:51-62.
20. Riedl M, Biemann C. TopicTiling: a text segmentation algorithm based on LDA. Paper presented at: Proceedings of ACL 2012 Student
Research Workshop. ACL’12. Association for Computational Linguistics. 2012; Jeju Island, Korea: 37-42.
21. Choi F. Advances in domain independent linear text segmentation. Paper presented at: NAACL 2000: Proceedings of the 1st North
American Chapter of the Association for Computational Linguistics Conference. 2000; Seattle, Washington.
22. Hsueh PY, Moore JD, Renals S. Automatic segmentation of multiparty dialogue. Paper presented at: 2006 IEEE Spoken Language
Technology Workshop. 2006; Palm Beach, Aruba.
23. Glavaš G, Nanni F, Ponzetto SP. Unsupervised Text Segmentation Using Semantic Relatedness Graphs. Association for Computational
Linguistics; 2016:125–130.
24. Deshpande G, Motger Q, Palomares C, et al. Requirements dependency extraction by integrating active learning with ontology-based
retrieval. Paper presented at: 2020 IEEE 28th International Requirements Engineering Conference (RE). Zurich, Switzerland; 2020:78-89.
25. Pohl K. Process-Centered Requirements Engineering. John Wiley & Sons, Inc.; 1996.
26. Dahlstedt ÅG, Persson A. Requirements Interdependencies: State of the Art and Future Challenges. Springer Berlin; 2005:95-116.
27. Devlin J, Chang M, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR.
2018:abs/1810.04805.
28. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using siamese BERT-networks. Paper presented at: Conference on
Empirical Methods in Natural Language Processing, Hong Kong, China. 2019:3982-3992. doi:10.18653/v1/D19-1410
29. Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R. Signature verification using a Siamese time delay neural network. In: Cowan JD,
Tesauro G, Alspector J, eds. Advances in Neural Information Processing Systems 6, [7th NIPS Conference, Denver, Colorado, USA, 1993].
Morgan Kaufmann; 1993:737-744.
30. CESUR ÖÜK. Dictionary of Linking Words: A Guide to Translation Studies. Paradigma Akademi Basin Yayin Dăgitim; 2018.
31. Thompson SA, Mann WC. Rhetorical Structure Theory: A Theory of Text Organization. The Structure of Discourse; 1987.
32. O’Brien E. List of conjunctions. 2021. Accessed April 1, 2022. https://www.english-grammar-revolution.com/list-of-conjunctions.html
33. 7esl.com. Linking words, connecting words: full list and useful examples. 2021. Accessed April 1, 2022. https://7esl.com/linking-words/
34. Palmer M, Gildea D, Kingsbury P. The proposition Bank: an annotated corpus of semantic roles. Comput Linguist. 2005;31(1):71-106.
doi:10.1162/0891201053630264
35. Fillmore CJ, Johnson CR, Petruck MR. Background to Framenet. Int J Lexicogr. 2003;16(3):235-250. doi:10.1093/ijl/16.3.235
36. CoEST. CoEST: Center of excellence for software traceability. 2002. Accessed July 1, 2022. http://www.CoEST.org
37. PROMISE. PROMISE: International workshop on predictor models in software engineering. Accessed May 1, 2022. http://promise.site.
uottawa.ca/
38. Venters CC, Seyff N, Becker C, et al. Characterising Sustainability Requirements: A New Species Red Herring or Just an Odd Fish? IEEE
Computer Society; 2017:3-12.

39. Cleland-Huang J, Czauderna A, Gibiec M, Emenecker J. A machine learning approach for tracing regulatory codes to product specific
requirements. In: Kramer J, Bishop J, Devanbu PT, Uchitel S, eds. Proceedings of the 32nd ACM/IEEE International Conference on Software
Engineering - Volume 1, ICSE 2010, Cape Town, South Africa, 1-8 May 2010. ACM; 2010:155-164.
40. Maalej W, Kurtanovic Z, Nabil H, Stanik C. On the automatic classification of app reviews. Requir Eng. 2016;21(3):311-331.
doi:10.1007/s00766-016-0251-9
41. Hayes JH, Dekhtyar A, Larsen J, Guéhéneuc Y. Effective use of analysts’ effort in automated tracing. Requir Eng. 2018;23(1):119-143.
doi:10.1007/s00766-016-0260-8
42. Carvallo JP, Franch X. An empirical study on the use of i* by non-technical stakeholders: the case of strategic dependency diagrams. Requir
Eng. 2019;24(1):27-53.
43. Salay R, Chechik M, Horkoff J, Sandro AD. Managing requirements uncertainty with partial models. Requir Eng. 2013;18(2):107-128.
doi:10.1007/s00766-013-0170-y
44. Kaufmann A, Riehle D. The QDAcity-RE method for structural domain modeling using qualitative data analysis. Requir Eng.
2019;24(1):85-102.
45. Moitra A, Siu K, Crapo AW, et al. Automating requirements analysis and test case generation. Requir Eng. 2019;24(3):341-364.
doi:10.1007/s00766-019-00316-x
46. Nguyen CM, Sebastiani R, Giorgini P, Mylopoulos J. Multi-objective reasoning with constrained goal models. Requir Eng.
2018;23(2):189-225. doi:10.1007/s00766-016-0263-5
47. Lucassen G, Dalpiaz F, van der Werf JMEM, Brinkkemper S. Improving agile requirements: the quality user story framework and tool.
Requir Eng. 2016;21(3):383-403. doi:10.1007/s00766-016-0250-x
48. Horkoff J, Yu E. Interactive goal model analysis for early requirements engineering. Requir Eng. 2016;21(1):29-61.
49. Lucassen G, Robeer M, Dalpiaz F, van der Werf JMEM, Brinkkemper S. Extracting conceptual models from user stories with visual narrator.
Requir Eng. 2017;22(3):339-358. doi:10.1007/s00766-017-0270-1
50. Shi L, Chen C, Wang Q, Boehm BW. Automatically detecting feature requests from development emails by leveraging semantic sequence
mining. Requir Eng. 2021;26(2):255-271. doi:10.1007/s00766-020-00344-y
51. Robinson M, Sarkani S, Mazzuchi T. Network structure and requirements crowdsourcing for OSS projects. Requir Eng. 2021;26(4):509-534.
52. Reinhartz-Berger I, Kemelman M. Extracting core requirements for software product lines. Requir Eng. 2020;25(1):47-65.
53. Villamizar H, Kalinowski M, Garcia AF, Méndez D. An efficient approach for reviewing security-related aspects in agile requirements
specifications of web applications. Requir Eng. 2020;25(4):439-468. doi:10.1007/s00766-020-00338-w
54. Tenbergen B, Weyer T, Pohl K. Hazard relation diagrams: a diagrammatic representation to increase validation objectivity of
requirements-based hazard mitigations. Requir Eng. 2018;23(2):291-329.
55. Pasquale L, Spoletini P, Salehie M, Cavallaro L, Nuseibeh B. Automating trade-off analysis of security requirements. Requir Eng.
2016;21(4):481-504. doi:10.1007/s00766-015-0229-z
56. Wang Y, Li T, Zhou Q, Du J. Toward practical adoption of i* framework: an automatic two-level layout approach. Requir Eng.
2021;26(3):301-323. doi:10.1007/s00766-021-00346-4
57. Sadi MH, Yu E. Modeling and analyzing openness trade-offs in software platforms: a goal-oriented approach. In: Grünbacher P, Perini A,
eds. Requirements Engineering: Foundation for Software Quality. REFSQ 2017. Lecture Notes in Computer Science. Vol 10153. Springer;
2017:33-49.
58. Chattopadhyay A, Niu N, Peng Z, Zhang J. Semantic frames for classifying temporal requirements: an exploratory study. In: Aydemir
FB, Gralha C, Daneva M, et al., eds. Joint Proceedings of REFSQ 2021 Workshops, OpenRE, Poster and Tools Track, and Doctoral Symposium
co-Located with the 27th International Conference on Requirements Engineering: Foundation for Software Quality (REFSQ 2021), Essen,
Germany, April 12, 2021. 2857 of CEUR Workshop Proceedings. CEUR-WS.org; 2021.
59. Li W, Brown D, Hayes JH, Truszczynski M. Answer-set programming in requirements engineering. In: Salinesi C, van de Weerd I, eds. Require-
ments Engineering: Foundation for Software Quality - 20th International Working Conference, REFSQ 2014, Essen, Germany, April 7-10,
2014. Proceedings. 8396 of Lecture Notes in Computer Science. Vol 2014. Springer;2014:168-183.
60. Azmeh Z, Mirbel I, Crescenzo P. Highlighting Stakeholder Communities to Support Requirements Decision-Making. Springer; 2013:190-205.
61. Li M, Shi L, Yang Y, Wang Q. A Deep Multitask Learning Approach for Requirements Discovery and Annotation from Open Forum. IEEE;
2020:336-348.
62. Shi L, Xing M, Li M, Wang Y, Li S, Wang Q. Detection of hidden feature requests from massive chat messages via deep siamese network.
In: Rothermel G, Bae D, eds. ICSE’20: 42nd International Conference on Software Engineering, Seoul, South Korea, 27 June - 19 July, 2020.
ACM; 2020:641-653.
63. Dey S, Dutta A, Toledo JI, Ghosh SK, Lladós J, Pal U. SigNet: convolutional Siamese network for writer independent offline signature
verification. CoRR. 2017; abs/1707.02131.
64. Lin J, Liu Y, Zeng Q, Jiang M, Cleland-Huang J. Traceability transformed: generating more accurate links with pre-trained BERT models.
Paper presented: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 2021; Madrid, ES:324-335.
65. Li B, Zhou H, He J, Wang M, Yang Y, Li L. On the sentence embeddings from BERT for semantic textual similarity. Paper presented at:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020:9119-9130.
66. Sakata W, Shibata T, Tanaka R, Kurohashi S. FAQ retrieval using query-question similarity and BERT-based query-answer relevance. In:
Piwowarski B, Chevalier M, Gaussier É, Maarek Y, Nie J, Scholer F, eds. Proceedings of the 42nd International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019. ACM; 2019:1113-1116.

67. Cai J, Zhu Z, Nie P, Liu Q. A pairwise probe for understanding BERT fine-tuning on machine Reading comprehension. In: Huang JX,
Chang Y, Cheng X, et al., eds. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information
Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. ACM; 2020:1665-1668.
68. Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation. Association for Computational Linguistics;
2014:1532–1543.
69. Arora C, Sabetzadeh M, Briand LC, Zimmer F. Automated extraction and clustering of requirements glossary terms. IEEE Trans Softw
Eng. 2017;43(10):918-945. doi:10.1109/TSE.2016.2635134
70. Lohar S, Amornborvornwong S, Zisman A, Cleland-Huang J. Improving trace accuracy through data-driven configuration and composition of tracing features. In: Meyer B,
Baresi L, Mezini M, eds. Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the
Foundations of Software Engineering, ESEC/FSE’13, Saint Petersburg, Russian Federation, August 18-26, 2013. ACM; 2013:378-388.
71. Mann W, Thompson S. A Theory of Text Organization: Rhetorical Structure Theory. The Structure of Discourse; 1987.
72. Hou S, Zhang S, Fei C. Rhetorical structure theory: a comprehensive review of theory, parsing methods and applications. Expert Syst Appl.
2020;157:113421. doi:10.1016/j.eswa.2020.113421
73. Allen Institute. AllenNLP. 2021. Accessed October 1, 2022. https://allennlp.org/
74. Yan S, Wan X. SRRank: leveraging semantic roles for extractive multi-document summarization. IEEE/ACM Trans Audio Speech Lang
Proc. 2014;22(12):2048-2058. doi:10.1109/TASLP.2014.2360461
75. Veizaga A, Alférez M, Torre D, Sabetzadeh M, Briand LC. On systematically building a controlled natural language for functional
requirements. Empir Softw Eng. 2021;26(4):79. doi:10.1007/s10664-021-09956-6
76. Cohen J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968;70(4):213-220.
77. Viera A, Garrett J. Understanding Interobserver agreement: the kappa statistic. Fam Med. 2005;37:360-363.
78. Hearst MA. TextTiling: segmenting text into multi-paragraph subtopic passages. Comput Linguist. 1997;23(1):33-64.
79. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993-1022.
80. Koshorek O, Cohen A, Mor N, Rotman M, Berant J. Text segmentation as a supervised learning task. In: Walker MA, Ji H, Stent A,
eds. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018. Vol 2 (Short Papers). Association for Computational Linguistics;
2018:469-473.
81. Hossen MK, Kagdi H, Poshyvanyk D. Amalgamating source code authors, maintainers, and change proneness to triage change requests.
Paper presented at: ICPC 2014: Proceedings of the 22nd International Conference on Program Comprehension, Hyderabad. 2014:130-141.
82. Wiegers K, Beatty J. Software Requirements. A division of Microsoft Corporation, One Microsoft Way, Microsoft Press; 2013:98052-96399.
83. Wohlin C, Runeson P, Host M, Ohlsson MC, Regnell B, Wesslen A. Experimentation in Software Engineering. Springer; 2012.
84. Malik V, Sanjay R, Guha SK, et al. Semantic segmentation of legal documents via rhetorical roles. CoRR. 2021;abs/2112.01836.
85. Aumiller D, Almasian S, Lackner S, Gertz M. Structural text segmentation of legal documents. In: Maranhão J, Wyner AZ, eds. ICAIL’21:
Eighteenth International Conference for Artificial Intelligence and Law, São Paulo Brazil, June 21 - 25, 2021. ACM; 2021:2-11.
86. Wang M, Xue P, Li Y, Wu Z. Distilling the documents for relation extraction by topic segmentation. In: Lladós J, Lopresti D, Uchida S,
eds. 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021,
Proceedings, Part I. 12821 of Lecture Notes in Computer Science. Vol 2021. Springer; 2021:517-531.
87. Misra H, Yvon F, Jose J, Cappé O. Text segmentation via topic modeling: an analytical study. Paper presented at: Proceedings of the 18th
ACM Conference on Information and Knowledge Management, CIKM 2009; November 2-6. 2009; Hong Kong, China: 1553-1556.
88. Utiyama M, Isahara H. A Statistical Model for Domain-Independent Text Segmentation. ACL/EACL; 2001.
89. Sun Q, Li R, Luo D, Wu X. Text Segmentation with LDA-Based Fisher Kernel. The Association for Computer Linguistics; 2008:269-272.
90. Du L, Pate JK, Johnson M. Topic segmentation with an ordering-based topic model. In: Bonet B, Koenig S, eds. Proceedings of the
Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015. AAAI Press; 2015:2232-2238.
91. Pham V, Pham X, Tran H, et al. A deep learning approach for text segmentation in document analysis. In: Lê L, Marchese M, Dao B,
Toulouse M, Dang TK, eds. International Conference on Advanced Computing and Applications, ACOMP 2020, Quy Nhon, Vietnam,
November 25-27, 2020. IEEE; 2020:135-139.
92. Lian X, Cleland-Huang J, Zhang L. Mining associations between quality concerns and functional requirements. In: Moreira A, Araújo J,
Hayes J, Paech B, eds. 25th IEEE International Requirements Engineering Conference. RE 2017, Lisbon, Portugal, September 4-8, 2017. IEEE
Computer Society; 2017:292-301.
93. Lian X, Rahimi M, Cleland-Huang J, Zhang L, Ferrai R, Smith M. Mining Requirements Knowledge from Collections of Domain Documents.
IEEE Computer Society; 2016:156-165.
94. Slankas J, Williams L. Automated extraction of non-functional requirements in available documentation. Paper presented at: 2013 1st
International Workshop on Natural Language Analysis in Software Engineering (NaturaLiSE). 2013; San Francisco, CA:9-16.
95. Falkner AA, Palomares C, Franch X, Schenner G, Aznar P, Schoerghuber A. Identifying requirements in requests for proposal: a research
preview. In: Knauss E, Goedicke M, eds. Requirements Engineering: Foundation for Software Quality - 25th International Working Con-
ference, REFSQ 2019, Essen, Germany, March 18-21, 2019, Proceedings. 11412 of Lecture Notes in Computer Science. Vol 2019. Springer;
2019:176-182.
96. Reddivari S. Enhancing software requirements cluster labeling using Wikipedia. Paper presented at: 2019 IEEE 20th International
Conference on Information Reuse and Integration for Data Science (IRI). 2019; Los Angeles, CA:123-126.
97. Reddivari S, Chen Z, Niu N. ReCVisu: a tool for clustering-based visual exploration of requirements. Paper presented at: 2012 20th IEEE
International Requirements Engineering Conference (RE). 2012; Chicago, IL:327-328.

98. Ferrari A, Gnesi S, Tolomei G. In: Doerr J, Opdahl AL, eds. Requirements Engineering: Foundation for Software Quality. REFSQ 2013.
Lecture Notes in Computer Science. Vol 7830. Springer; 2013:34-49.
99. Jiang Z, Shi L, Chen C, Hu J, Wang Q. Dialogue disentanglement in software engineering: how far are we? In: Zhou Z, ed. Proceedings
of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021.
ijcai.org; 2021:3822-3828.

AUTHOR BIOGRAPHIES

Ziyan Zhao received his Master's degree in software engineering from Guangxi Normal University in 2018. He is now a doctoral student in the State Key Laboratory of Software Development
Environment, Beihang University, Beijing. His research interests include requirements engi-
neering and natural language processing, such as requirement incompleteness detection and
requirement generation.

Li Zhang received her Bachelor's, Master's, and PhD degrees from the School of Computer Science and Engineering, Beihang University, Beijing, China. She is now a full professor at Beihang University. Her main research area is software engineering, with specific interest in intelligent software engineering, such as software requirements elicitation and optimization, soft-
ware/system architecture recovery and model-based software engineering. She has more than
100 publications in multiple areas.

Xiao-Li Lian is an assistant research fellow at Beihang University in Beijing, China. Her research interests mainly focus on software requirements engineering, such as requirement extraction, requirement traceability, and quality validation. She is also interested in applying automated technologies, such as natural language processing, to document-related problems, such as detecting ambiguity in policy documents.

He-Yang Lv holds a Master’s degree in computer science from Beijing University of Aeronau-
tics and Astronautics. His research interests include software engineering and natural language
processing, with a particular focus on large language models. His expertise lies in exploring the
potential applications and advancements of these models in various domains.

How to cite this article: Zhao Z, Zhang L, Lian X, Lv H. DRIP: Segmenting individual requirements from
software requirement documents. Softw Pract Exper. 2023;1-33. doi: 10.1002/spe.3303
