Automatic Selection of Test Cases For Regression Testing

Cláudio Magalhães, Flávia Barros, Alexandre Mota
Centro de Informática
Recife, Brazil

Eliot Maia
Motorola Mobility
Jaguariúna, Brazil
the testing campaign as a whole. So, the duplicated TCs are prioritized in the final ordered Test Plan. The merging strategy is detailed and illustrated in Section 3.2.

The above process is repeated every time a new build is tested and presents defects; in that case, however, the process starts from Phase 2. As said, Phase 1 runs only once per new product.

Figure 1 depicts the general architecture proposed for the overall process.

3.2 The Test Selection Tool Prototype
The automated process described above was implemented in a tool prototype named AutoTestPlan. It was implemented using Django Bootstrap (http://www.djangoproject.com/), a high-level Python Web framework, bearing an MVC (Model-View-Controller) architecture.

Aiming to respect the Software Engineering principles of modularity (to provide for extensibility and easy maintenance), this prototype comprises three separate modules and one database, the MP Index File.

Figure 2: The AutoTestPlan prototype

3.2.1 Module 1 - Indexing/Search engine
This module consists of an indexing and search engine. Therefore, it implements the two phases of the automated process related to Information Retrieval tasks: Phase 1 (which creates the index file) and Phase 3 (which allows the retrieval of TCs based on keyword queries). Minimal sketches of both phases are given at the end of this subsection.

Index file creation - Phase 1 of the automated process.
The Index File is a data structure created to facilitate the retrieval of text documents based on keyword queries. In this structure, each document is indexed by the words/terms appearing in its text (such a structure is also known as an inverted index file, since the words index the documents where they appear). When a query is submitted to the index file, all documents containing the words in the query are retrieved.

The words used to index the documents constitute the index base vocabulary, which is clearly dependent upon the documents being indexed. Note, however, that not all words/terms appearing in the base are relevant for indexing documents, because some words are too frequent or too infrequent, or carry no semantic meaning (for instance, prepositions and conjunctions). Thus, the initial set of words is pre-processed to eliminate duplications and irrelevant words (that is, stopwords). The resulting set of words (the vocabulary) is then used to index the documents in the base. It is worth mentioning that the user is able to edit the stopwords list to add or remove any word/term of interest.

This module was implemented using the Apache Lucene open-source information retrieval software library (https://lucene.apache.org/). We deployed PyLucene (https://lucene.apache.org/pylucene/index.html), a Python version available from the Lucene website. PyLucene is built upon the Vector Space Information Retrieval model [9]. In this algebraic model, each document in the base is represented as a vector in a multidimensional space, where each dimension corresponds to one word/term in the vocabulary of the documents base.

In our system, the documents base corresponds to the Master Plan, a file containing Test Cases textual descriptions (each TC representing a document to be indexed). The vocabulary is obtained automatically by the indexing engine during the indexing process, and it keeps the relevant words/terms found in the TCs descriptions.

Finally, the vocabulary may still undergo a stemming process, which reduces each word to its base form. Stemming may reduce the vocabulary size, since two inflected or derived words may be mapped onto the same stem/base form (e.g., "frequently" and "infrequent" are both reduced to "frequent"). PyLucene already provides a stemmer (the Snowball algorithm), which can be activated when desired.

As said, the index file creation is executed only once per new product, since there is only one MP per product.

Index File consultation - Phase 3 of the automated process.
This module receives as input the CRs keyword representations and delivers one query per CR to be submitted to the search engine.

• Step 1 (query creation): Duplications are eliminated; however, the duplicated words are placed at the beginning of the query, so that they have more importance during the search process (see the running example).

• Step 2 (query processing): Each query submitted to the search engine is represented as a vector in the multidimensional vector space created in Phase 1. The search engine then measures the similarity between the query vector and the vectors representing each document in the space, in order to retrieve the documents most relevant to the query. Here, relevance is measured by the cosine of the angle between the vector representing the query and the vectors representing the documents (the smaller the angle between two vectors, the higher the cosine value). This strategy returns a list of documents ordered by the cosine measure. The obtained lists of TCs are given as input to Module 3, which creates the final Test Plan.

• Step 3 (stemming): This process must only be used here when it was also used in the generation of the index base; otherwise, the matching between keywords will be wrongly reduced. Stemming may shorten the list of keywords which represents a CR, since two inflected or derived words may be mapped onto the same stem/base form. On the other hand, it favors a higher number of matches between the query and the index base documents, thus retrieving a larger quantity of TCs from the index base. Therefore, it should be optional and only used when necessary.
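To make Phase 1 concrete, the sketch below indexes each TC description of the Master Plan as one Lucene document. It is a minimal illustration under stated assumptions, not AutoTestPlan's actual code: it assumes a recent PyLucene release (Lucene 5 or later), and the index path, field names, and sample Master Plan contents are hypothetical.

```python
# Minimal sketch of Phase 1 (index creation) with PyLucene.
import lucene
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import FSDirectory

lucene.initVM()  # start the embedded JVM before any Lucene call

master_plan = {  # hypothetical TC descriptions; the real MP is a company artifact
    "TC-001": "Pair the device with a Bluetooth headset and play audio",
    "TC-002": "Open the camera and record a video while on a call",
}

directory = FSDirectory.open(Paths.get("mp_index"))
config = IndexWriterConfig(StandardAnalyzer())  # tokenizes and lowercases the text
writer = IndexWriter(directory, config)

for tc_id, description in master_plan.items():
    doc = Document()
    doc.add(Field("id", tc_id, TextField.TYPE_STORED))
    doc.add(Field("description", description, TextField.TYPE_STORED))
    writer.addDocument(doc)  # each TC becomes one indexed document

writer.close()
```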
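Step 1 of Phase 3 can be pictured with a few lines of plain Python. The build_query() helper below is hypothetical and only illustrates the rule stated above: duplicated keywords are kept once but moved to the front of the query.

```python
# Minimal sketch of Step 1 (query creation): keywords occurring more than
# once in a CR's keyword representation are deduplicated but placed first,
# so they weigh more in the search. Illustrative only.
from collections import Counter

def build_query(cr_keywords):
    counts = Counter(cr_keywords)  # preserves first-occurrence order
    repeated = [w for w, n in counts.items() if n > 1]  # duplicated words first
    singles = [w for w, n in counts.items() if n == 1]
    return " ".join(repeated + singles)

# e.g. ['bluetooth', 'audio', 'bluetooth', 'headset'] -> 'bluetooth audio headset'
```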
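Continuing the sketches above, Step 2 submits the query and gets back TCs ranked by vector-space similarity, i.e., by the cosine sim(q, d) = (q . d) / (|q| |d|) between the query vector q and each document vector d. Note that recent Lucene versions rank with BM25 by default, so the sketch explicitly sets ClassicSimilarity to approximate the classic vector-space scoring; the field name and the 50-hit cutoff are again illustrative.

```python
# Minimal sketch of Phase 3 (index consultation): one query per CR is
# submitted and an ordered list of TC ids comes back. Reuses `directory`,
# StandardAnalyzer and build_query() from the previous sketches.
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.search.similarities import ClassicSimilarity

cr_keywords = ["bluetooth", "audio", "bluetooth", "headset"]  # hypothetical CR keywords

reader = DirectoryReader.open(directory)
searcher = IndexSearcher(reader)
searcher.setSimilarity(ClassicSimilarity())  # classic VSM-style scoring, not BM25

parser = QueryParser("description", StandardAnalyzer())
query = parser.parse(build_query(cr_keywords))
hits = searcher.search(query, 50).scoreDocs  # ranked by similarity score

ranked_tcs = [searcher.doc(h.doc).get("id") for h in hits]
```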
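The effect of Step 3 can be observed with any Snowball implementation. The snippet below uses NLTK's SnowballStemmer purely as a stand-in for the Snowball filter that PyLucene activates (an assumption for illustration; the prototype itself stems inside the Lucene analysis chain).

```python
# Illustration of how stemming collapses inflected CR keywords.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
keywords = ["connected", "connection", "connecting", "camera"]
stems = sorted({stemmer.stem(w) for w in keywords})
print(stems)  # the three 'connect*' variants collapse to a single stem
```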
• Step 1 (the input lists are initially merged by alternating, one by one, the TCs from each input list; see Figure 3): The idea is to prioritize the TCs best ranked by the search engine for each CR query. This way, the better-ranked TCs stay at the top of the Test Plan list, regardless of the CR which retrieved them. The resulting list is passed on to Step 2, to eliminate duplications.

• Step 2 (the merged list is then examined to treat duplications): As said before, the number of occurrences of each TC influences its final rank. TCs appearing three times are positioned at the top of the list, followed by the TCs with two occurrences and then one, maintaining the initial order (Figure 4). A minimal sketch of Steps 1 and 2 is given at the end of Section 4.1.

Figure 4: TC duplication elimination and reordering

The final merged list constitutes the Test Plan to be executed in the regression test. This strategy was validated via the experiments presented in the following section.

4. EXPERIMENTS AND THREATS TO VALIDITY
In the following sections we present the experiments performed, as well as the main threats to validity we have identified.

4.1 Performed Experiments
The implemented prototype was tested in two initial experiments, described in this section. The results were validated by the test team, which provided a very positive evaluation of the system's final output (the Test Plan).

The experiments were created with the help of the company's test architects and test team, with the following setup:

• CRs selection: the test architects selected two release notes (one for each experiment) to execute a regression campaign. The first release notes contain three CRs and the other contains 149 CRs;

• Corpus selection: the test architects manually selected 200 TCs from the general TC repository to define the Master Plan in the first experiment, and 427 TCs to define the second Master Plan.

Table 2: 2nd experiment (427 Test Cases)

In Tables 1 and 2 we present the data collected from experiments 1 and 2, respectively. Each of these tables has six columns, explained as follows:

• Architect - the test architect who performed the manual selection;

• Chosen - the TCs selected by the architect, according to the MP;

• Elected - the TCs suggested by AutoTestPlan, where the numbers X(Y) in Tables 1 and 2 mean that AutoTestPlan returned Y TCs, of which X coincided with those chosen by the test architect;

• Top - the number of TCs suggested by AutoTestPlan that appear at the top of the list and coincide with those chosen by the test architect;

• Recall - the amount of intersection between the TCs returned by AutoTestPlan and those chosen by the architect, with respect to all TCs chosen by the architect;

• Match - the Top TCs measured against the TCs chosen by the test architect.

In the first experiment (see Table 1), test architect A chose 27 TCs from the MP (a total of 200 TCs). AutoTestPlan returned 92 TCs, within which lie all 27 TCs chosen by architect A. This yielded a 100% recall. Of these 27 TCs, 20 appeared at the top of the prioritized list of TCs, resulting in a match of 74% (that is, 20/27). For architect B, we had similar percentages: 100% for recall and 72% for match. This experiment received positive acknowledgment from the test architects.

In the second experiment, the MP was bigger, composed of 427 TCs, from which architect A chose 195 TCs and architect B chose 128 TCs. AutoTestPlan returned 315 TCs, of which 148 coincided with the selection made by architect A and only 100 with that of architect B. This time the recall was not as good as in the previous experiment: 76% (148/195) for architect A and 78% (100/128) for architect B. Match was not as good either, because some of the top TCs ranked by AutoTestPlan were not selected by the test architects: 59.5% (116/195) for architect A and 33.6% (43/128) for architect B.
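For concreteness, the sketch below is a minimal reading of the Module 3 merge-and-reorder strategy referenced above (Steps 1 and 2 of Section 3.2): Step 1 interleaves the per-CR ranked lists round-robin, and Step 2 reorders by occurrence count (three, then two, then one), keeping the interleaved order stable within each group. The function name is hypothetical and this is not the tool's actual implementation.

```python
# Minimal sketch of the Module 3 merge strategy (illustrative only).
from itertools import chain, zip_longest
from collections import Counter

def merge_test_plans(ranked_lists):
    # Step 1: alternate TCs from each per-CR list, one by one
    interleaved = [tc for tc in chain.from_iterable(zip_longest(*ranked_lists))
                   if tc is not None]
    # Step 2: keep the first occurrence of each TC, then sort by descending
    # occurrence count; Python's stable sort preserves the initial order on ties
    counts = Counter(interleaved)
    seen, deduped = set(), []
    for tc in interleaved:
        if tc not in seen:
            seen.add(tc)
            deduped.append(tc)
    return sorted(deduped, key=lambda tc: -counts[tc])

# e.g. three CR queries retrieving overlapping TCs:
plan = merge_test_plans([["TC7", "TC2"], ["TC7", "TC9"], ["TC7", "TC2", "TC5"]])
# -> ['TC7', 'TC2', 'TC9', 'TC5'] (TC7 occurs 3x, TC2 2x, then the singles)
```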
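Under the column definitions above, Recall and Match reduce to simple set arithmetic over TC identifiers. A minimal sketch, reproducing the second-experiment figures reported for architect A:

```python
def recall(elected, chosen):
    """Fraction of the architect's chosen TCs that AutoTestPlan also returned."""
    return len(set(elected) & set(chosen)) / len(set(chosen))

def match(top, chosen):
    """Fraction of the architect's chosen TCs that appear in the top group."""
    return len(set(top) & set(chosen)) / len(set(chosen))

# Experiment 2, architect A: |elected ∩ chosen| = 148 and |top ∩ chosen| = 116
# out of |chosen| = 195, giving recall 148/195 = 0.76 and match 116/195 = 0.595.
```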
4.1.1 General discussion
As we can observe, experiment 1 exhibited higher recall and match percentages than experiment 2. The main reason for this is related to the number of components involved in the release notes: in experiment 1 just one component was affected, whereas in experiment 2 several components were affected. This is directly related to our merge strategy, because when we combine isolated results we lose some matching. Using a previous merge strategy, we only had 40% (against the current 59.5%) and 23% (against 33.6%) in experiment 2. We need to investigate further merge strategies to improve this match percentage.

Another point that deserves attention is that, in experiment 1, the choices made by architect B are a subset of the choices of architect A. In experiment 2, however, the situation is completely different: there is only 20% of coincidence between the choices of architects A and B. This is too low, and thus our low match cannot be criticized without a metric independent of the choices made by the architects. Currently we only get feedback from an EDA (Escaped Defect Analysis) report provided by our industrial partner. According to this report, our selections were better than those of the architects, in the sense of fewer escaped defects. But we need to investigate this point further.

4.2 Threats to validity
A first threat we identified is related to the MP's size. If an MP has fewer than 50 TCs, for example, the similarity algorithm cannot work with efficacy, and the automatic selection can result in the same 50 TCs. In practice this is not a problem, because in general an MP's size is at least 100 TCs.

Another threat is associated with the use of the Master Plan defined by the test team. By assuming such an MP, we are considering that it contains the best TCs to be selected for a particular regression campaign. But, similarly to what we observe with the architects' selections, such an MP may not be the best one. One future work is exactly to avoid the need for this MP, to see whether the results improve or remain the same.

Another concern is related to the availability of just one test team to work with. We are already collaborating with other teams to perform similar experiments. This is mainly associated with the quality of the textual parts of the test cases as well as of the CRs.

As pointed out in the previous threat, the text provided in CRs is totally informal and does not follow guidelines. Fortunately, in test cases this is not a concern, because they are written carefully and reviewed to guarantee that they are clear and well defined.

5. CONCLUSION
In this work we presented an automatic TC selection and prioritization strategy based on information retrieval applied to test cases and CRs. We developed a prototype tool, named AutoTestPlan, which implements the concepts presented in the paper. We were able to run experiments with real data obtained from our industrial partner, and we also evaluated our results with real test architects. Although the results obtained in experiment 2 were not so good, overall the proposed strategy and tool are seen positively by our industrial partner, because they considerably decrease the effort required from humans. For instance, just the automation (automatic queries for TC selection from the MP, report generation, etc.) has provided a 75% gain in the effort spent by the architects. With the current results, we can attack the remaining 25%, although we still need validation from the test architects. This means that we do not yet have a fully automatic solution, but it has decreased the test architects' labor by at least 90%.

The goal of the work reported in [14] is completely aligned with ours. In particular, its industrial setting is very similar to ours, as is its use of natural language processing. However, it deals with source code to better find the related test cases (based on features), whereas we do not have access to source code. In our case, components are our main features, restricting considerably the amount of test cases to be executed. The refinement to components comes from the keywords appearing simultaneously in the texts of the change requests and test cases. Another difference is that in [14] the authors deal with software product lines (SPL); we do not address this problem specifically, although our context is based on SPL as well.

The work reported in [6] follows a similar strategy to ours in the sense of using information retrieval and similarity analysis. However, that work uses a specific ranking function with particular weights (in the direction of a previous work of ours [1]), whereas we use the built-in functionality provided by Lucene [9].

Concerning test selection in particular, an interesting related work is reported in [13]. In that work, test selection is dealt with as test suite reduction, where several criteria are used to reduce a test suite while retaining as much as possible of its original properties in terms of source code and requirements coverage. Its approach is more general than ours: we are more closely related to what is named requirements coverage in [13]. That is, we try to create test plans that cover similar requirements (keywords) in change requests as well as in test case procedures.

Several works use mathematical models, namely transition systems, to select regression test cases, such as [3, 5]. The work reported in [3] reduces a test plan by checking dependency in EFSM structures, whereas [5] reduces by applying a similarity algorithm on transition systems. Although both use some kind of similarity algorithm, as we do, they rely on a formal notation. The main difference to our work is exactly that we do not use any mathematical model, except the similarity algorithm for natural language provided by the Lucene tool. We think our similarity criterion is more convenient because it is guided by change requests, whereas in the work reported in [5] it is tied to the mathematical model itself; thus that reduction simply discards similar test cases. In our case, we discard those test cases not related to the current change requests.

The closest work to ours is the one reported in [4]. Like ours, the work in [4] performs a similarity check based on test cases and change descriptions (or requests). But differently from ours, source code is used as well. In our case only informal documents are used, which complicates the similarity analysis considerably. We intend to consider source code in the future, but this is not done currently.

As future work we intend to perform further experiments, focusing on other features of the Motorola smartphones. We also intend to make a continuous analysis of the results obtained with our tool against the architects' selections, in order to test a statistical hypothesis for completely replacing the manual selection with the automatic one, improving the daily test process of Motorola Mobility.

From the experiments, we need to try other variants of the merge strategy to see whether we can obtain better match results.

Finally, we are trying to obtain access to at least the source code related to the CRs. With this we can try to calculate some kind of coverage, to see whether our selections are indeed better or not when compared to the architects'. Currently we only have information from EDA reports.

Acknowledgments.
We would like to thank Alice Arashiro, Viviana Toledo and Lucas Heredia from Motorola Mobility, and Virginia Viana. This research is supported by Motorola Mobility.

6. REFERENCES
[1] Cláudio Magalhães, Alexandre Mota, and Eliot Maia. Automatically finding hidden industrial criteria used in test selection. In 28th International Conference on Software Engineering and Knowledge Engineering, SEKE'16, San Francisco, USA, pages 1–4, 2016.
[2] G. Canfora and L. Cerulo. Impact analysis by mining software and change request repositories. In 11th IEEE International Software Metrics Symposium (METRICS'05), page 29. IEEE, 2005.
[3] Cu D. Nguyen, Alessandro Marchetto, and Paolo Tonella. Model based regression test reduction using dependence analysis. In Proceedings of the International IEEE Conference on Software Maintenance, pages 214–223. IEEE, 2002.
[4] Cu D. Nguyen, Alessandro Marchetto, and Paolo Tonella. Test case prioritization for audit testing of evolving web services using information retrieval techniques. In Web Services (ICWS), 2011 IEEE International Conference on, pages 636–643. IEEE, 2011.
[5] Francisco Gomes de Oliveira Neto and Patrícia Duarte de Lima Machado. Seleção automática de casos de teste de regressão baseada em similaridade e valores. In Revista de Informática Teórica e Aplicada: RITA, 20(2), pages 139–154, 2013.
[6] Manisha Khattar, Yash Lamba, and Ashish Sureka. Sarathi: Characterization study on regression bugs and identification of regression bug inducing changes: A case-study on Google Chromium project. In Proceedings of the 8th India Software Engineering Conference, pages 50–59. ACM, 2015.
[7] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[8] Michael McCandless, Erik Hatcher, and Otis Gospodnetic. Lucene in Action: Covers Apache Lucene 3.0. Manning Publications Co., 2010.
[9] Michael McCandless, Erik Hatcher, and Otis Gospodnetic. Lucene in Action: Covers Apache Lucene 3.0. Manning Publications Co., 2010.
[10] Gregg Rothermel and Mary Jean Harrold. Analyzing regression test selection techniques. IEEE Trans. Softw. Eng., 22(8):529–551, August 1996.
[11] Gregg Rothermel and Mary Jean Harrold. A safe, efficient regression test selection technique. ACM Trans. Softw. Eng. Methodol., 6(2):173–210, April 1997.
[12] Ripon K. Saha, Lingming Zhang, Sarfraz Khurshid, and Dewayne E. Perry. An information retrieval approach for regression test prioritization based on program changes. In Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE '15, pages 268–279. IEEE Press, 2015.
[13] August Shi, Alex Gyori, Milos Gligoric, Andrey Zaytsev, and Darko Marinov. Balancing trade-offs in test-suite reduction. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 246–256. ACM, 2014.
[14] Michael Unterkalmsteiner, Tony Gorschek, Robert Feldt, and Niklas Lavesson. Large-scale information retrieval in software engineering - an experience report from industrial application. Empirical Software Engineering, pages 1–42, 2015.