Fundamental Approaches
to Software Engineering
26th International Conference, FASE 2023
Held as Part of the European Joint Conferences
on Theory and Practice of Software, ETAPS 2023
Paris, France, April 22–27, 2023
Proceedings
Lecture Notes in Computer Science 13991
Founding Editors
Gerhard Goos, Germany
Juris Hartmanis, USA
Editors
Leen Lambers
Brandenburg University of Technology Cottbus-Senftenberg
Cottbus, Germany

Sebastián Uchitel
CONICET/University of Buenos Aires
Buenos Aires, Argentina
Imperial College London
London, UK
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
ETAPS Foreword
Welcome to the 26th ETAPS! ETAPS 2023 took place in Paris, the beautiful capital of
France. ETAPS 2023 was the 26th instance of the European Joint Conferences on
Theory and Practice of Software. ETAPS is an annual federated conference established
in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each
conference has its own Program Committee (PC) and its own Steering Committee
(SC). The conferences cover various aspects of software systems, ranging from theo-
retical computer science to foundations of programming languages, analysis tools, and
formal approaches to software engineering. Organising these conferences in a coherent,
highly synchronized conference programme enables researchers to participate in an
exciting event, having the possibility to meet many colleagues working in different
directions in the field, and to easily attend talks of different conferences. On the
weekend before the main conference, numerous satellite workshops took place that
attracted many researchers from all over the globe.
ETAPS 2023 received 361 submissions in total, 124 of which were accepted,
yielding an overall acceptance rate of 34.3%. I thank all the authors for their interest in
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con-
tributions, and in particular the PC (co-)chairs for their hard work in running this entire
intensive process. Last but not least, my congratulations to all authors of the accepted
papers!
ETAPS 2023 featured the unifying invited speakers Véronique Cortier (CNRS,
LORIA laboratory, France) and Thomas A. Henzinger (Institute of Science and
Technology, Austria) and the conference-specific invited speakers Mooly Sagiv (Tel
Aviv University, Israel) for ESOP and Sven Apel (Saarland University, Germany) for
FASE. Invited tutorials were provided by Ana-Lucia Varbanescu (University of
Twente and University of Amsterdam, The Netherlands) on heterogeneous computing
and Joost-Pieter Katoen (RWTH Aachen, Germany and University of Twente, The
Netherlands) on probabilistic programming.
As part of the programme we had the second edition of TOOLympics, an event to
celebrate the achievements of the various competitions or comparative evaluations in
the field of ETAPS.
ETAPS 2023 was organized jointly by Sorbonne Université and Université
Sorbonne Paris Nord. Sorbonne Université (SU) is a multidisciplinary,
research-intensive, world-class academic institution. It was created in 2018 as the
merger of two first-class research-intensive universities, UPMC (Université Pierre
et Marie Curie) and Paris-Sorbonne. SU has three faculties: humanities, medicine,
and science. It comprises 55,600 students (4,700 PhD students; 10,200 international
students), 6,400 teachers and professor-researchers, and 3,600 administrative and
technical staff members. Université
Sorbonne Paris Nord is one of the thirteen universities that succeeded the University of
Paris in 1968. It is a major teaching and research center located in the north of Paris. It
has five campuses, spread over the two departments of Seine-Saint-Denis and Val
d’Oise: Villetaneuse, Bobigny, Saint-Denis, the Plaine Saint-Denis and Argenteuil. The
university has more than 25,000 students in different fields, such as health, medicine,
languages, humanities, and science. The local organization team consisted of Fabrice
Kordon (general co-chair), Laure Petrucci (general co-chair), Benedikt Bollig (work-
shops), Stefan Haar (workshops), Étienne André (proceedings and tutorials), Céline
Ghibaudo (sponsoring), Denis Poitrenaud (web), Stefan Schwoon (web), Benoît Barbot
(publicity), Nathalie Sznajder (publicity), Anne-Marie Reytier (communication),
Hélène Pétridis (finance) and Véronique Criart (finance).
ETAPS 2023 is further supported by the following associations and societies:
ETAPS e.V., EATCS (European Association for Theoretical Computer Science),
EAPLS (European Association for Programming Languages and Systems), EASST
(European Association of Software Science and Technology), Lip6 (Laboratoire
d'Informatique de Paris 6), LIPN (Laboratoire d'informatique de Paris Nord), Sorbonne
Université, Université Sorbonne Paris Nord, CNRS (Centre national de la recherche
scientifique), CEA (Commissariat à l'énergie atomique et aux énergies alternatives),
LMF (Laboratoire méthodes formelles), and Inria (Institut national de recherche en
informatique et en automatique).
The ETAPS Steering Committee consists of an Executive Board, and representa-
tives of the individual ETAPS conferences, as well as representatives of EATCS,
EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saar-
brücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König
(Duisburg), Thomas Noll (Aachen), Caterina Urban (Inria), Jan Křetínský (Munich),
and Lenore Zuck (Chicago).
Other members of the steering committee are: Dirk Beyer (Munich), Luís Caires
(Lisboa), Ana Cavalcanti (York), Bernd Finkbeiner (Saarland), Reiko Heckel
(Leicester), Joost-Pieter Katoen (Aachen and Twente), Naoki Kobayashi (Tokyo),
Fabrice Kordon (Paris), Laura Kovács (Vienna), Orna Kupferman (Jerusalem), Leen
Lambers (Cottbus), Tiziana Margaria (Limerick), Andrzej Murawski (Oxford), Laure
Petrucci (Paris), Elizabeth Polgreen (Edinburgh), Peter Ryan (Luxembourg), Sriram
Sankaranarayanan (Boulder), Don Sannella (Edinburgh), Natasha Sharygina (Lugano),
Pawel Sobocinski (Tallinn), Sebastián Uchitel (London and Buenos Aires), Andrzej
Wasowski (Copenhagen), Stephanie Weirich (Pennsylvania), Thomas Wies (New
York), Anton Wijs (Eindhoven), and James Worrell (Oxford).
I would like to take this opportunity to thank all authors, keynote speakers, atten-
dees, organizers of the satellite workshops, and Springer-Verlag GmbH for their
support. I hope you all enjoyed ETAPS 2023.
Finally, a big thanks to Laure and Fabrice and their local organization team for all
their enormous efforts to make ETAPS a fantastic event.
Preface

This book contains the proceedings of FASE 2023, the 26th International Conference
on Fundamental Approaches to Software Engineering, held in Paris, France, in April
2023, as part of the annual European Joint Conferences on Theory and Practice of
Software (ETAPS 2023).
FASE is concerned with the foundations on which software engineering is built. We
solicited four categories of papers: research, empirical, new ideas and emerging results,
and tool demonstrations. All of them should make novel contributions to making
software engineering a more mature and soundly based discipline.
The contributions accepted for presentation at the conference were carefully selected
by means of a thorough double-blind review process, which included no fewer than
three reviews per paper. We received 50 submissions, which after a nine-week
reviewing period and intensive discussion resulted in 16 accepted papers, representing
a 32% acceptance rate.
We also ran an artifact track where authors of accepted papers optionally submitted
artifacts described in their papers for evaluation. 10 artifacts were submitted for
evaluation, 8 of which were successfully evaluated.
In addition, FASE 2023 hosted the 5th International Competition on Software
Testing (Test-Comp 2023), which is an annual comparative evaluation of automatic
tools for test generation. A total of 13 tools participated this year, from seven countries.
The tools were developed in academia and in industry. The submitted tools and the
submitted system-description papers were reviewed by a separate program committee:
the Test-Comp jury. Each tool and paper was assessed by at least three reviewers.
These proceedings contain the competition report and one selected system description
of a participating tool. Two sessions in the FASE program were reserved for the
presentation of the results: the summary by the Test-Comp chair and of the partici-
pating tools by the developer teams in the first session, and the community meeting in
the second session.
We thank the ETAPS 2023 general chair, Marieke Huisman, the ETAPS 2023
organizers, Fabrice Kordon and Laure Petrucci, as well as the FASE SC chair, Andrzej
Wasowski, for their support during the whole process. We thank our invited speaker,
Sven Apel, for his keynote. We thank all the authors for their hard work and
willingness to contribute. We thank all the Program Committee members and external
reviewers, who invested time and effort in the selection process to ensure the scientific
quality of the program. Last but not least, we thank the Test-Comp chair Dirk Beyer,
the artifact evaluation committee chairs, Marie-Christine Jakobs and Carlos Diego
Nascimento Damasceno, and their evaluation committees.
FASE—Steering Committee
Einar Broch Johnsen University of Oslo, Norway
Reiner Hähnle Technische Universität Darmstadt, Germany
Reiko Heckel University of Leicester, UK
Leen Lambers BTU Cottbus-Senftenberg, Germany
Tiziana Margaria University of Limerick, Ireland
Perdita Stevens University of Edinburgh, UK
Gabriele Taentzer Philipps-Universität Marburg, Germany
Sebastián Uchitel University of Buenos Aires, Argentina
and Imperial College London, UK
Heike Wehrheim Carl von Ossietzky Universität Oldenburg, Germany
Manuel Wimmer JKU Linz, Austria
FASE—Program Committee
Erika Abraham RWTH Aachen University, Germany
Rui Abreu University of Porto, Portugal
Domenico Bianculli University of Luxembourg, Luxembourg
Ana Cavalcanti University of York, UK
Stijn De Gouw The Open University, The Netherlands
Antinisca Di Marco University of L’Aquila, Italy
Sigrid Eldh Mälardalen University and Ericsson, Sweden
Carlo A. Furia USI Università della Svizzera Italiana, Switzerland
Alessandra Gorla IMDEA Software Institute, Spain
Einar Broch Johnsen University of Oslo, Norway
Axel Legay UCLouvain, Belgium
Lina Marsso University of Toronto, Canada
Marjan Mernik University of Maribor, Slovenia
Fabrizio Pastore University of Luxembourg, Luxembourg
Leila Ribeiro Universidade Federal do Rio Grande do Sul, Brazil
Gwen Salaün University of Grenoble Alpes, France
FASE—Additional Reviewers
Sven Apel
ACoRe: Automated Goal-Conflict Resolution
Abstract. System goals are the statements that, in the context of software
requirements specification, capture how the software should behave. Often,
the stakeholders' understanding of what the system should do, as captured
in the goals, can lead to various problems, from clearly contradictory goals
to more subtle situations in which the satisfaction of some goals inhibits the
satisfaction of others. These latter
issues, called goal divergences, are the subject of goal conflict analysis,
which consists of identifying, assessing, and resolving divergences, as part
of a more general activity known as goal refinement.
While there exist techniques that, when requirements are expressed for-
mally, can automatically identify and assess goal conflicts, there is cur-
rently no automated approach to support engineers in resolving identi-
fied divergences. In this paper, we present ACoRe, the first approach
that automatically proposes potential resolutions to goal conflicts, in
requirements specifications formally captured using linear-time tempo-
ral logic. ACoRe systematically explores syntactic modifications of the
conflicting specifications, aiming at obtaining resolutions that disable
previously identified conflicts, while preserving specification consistency.
ACoRe integrates modern multi-objective search algorithms (in par-
ticular, NSGA-III, WBGA, and AMOSA) to produce resolutions that
maintain coherence with the original conflicting specification, by search-
ing for specifications that are either syntactically or semantically similar
to the original specification.
We assess ACoRe on 25 requirements specifications taken from the lit-
erature. We show that ACoRe can successfully produce various conflict
resolutions for each of the analyzed case studies, including resolutions
that resemble specification repairs manually provided as part of conflict
analyses.
1 Introduction
Many software defects that emerge during software development originate from
an incorrect understanding of what the software being developed should do [24].
These kinds of defects are known to be among the most costly to fix, and thus it
is widely acknowledged that software development methodologies must involve
phases that deal with the elicitation, understanding, and precise specification
of software requirements. Among the various approaches to systematize this re-
quirements phase, the so-called goal-oriented requirements engineering (GORE)
methodologies [13,55] provide techniques that organize the modeling and analysis
of software requirements around the notion of system goal. Goals are prescrip-
tive statements that capture how the software to be developed should behave,
and in GORE methodologies are subject to various activities, including goal
decomposition, refinement, and the assignment of goals [3,13,15,39,55,56].
The characterization of requirements as formally specified system goals en-
ables tasks that can reveal flaws in the requirements. Formally specified goals
allow for the analysis and identification of goal divergences, situations in which
the satisfaction of some goals inhibits the satisfaction of others [9,16]. These di-
vergences arise as a consequence of goal conflicts. A conflict is a condition whose
satisfaction makes the goals inconsistent. Conflicts are dealt with through goal-
conflict analysis [58], which comprises three main stages: (i) the identification
stage, which involves the identification of conflicts between goals; (ii) the assess-
ment stage, aiming at evaluating and prioritizing the identified conflicts accord-
ing to their likelihood and severity; and (iii) the resolution stage, where conflicts
are resolved by providing appropriate countermeasures and, consequently, trans-
forming the goal model, guided by the criticality level.
Goal conflict analysis has been the subject of different automated tech-
niques to assist engineers, especially in the conflict identification and assessment
phases [16,18,43,56]. However, no automated technique has been proposed for
dealing with goal conflict resolution. In this paper, we present ACoRe, the first
automated approach that deals with the goal-conflict resolution stage. ACoRe
takes as input a set of goals formally expressed in Linear-Time Temporal Logic
(LTL) [45], together with previously identified conflicts, also given as LTL for-
mulas. It then searches for candidate resolutions, i.e., syntactic modifications to
the goals that remain consistent with each other, while disabling the identified
conflicts. More precisely, ACoRe employs modern search-based algorithms to
efficiently explore syntactic variants of the goals, guided by a syntactic and se-
mantic similarity with the original goals, as well as with the inhibition of the
identified conflicts. This search guidance is implemented as (multi-objective) fit-
ness functions, using Levenshtein edit distance [42] for syntactic similarity, and
approximated LTL model counting [8] for semantic similarity. ACoRe exploits
this fitness function to search for candidate resolutions, using various alternative
search algorithms, namely a Weight-Based Genetic Algorithm (WBGA) [29], a
Non-dominated Sorted Genetic Algorithm (NSGA-III) [14], an Archived Multi-
Objective Simulated Annealing search (AMOSA) [6], and an unguided search
approach, mainly used as a baseline in our experimental evaluations.
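To make the search guidance concrete, the following minimal sketch assembles such an objective vector in Python. It is an illustrative sketch, not ACoRe's implementation: the is_sat and count_models callbacks are hypothetical stand-ins for an LTL satisfiability solver and the approximate model counter of [8].

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def syntactic_similarity(goals, candidate):
    """1.0 for identical formula texts, approaching 0.0 as edits grow."""
    d = levenshtein(goals, candidate)
    return 1.0 - d / max(len(goals), len(candidate), 1)

def objectives(dom, goals, candidate, bc, is_sat, count_models):
    """Objective vector guiding the search (all to be maximized)."""
    consistency = 1.0 if is_sat(dom + [candidate]) else 0.0
    # The conflict is disabled if BC no longer makes the refined
    # specification inconsistent.
    resolves = 1.0 if is_sat(dom + [candidate, bc]) else 0.0
    syntactic = syntactic_similarity(" ".join(goals), candidate)
    # Semantic similarity approximated via (relative) model counts.
    semantic = count_models(dom + goals + [candidate]) / max(
        count_models(dom + goals), 1)
    return (consistency, resolves, syntactic, semantic)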
Our experimental evaluation considers 25 requirements specifications taken
from the literature, for which goal conflicts are automatically computed [16].
The results show that ACoRe is able to successfully produce various conflict
ACoRe: Automated Goal-Conflict Resolution 5
resolutions for each of the analysed case studies, including resolutions that re-
semble specification repairs manually provided as part of conflict analyses. In
this assessment, we measured their similarity with respect to the ground truth, i.e.,
to the manually written repairs, when available. The genetic algorithms are able
to resemble 3 out of 8 repairs in the ground truth. Moreover, the results show
that ACoRe generates more non-dominated resolutions (i.e., resolutions whose
fitness values are not subsumed by those of other repairs in the output set) when
adopting genetic algorithms (NSGA-III or WBGA) than when adopting AMOSA
or unguided search, favoring genetic multi-objective search over the other approaches.
We also consider other typical connectives and operators, such as ∧, □ (al-
ways), ◇ (eventually) and W (weak-until), which are defined in terms of the
basic ones. That is, φ ∧ ψ ≡ ¬(¬φ ∨ ¬ψ), ◇φ ≡ true U φ, □φ ≡ ¬◇¬φ, and
φ W ψ ≡ (□φ) ∨ (φ U ψ).
The model counting problem consists of calculating the number of models that
satisfy a formula. Since the models of LTL formulas are infinite traces, it is often
the case that the analysis is restricted to a class of canonical finite representations of
infinite traces, such as lasso traces or tree models. Notably, this is the case in
bounded model checking, for instance [7].
For example, the LTL formula □(p ∨ q) is satisfiable, and one satisfying lasso
trace is σ1 = {p}; {p, q}ω , where in the first state p holds, and from the second
state on both p and q hold forever. Notice that the base of the lasso trace σ1
is the sequence {p}; {p, q} containing both states, while the loop part is the
single state {p, q}.
Since existing approaches for computing the exact number of lasso traces
are ineffective [25], Brizzio et al. [8] recently developed a novel model counting
approach that approximates the number (of prefixes) of lasso traces satisfying
an LTL formula. Intuitively, instead of counting the number of lasso traces of
length k, the approach of Brizzio et al. [8] aims at approximating the number of
bases of length k corresponding to some satisfying lasso trace.
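As a toy illustration of the objects being counted, the following brute-force sketch enumerates all bases of length k over AP = {p, q} and counts those that extend to a lasso satisfying the invariant □(p ∨ q); the formula is hard-coded, and this exhaustive enumeration merely illustrates what the approximate counter of [8] estimates.

from itertools import product

STATES = [frozenset(), frozenset({"p"}), frozenset({"q"}),
          frozenset({"p", "q"})]

def state_ok(s):
    """State-level check of the hard-coded invariant p or q."""
    return "p" in s or "q" in s

def extends_to_satisfying_lasso(base):
    """A lasso loops back to some position of the base, so its loop only
    repeats base states; hence, for an invariant, the lasso satisfies the
    formula iff every base state does."""
    return all(state_ok(s) for s in base)

def count_bases(k):
    """Count the length-k bases that are prefixes of satisfying lassos."""
    return sum(extends_to_satisfying_lasso(b)
               for b in product(STATES, repeat=k))

print(count_bases(2))  # 9: each position can be {p}, {q}, or {p, q}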
Example 1 (Mine Pump Controller - MPC). Consider the Mine Pump Con-
troller (MPC) widely used in related works that deal with formal requirements
and reactive systems [16,35]. The MPC describes a system that is in charge of
activating or deactivating a pump (p) to remove the water from the mine, in the
presence of possible dangerous scenarios. The MP controller monitors environ-
mental magnitudes related to the presence of methane (m) and the high level of
water (h) in the mine. Maintaining a high level of water for a while may produce
flooding in the mine, while the methane may cause an explosion when the pump
is switched on. Hence, the specification for the MPC is as follows:
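A plausible rendering of this specification, in the style of formulations of this example used in prior work [16,40] (the exact placement of the ○ (next) operators may differ), is:

Dom = □((p ∧ ○p) → ○○¬h)
G1 = □(m → ○¬p)
G2 = □(h → ○p)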
Domain property Dom describes the impact on the environment of switching
on the pump (p). For instance, when the pump is kept on for 2 time units,
the water will decrease and the level will no longer be high (¬h). Goal G1 ex-
presses that the pump should be off when methane is detected in the mine. Goal
G2 indicates that the pump should be on when the level of water is high.
Notice that this specification is consistent, for instance, in cases in which
the level of water never exceeds the high threshold. However, approaches for
goal-conflict identification, such as the one of Degiovanni et al. [16], can detect
a conflict between goals in this specification.
The identified goal-conflict describes a divergence situation in cases in which
the level of water is high and methane is present in the environment at the same
time. Switching off the pump to satisfy G1 will result in a violation of goal
G2 , while switching on the pump to satisfy G2 will violate G1 . This divergence
situation clearly evidences a conflict between goals G1 and G2 that is captured
by a boundary condition such as BC = ◇(h ∧ m).
In the work of Letier [40], two resolutions were manually proposed that
precisely describe what the software behaviour should be in cases where the
divergence situation is reached. The first resolution proposes to refine goal G2 ,
by weakening it, requiring to switch on the pump only when the level of water
is high and no methane is present in the environment.
Example 2 (Resolution 1 - MPC).
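Following the prose description, the weakened goal can plausibly be written as (the exact formulation in [40] may differ):

G2′ = □((h ∧ ¬m) → ○p)

with Dom and G1 left unchanged.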
The resolution stage aims at removing the identified goal-conflicts from the
specification, for which it is necessary to modify the current specification formu-
lation. This may require weakening or strengthening the existing goals, or even
removing some and adding new ones.
Definition 8 (Goal-Conflict Resolution). Let G = {G1 , . . . , Gn }, Dom, and
BC be the set of goals, the domain properties, and an identified boundary con-
dition, respectively, written in LTL. Let M : S1 × S2 → [0, 1] be a similarity
metric and ε ∈ [0, 1] a similarity threshold. A set of refined goals R is a resolution
for the conflict captured by BC iff: (1) Dom ∧ R is consistent; (2) Dom ∧ R ∧ BC
is consistent; and (3) M (G, R) ≥ ε.
Intuitively, the first condition states that the refined goals in R remain consistent
within the domain properties Dom. The second condition states that BC does
not lead to a divergence situation in the resolution R (i.e., refined goals in R
know exactly how to deal with the situations captured by BC). Finally, the last
condition aims at using a similarity metric M to control for the degree of changes
applied to the original formulation of goals in G to produce the refined goals in
resolution R.
Notice that the similarity metric M is general enough to capture similarities
between G and R of different natures. For instance, M (G, R) may compute the
syntactic similarity between the text representations of the original specification
of goals in G and the candidate resolution R, where minimizing the number of
tokens edited from G to R is the aim. On the other hand, M (G, R) may compute
a semantic similarity between G and R, for instance, to favour resolutions that
weaken the goals (i.e., G → R), strengthen the goals (i.e., R → G), or maintain
most of the original behaviours (i.e., #G − #R < ε).
Precisely, ACoRe will explore syntactic modifications of goals from G, leading
to newly refined goals in R, with the aim of producing candidate resolutions
that are consistent with the domain properties Dom and resolve conflict BC.
Assuming that the engineer is competent and the current specification is very
close to the intended one [19,1], ACoRe will integrate two similarity metrics in a
multi-objective search process to produce resolutions that are syntactically and
semantically similar to the original specification. Particularly, ACoRe can gen-
erate exactly the same resolutions for the MPC previously discussed, manually
developed by Letier et al. [40].
[Fig. 2: Overview of ACoRe. The inputs are the initial specification S = (Dom, G)
and the identified goal-conflicts BC1 , . . . , BCk ; an initial population is created and
evolved through the evolution operators (mutation and, only in the GAs, crossover);
candidates are evaluated against the fitness objectives (e.g., f1: consistency, checked
via SAT solving) until the stop criterion is met, yielding resolutions Rn = (Dom, Gn ).]

Each individual cR = (Dom, G′ ), representing a candidate resolution, is an LTL
specification over a set AP of propositional variables, where Dom captures the
domain properties and G′ the refined system goals. When the search starts, ACoRe
creates one or more individuals (depending on the search approach) by applying
syntactic modifications to the original set of goals G, obtaining new sets of refined
goals G′ that potentially resolve the identified conflicts. ACoRe guides the search
with four objectives; among other things, it evaluates whether the refined goals G′
are consistent with the domain properties by using SAT solving. Below, we first
describe some common components of each approach (such as the fitness function
and selection criteria), and then the three multi-objective search algorithms and
the unguided search approach we use as a baseline.
The semantic similarity objective takes values between 0 and 1: as this value gets
closer to 1, both specifications characterize an increasingly large number of
common behaviors.

New individuals are generated through the application of the evolution operators.
The mutation operator applies a syntactic modification to the goals of a single
individual. The crossover operator, on the contrary, takes two individuals
cR1 = (Dom, G1 ) and cR2 = (Dom, G2 ), and produces a new candidate resolution
cR′′ = (Dom, G′′ ): it selects one goal from each individual, i.e., G1 ∈ G1 and
G2 ∈ G2 , and generates a new goal G′′ that combines both; please refer to the
complementary material for a detailed formal definition.

[Fig. 3: Crossover operator.]
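WBGA. ACoRe implements a Weight-Based Genetic Algorithm (WBGA) [29], which aggregates the four objectives f1 , . . . , f4 into a single scalar fitness, presumably as the weighted sum

f (cR) = α · f1 (cR) + β · f2 (cR) + γ · f3 (cR) + δ · f4 (cR),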
where weights α = 0.1, β = 0.7, γ = 0.1, and δ = 0.1 are defined by default
(empirically validated), but these can be configured to other values if desired. In
each iteration, WBGA sorts all the individuals according to their fitness value
(in descending order) and selects the best-ranked individuals to survive to the next
iteration (other selectors can be integrated). Finally, WBGA reports all the
resolutions found during the search.
NSGA-III. ACoRe also implements the Non-Dominated Sorting Genetic Al-
gorithm III (NSGA-III) [14] approach. It is a variant of a genetic algorithm that
also uses mutation and crossover operators to evolve the population. In each
iteration, it computes the fitness values for each individual and sorts the pop-
ulation according to the Pareto dominance relation. Then it creates a partition
of the population according to the level of the individuals in the Pareto dominance
relation (i.e., non-dominated individuals are in Level-1, Level-2 contains the in-
dividuals dominated only by individuals in Level-1, and so on). Thus, NSGA-III
selects only one individual per non-dominated level with the aim of diversifying
the exploration and reducing the number of resolutions in the final Pareto-front.
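The level partitioning described above can be sketched compactly; the snippet below is illustrative only (the actual NSGA-III additionally relies on reference points during selection):

def dominates(a, b):
    """a dominates b: no worse in every objective (here, higher is
    better) and strictly better in at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated_levels(population):
    """Partition fitness vectors into Level-1 (non-dominated), Level-2
    (dominated only by Level-1), and so on."""
    remaining, levels = list(population), []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p)
                            for q in remaining if q is not p)]
        levels.append(front)
        remaining = [p for p in remaining if p not in front]
    return levels

print(non_dominated_levels([(0.9, 0.2), (0.5, 0.5), (0.4, 0.4)]))
# [[(0.9, 0.2), (0.5, 0.5)], [(0.4, 0.4)]]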
Unguided Search. ACoRe also implements an Unguided Search algorithm that does not
use any of the objectives to guide the search. It randomly selects individuals
and applies the mutation operator to evolve the population. After generating a
maximum number of individuals (a given parameter of the algorithm), it checks
which ones constitute a valid resolution for the goal-conflicts given as input.
5 Experimental Evaluation
RQ2 How well can ACoRe generate resolutions that match resolutions provided
by engineers (i.e., manually developed ones)?
To answer RQ2, we check if ACoRe can generate resolutions that are equiv-
alent to the ones manually developed by the engineer.
Finally, we are interested in analyzing and comparing the performance of the
four search algorithms integrated by ACoRe. Thus, we ask:
RQ3 What is the performance of ACoRe when adopting different search algo-
rithms?
6 Experimental Results
Figures 4 and 5 show the boxplots for each quality indicator. NSGA-III ob-
tains on average much better HV and IGD than the rest of the algorithms.
Precisely, it obtains an average HV of 0.66 (higher is better) and an average
IGD of 0.34 (lower is better), outperforming the other algorithms.
To confirm this result, we compare the quality indicators by means of non-
parametric statistical tests: (i) the Kruskal–Wallis test by ranks and (ii) the Mann-
Whitney U-test. The significance level α is 0.05 for the Kruskal–Wallis test by
ranks and 0.0125 for the Mann–Whitney U-test. Moreover, we also complete our
assessment by using Vargha and Delaney’s Â12 , a non-parametric effect size
measurement. Table 4 summarises the results when we compare pair-wise each
one of the approaches. We can observe that in nearly 80% of the cases NSGA-III
obtains resolutions with better quality indicators than AMOSA and Unguided
search (and the differences are statistically significant). We can also observe that
NSGA-III obtains higher HV (IGD) than WBGA in 66% (65%) of the cases.
From Table 4 we can also observe that WBGA outperforms both AMOSA and
unguided search. Moreover, we can observe that AMOSA is the worst-performing
algorithm according to the considered quality indicators.
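The following sketch shows how such a statistical pipeline can be run with SciPy; the HV samples are placeholders, and the Â12 implementation follows the standard definition.

from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Placeholder HV values per algorithm (one value per run).
hv = {
    "NSGA-III": [0.70, 0.66, 0.61, 0.72],
    "WBGA":     [0.55, 0.60, 0.52, 0.58],
    "AMOSA":    [0.30, 0.41, 0.35, 0.28],
    "Unguided": [0.40, 0.38, 0.45, 0.33],
}

def a12(xs, ys):
    """Vargha-Delaney effect size: probability that a value drawn from
    xs exceeds one drawn from ys (ties count half)."""
    wins = sum(x > y for x in xs for y in ys)
    ties = sum(x == y for x in xs for y in ys)
    return (wins + 0.5 * ties) / (len(xs) * len(ys))

# Omnibus comparison across all four algorithms (alpha = 0.05).
print("Kruskal-Wallis p =", kruskal(*hv.values()).pvalue)

# Pairwise post-hoc comparisons at the corrected alpha = 0.0125.
for a, b in combinations(hv, 2):
    p = mannwhitneyu(hv[a], hv[b], alternative="two-sided").pvalue
    print(f"{a} vs {b}: p = {p:.4f}, A12 = {a12(hv[a], hv[b]):.2f}")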
Overall, both statistical tests evidence that NSGA-III leads to a set of reso-
lutions with better quality indicators (HV and IGD) than the rest of the al-
gorithms. WBGA comes second, outperforming the unguided search and
AMOSA, while AMOSA shows the lowest performance based on the quality
indicators, even worse than the unguided search in several cases.
7 Related Work
Several manual approaches have been proposed to identify inconsistencies be-
tween goals and resolve them once the requirements were specified. Among them,
Murukannaiah et al. [49] compare a genuine analysis of competing hypotheses
against modified procedures that incorporate the requirements engineer's thought
process. The empirical evaluation shows that the modified version achieves higher
completeness and coverage. Despite the increase in quality, the approach is limited
to manual application by engineers, as are previous approaches [56].
Various informal and semi-formal approaches [28,32,33], as well as more for-
mal approaches [21,23,26,30,51,53], have been proposed for detecting logically
inconsistent requirements.
8 Conclusion
In this paper, we presented ACoRe, the first automated approach for goal-
conflict resolution. Overall, ACoRe takes a goal specification and a set of
previously identified conflicts, expressed in LTL, and computes a set of reso-
lutions that remove such conflicts. ACoRe is a search-based approach: to
implement and assess it, we adopted three multi-objective algorithms (NSGA-III,
AMOSA, and WBGA) that simultaneously optimize and deal with the trade-offs
among the objectives. We evaluated ACoRe on 25 specifications that were writ-
among the objectives. We evaluated ACoRe in 25 specifications that were writ-
ten in LTL and extracted from the related literature. The evaluation showed
that the genetic algorithms (NSGA-III and WBGA) typically generate more
(non-dominated) resolutions than AMOSA and an Unguided Search we im-
plemented as a baseline in our evaluation. Moreover, the algorithms generate
on average between 1 and 8 resolutions per specification, which may allow the
engineer to manually inspect and select the most appropriate resolutions. We
also observed that the genetic algorithms (NSGA-III and WBGA) outperform
AMOSA and Unguided Search in terms of several quality indicators: number of
(non-dominated) resolutions and standard quality indicators (HV and IGD) for
multi-objective algorithms.
References
1. Allen Troy Acree, Timothy Alan Budd, Richard A. DeMillo, Richard J. Lipton,
and Frederick Gerald Sayward. Mutation analysis. Technical Report GIT-ICS-79/08,
Georgia Institute of Technology, Atlanta, Georgia, 1979.
2. Dalal Alrajeh, Antoine Cailliau, and Axel van Lamsweerde. Adapting require-
ments models to varying environments. In Proceedings of the 42nd International
Conference on Software Engineering, ICSE 2020, Seoul, South Korea, May 23-29,
2020, 2020.
3. Dalal Alrajeh, Jeff Kramer, Alessandra Russo, and Sebastián Uchitel. Learning
operational requirements from goal models. In Proceedings of the 31st International
Conference on Software Engineering, ICSE ’09, pages 265–275, Washington, DC,
USA, 2009. IEEE Computer Society.
4. Rajeev Alur, Salar Moarref, and Ufuk Topcu. Counter-strategy guided refinement
of GR(1) temporal logic specifications. In Formal Methods in Computer-Aided
Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, pages 26–33,
2013.
5. Andrea Arcuri and Lionel Briand. A practical guide for using statistical tests to
assess randomized algorithms in software engineering. In Proceedings of the 33rd
International Conference on Software Engineering, ICSE ’11, page 1–10, New York,
NY, USA, 2011. Association for Computing Machinery.
6. Sanghamitra Bandyopadhyay, Sriparna Saha, Ujjwal Maulik, and Kalyanmoy Deb.
A simulated annealing-based multiobjective optimization algorithm: AMOSA.
IEEE Trans. Evol. Comput., 12(3):269–283, 2008.
7. Armin Biere, Alessandro Cimatti, Edmund M. Clarke, and Yunshan Zhu. Symbolic
model checking without BDDs. In Proceedings of the 5th International Conference
on Tools and Algorithms for Construction and Analysis of Systems, TACAS ’99,
pages 193–207, London, UK, UK, 1999. Springer-Verlag.
8. Matías Brizzio, Renzo Degiovanni, Maxime Cordy, Mike Papadakis, and Nazareno
Aguirre. Automated repair of unrealisable LTL specifications guided by model
counting. CoRR, abs/2105.12595, 2021.
9. Antoine Cailliau and Axel van Lamsweerde. Handling knowledge uncertainty in
risk-based requirements engineering. In 23rd IEEE International Requirements En-
gineering Conference, RE 2015, Ottawa, ON, Canada, August 24-28, 2015, pages
106–115, 2015.
10. Davide G. Cavezza and Dalal Alrajeh. Interpolation-based GR(1) assumptions
refinement. CoRR, abs/1611.07803, 2016.
11. Krishnendu Chatterjee, Thomas A. Henzinger, and Barbara Jobstmann. Envi-
ronment assumptions for synthesis. In Franck van Breugel and Marsha Chechik,
editors, CONCUR 2008 - Concurrency Theory, pages 147–161, Berlin, Heidelberg,
2008. Springer Berlin Heidelberg.
12. Carlos A. Coello Coello and Margarita Reyes Sierra. A study of the parallelization
of a coevolutionary multi-objective evolutionary algorithm. In Raúl Monroy, Gus-
tavo Arroyo-Figueroa, Luis Enrique Sucar, and Humberto Sossa, editors, MICAI
2004: Advances in Artificial Intelligence, pages 688–697, Berlin, Heidelberg, 2004.
Springer Berlin Heidelberg.
13. Anne Dardenne, Axel van Lamsweerde, and Stephen Fickas. Goal-directed require-
ments acquisition. Science of Computer Programming, 20(1–2):3–50, 1993.
14. Kalyanmoy Deb and Himanshu Jain. An evolutionary many-objective optimiza-
tion algorithm using reference-point-based nondominated sorting approach, part
i: Solving problems with box constraints. IEEE Transactions on Evolutionary
Computation, 18(4):577–601, 2014.
15. Renzo Degiovanni, Dalal Alrajeh, Nazareno Aguirre, and Sebastián Uchitel. Au-
tomated goal operationalisation based on interpolation and sat solving. In ICSE,
pages 129–139, 2014.
16. Renzo Degiovanni, Pablo F. Castro, Marcelo Arroyo, Marcelo Ruiz, Nazareno
Aguirre, and Marcelo F. Frias. Goal-conflict likelihood assessment based on model
counting. In Proceedings of the 40th International Conference on Software Engi-
neering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pages 1125–
1135, 2018.
17. Renzo Degiovanni, Facundo Molina, Germán Regis, and Nazareno Aguirre. A
genetic algorithm for goal-conflict identification. In Proceedings of the 33rd
ACM/IEEE International Conference on Automated Software Engineering, ASE
2018, Montpellier, France, September 3-7, 2018, pages 520–531, 2018.
18. Renzo Degiovanni, Nicolás Ricci, Dalal Alrajeh, Pablo F. Castro, and Nazareno
Aguirre. Goal-conflict detection based on temporal satisfiability checking. In Pro-
ceedings of the 31st IEEE/ACM International Conference on Automated Software
Engineering, ASE 2016, Singapore, September 3-7, 2016, pages 507–518, 2016.
19. Richard A. DeMillo, Richard J. Lipton, and Frederick G. Sayward. Hints on test
data selection: Help for the practicing programmer. IEEE Computer, 11(4):34–41,
1978.
20. Nicolás D'Ippolito, Víctor A. Braberman, Nir Piterman, and Sebastián Uchitel.
Synthesizing nonanomalous event-based controllers for liveness goals. ACM Trans.
Softw. Eng. Methodol., 22(1):9, 2013.
21. Christian Ellen, Sven Sieverding, and Hardi Hungar. Detecting consistencies and
inconsistencies of pattern-based functional requirements. In Proc. of the 19th
Intl. Conf. on Formal Methods for Industrial Critical Systems, pages 155–169, 2014.
22. E. Allen Emerson and Edmund M. Clarke. Using branching time temporal logic to
synthesize synchronization skeletons. Sci. Comput. Program., 2(3):241–266, 1982.
23. Neil A. Ernst, Alexander Borgida, John Mylopoulos, and Ivan J. Jureta. Ag-
ile requirements evolution via paraconsistent reasoning. In Proc. of the 24th
Intl. Conf. on Advanced Information Systems Engineering, pages 382–397, 2012.
24. Daniel Méndez Fernández, Stefan Wagner, Marcos Kalinowski, Michael Felderer,
Priscilla Mafra, Antonio Vetro, Tayana Conte, M.-T. Christiansson, Des Greer,
Casper Lassenius, Tomi Männistö, M. Nayabi, Markku Oivo, Birgit Penzenstadler,
Dietmar Pfahl, Rafael Prikladnicki, Günther Ruhe, André Schekelmann, Sagar
Sen, Rodrigo O. Spínola, Ahmet Tuzcu, Jose Luis de la Vara, and Roel Wieringa.
Naming the pain in requirements engineering - contemporary problems, causes,
and effects in practice. Empirical Software Engineering, 22(5):2298–2338, 2017.
25. Bernd Finkbeiner and Hazem Torfah. Counting models of linear-time temporal
logic. In Adrian Horia Dediu, Carlos Martín-Vide, José Luis Sierra-Rodríguez,
and Bianca Truthe, editors, Language and Automata Theory and Applications -
8th International Conference, LATA 2014, Madrid, Spain, March 10-14, 2014.
Proceedings, volume 8370 of Lecture Notes in Computer Science, pages 360–371.
Springer, 2014.
26. David Harel, Hillel Kugler, and Amir Pnueli. Synthesis revisited: Generating stat-
echart models from scenario-based requirements. In Formal Methods in Software
and Systems Modeling: Essays Dedicated to Hartmut Ehrig on the Occasion of His
60th Birthday, pages 309–324, 2005.
27. Mark Harman, S. Afshin Mansouri, and Yuanyuan Zhang. Search-based software
engineering: Trends, techniques and applications. ACM Comput. Surv., 45(1):11:1–
11:61, December 2012.
28. J.H. Hausmann, R. Heckel, and G. Taentzer. Detection of conflicting functional
requirements in a use case-driven approach. In ICSE, pages 105–115, 2002.
29. John H. Holland. Adaptation in Natural and Artificial Systems: An Introductory
Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT
Press, 1992.
30. Anthony Hunter and Bashar Nuseibeh. Managing inconsistent specifications: Rea-
soning, analysis, and action. ACM TOSEM, 7(4):335–367, 1998.
31. Daniel Jackson. Software Abstractions - Logic, Language, and Analysis. MIT Press,
2006.
32. M. Kamalrudin. Automated software tool support for checking the inconsistency
of requirements. In ASE, pages 693–697, 2009.
33. Massila Kamalrudin, John Hosking, and John Grundy. Improving requirements
quality using essential use case interaction patterns. In ICSE, pages 531–540,
2011.
34. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated anneal-
ing. Science, 220(4598):671–680, 1983.
35. J. Kramer, J. Magee, M. Sloman, and A. Lister. CONIC: an integrated approach
to distributed computer control systems. Computers and Digital Techniques, IEE
Proceedings E, 130(1):1+, 1983.
36. Jan Křetínský, Tobias Meggendorfer, and Salomon Sickert. Owl: A library for
ω-words, automata, and LTL. In Shuvendu K. Lahiri and Chao Wang, editors,
Automated Technology for Verification and Analysis - 16th International Sympo-
sium, ATVA 2018, Los Angeles, CA, USA, October 7-10, 2018, Proceedings, vol-
ume 11138 of Lecture Notes in Computer Science, pages 543–550. Springer, 2018.
37. William H Kruskal and W Allen Wallis. Use of ranks in one-criterion variance
analysis. Journal of the American Statistical Association, 47(260):583–621, 1952.
38. Maciej Laszczyk and Pawel B. Myszkowski. Survey of quality measures for multi-
objective optimization: Construction of complementary set of multi-objective qual-
ity measures. Swarm and Evolutionary Computation, 48:109–133, 2019.
39. Emanuel Letier. Goal-oriented elaboration of requirements for a safety injection
control system. Technical report, Université catholique de Louvain, 2002.
40. Emmanuel Letier. Reasoning about Agents in Goal-Oriented Requirements Engi-
neering. PhD thesis, Université catholique de Louvain, 2001.
41. Jianwen Li, Geguang Pu, Lijun Zhang, Yinbo Yao, Moshe Y. Vardi, and Jifeng
He. Polsat: A portfolio LTL satisfiability solver. CoRR, abs/1311.1602, 2013.
42. Miqing Li and Xin Yao. Quality evaluation of solution sets in multiobjective
optimisation: A survey. ACM Comput. Surv., 52(2), mar 2019.
43. Weilin Luo, Hai Wan, Xiaotong Song, Binhao Yang, Hongzhen Zhong, and
Yin Chen. How to identify boundary conditions with contrasty metric? In
43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021,
Madrid, Spain, 22-30 May 2021, pages 1473–1484. IEEE, 2021.
44. H. B. Mann and D. R. Whitney. On a Test of Whether one of Two Random
Variables is Stochastically Larger than the Other. The Annals of Mathematical
Statistics, 18(1):50 – 60, 1947.
45. Zohar Manna and Amir Pnueli. The Temporal Logic of Reactive and Concurrent
Systems. Springer-Verlag New York, Inc., New York, NY, USA, 1992.
46. Zohar Manna and Pierre Wolper. Synthesis of communicating processes from
temporal logic specifications. ACM Trans. Program. Lang. Syst., 6(1):68–93, 1984.
47. Shahar Maoz and Jan Oliver Ringert. On well-separation of GR(1) specifications.
In Proceedings of the 24th ACM SIGSOFT International Symposium on Founda-
tions of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18,
2016, pages 362–372, 2016.
48. Shahar Maoz, Jan Oliver Ringert, and Rafi Shalom. Symbolic repairs for GR(1)
specifications. In Proceedings of the 41st International Conference on Software
Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pages 1016–
1026, 2019.
49. P.K. Murukannaiah, A.K. Kalia, P.R. Telangy, and M.P. Singh. Resolving goal
conflicts via argumentation-based analysis of competing hypotheses. In Proc. 23rd
IEEE Int. Requirements Engineering Conf., pages 156–165, 2015.
50. Antonio J. Nebro, Juan J. Durillo, and Matthieu Vergne. Redesigning the jMetal
multi-objective optimization framework. In Proceedings of the Companion Publi-
cation of the 2015 Annual Conference on Genetic and Evolutionary Computation,
GECCO Companion ’15, page 1093–1100, New York, NY, USA, 2015. Association
for Computing Machinery.
51. Tuong Huan Nguyen, Bao Quoc Vo, Markus Lumpe, and John Grundy. KBRE: a
framework for knowledge-based requirements engineering. Software Quality Jour-
nal, 22(1):87–119, 2013.
52. A. Pnueli and R. Rosner. On the synthesis of a reactive module. In Proceedings
of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, POPL ’89, pages 179–190, New York, NY, USA, 1989. ACM.
53. George Spanoudakis and Anthony Finkelstein. Reconciling requirements: a method
for managing interference, inconsistency and conflict. Annals of Software Engineer-
ing, 3(1):433–457, 1997.
54. Ryoji Tanabe and Hisao Ishibuchi. An analysis of quality indicators using approxi-
mated optimal distributions in a 3-d objective space. IEEE Trans. Evol. Comput.,
24(5):853–867, 2020.
55. Axel van Lamsweerde. Requirements Engineering - From System Goals to UML
Models to Software Specifications. Wiley, 2009.
56. Axel van Lamsweerde, Robert Darimont, and Emmanuel Letier. Managing conflicts
in goal-driven requirements engineering. IEEE Trans. Software Eng., 24(11):908–
926, 1998.
57. Axel van Lamsweerde and Emmanuel Letier. Integrating obstacles in goal-driven
requirements engineering. In Proceedings of the 20th International Conference on
Software Engineering, ICSE ’98, pages 53–62, Washington, DC, USA, 1998. IEEE
Computer Society.
58. Axel van Lamsweerde and Emmanuel Letier. Handling obstacles in goal-oriented
requirements engineering. IEEE Trans. Softw. Eng., 26(10):978–1005, October
2000.
59. András Vargha and Harold D. Delaney. A critique and improvement of the "CL"
common language effect size statistics of McGraw and Wong. Journal of Educational
and Behavioral Statistics, 25(2):101–132, 2000.
60. Kaiyuan Wang, Allison Sullivan, and Sarfraz Khurshid. Arepair: A repair frame-
work for alloy. In 2019 IEEE/ACM 41st International Conference on Software
Engineering: Companion Proceedings (ICSE-Companion), pages 103–106, 2019.
61. Jiahui Wu, Paolo Arcaini, Tao Yue, Shaukat Ali, and Huihui Zhang. On the
preferences of quality indicators for multi-objective search algorithms in search-
based software engineering. Empirical Softw. Engg., 27(6), nov 2022.
62. Eckart Zitzler, Lothar Thiele, Marco Laumanns, Carlos M. Fonseca, and Vi-
viane Grunert da Fonseca. Performance assessment of multiobjective optimizers:
An analysis and review. IEEE Transactions on Evolutionary Computation, 7:117–
132, 2003.
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
A Modeling Concept for Formal Verification of
OS-Based Compositional Software
Availability of Artifacts
All Uppaal models and queries are available at https://doi.org/10.6084/
m9.figshare.21809403. Throughout the paper, model details are omitted for
the sake of readability or due to space constraints. In such cases, the symbol 6
indicates that details can be found in the provided artifacts.
1 Introduction
Embedded systems are everywhere, from simple consumer electronics (wearables,
home automation, etc.) to complex safety-critical devices, e.g., in the automotive,
aerospace, medical, and nuclear domains. While bugs in non-critical devices are
at most inconvenient, errors in safety-critical systems can lead to catastrophic
consequences. [...] requirements are specified. Still, the provided models and
requirements are sufficient to demonstrate the proposed concept.
demonstrate the proposed concept.
The remainder of this paper is organized as follows: in Section 2 we present
relevant concepts for our proposed approach. In Section 3 we describe our ap-
proach to model the software layers modularly. In Section 4 we introduce abstract
tasks and discuss the verification of OS requirements. In Section 5, we analyze
and evaluate the proposed concept. In Section 6 we present related work. Finally,
Section 7 summarizes this paper and outlines potential future work.
2 Background
2.1 Real-Time Operating System (RTOS)
Complex OSes quickly lead to state explosion when model-checking. Therefore,
we focus on a small common set of features of modern RTOSes that enables real-
time behavior, namely preemptive multitasking, priority-driven scheduling, task
synchronization, resource management, and time management. Priority inheri-
tance protocols are not addressed in this paper, because they are not necessary
to demonstrate our proposed concept. However, they can be integrated by mod-
ifying the related syscalls.
Tasks are the basic execution unit of RTOS-based software. They run in user
mode and have fewer privileges than the kernel, which runs in kernel mode. Tasks
have individual priorities, execute concurrently, and interact with the OS via
syscalls. Tasks can be in one of the four states shown in Table 1. Specific imple-
mentations might not contain all states. For example, in this paper we model
tasks as infinite loops, which never terminate. Thus, they have no suspended
state. RTOSes commonly contain an idle task, which runs when no other task
is in the ready state.
The Kernel is responsible for providing services to tasks and for interacting
with the hardware. It initializes the system on startup and switches between tasks
at runtime. Kernel execution can be triggered by tasks or interrupts through a
fixed interface only.
Syscalls and Interrupt Service Routines (ISRs) are special functions that
are exclusively provided by the kernel and define its interface. While user mode
software can only interact with the OS through syscalls, ISRs can only be trig-
gered by the hardware. The modeled syscalls and ISR are covered in Section 3.
Time Management is an important feature of RTOSes. The kernel (1) main-
tains an internal timeline to which all tasks can relate, and (2) allows tasks to
specify timing requirements.
2.2 Uppaal
For modeling and verification, we choose the model-checking tool Uppaal [23],
in which systems are formalized as a network of timed automata with addi-
tional functions and data structures that are executed and changed on edges.
Since we model preemptive tasks, we use Uppaal 4.1, which supports stopwatch
automata [10] and enables the elegant modeling of preemption. While a formal
definition of timed automata is provided in [7], we still describe the features
relevant for this work. Examples in this section refer to Fig. 1.
Timed automata are composed of (labeled) locations and edges. In Uppaal,
timed automata are specified with the concept of templates, which are similar
to classes in object-oriented programming. For the verification, the templates
are instantiated into processes (analogous to objects). All instantiated processes
execute concurrently in a Uppaal model. However, they can still be modeled in
a fashion that executes them sequentially, which we adopted in our models.
Locations. Standard locations are represented by a circle (L2_NAME). The ini-
tial location (L1_NAME) is represented by a double circle. Committed locations
(L3_NAME) have a letter “C” within the circle, and they are used to connect
multi-step atomic operations. Different from standard locations, time does not
pass while any automaton is in a committed location. Locations can have names
and invariants. Location names can be used in requirement specifications, and
ease the readability of automata. A location invariant (e.g., _clk<100) is an ex-
pression that must hold while the automaton is in that corresponding location.
Edges connect locations in a directional manner. Edge transitions are instan-
taneous, i.e., they introduce zero time overhead. Edges can have a select state-
ment (selectVar : Range), a guard (_guard()), a synchronization (_synch!),
and an update operation (_update()). A select statement non-deterministically
chooses a value from a range of options and assigns it to a variable. A guard
controls whether or not its edge is enabled. An update operation is a sequence of
assignments and function calls that is executed when the edge is taken.
[Figure: software layers. The application runs in user mode and interacts with the operating system (kernel mode) through the kernel interface, i.e., syscalls and interrupts.]
The kernel interface must offer all possibilities to switch from user to kernel
mode, modeled with communication channels. Triggering such channels from
automata in the application layer represents a syscall in the real code.
Fig. 4 depicts our modeled kernel interface. A context switch (_kernelEntry!)
occurs either upon syscalls, if the parameters are valid (valid), or upon a timer
interrupt (_timerInt). Supporting more interrupts (or syscalls) can be achieved
by adding their corresponding automata, and respective edges into the kernel
interface.
Kernel Execution and Kernel Overhead. Our modeling approach can pre-
cisely reflect the runtime overhead introduced in a preemptive system by the OS
kernel itself. This allows a more accurate verification of the behavior of embed-
ded systems compared to approaches that abstract away the OS layer. While
different types of OS overhead can be modeled, we initially focus on timing.
Therefore, the kernel interface in Fig. 4 triggers a separate automaton for
the kernel timing (execute[KERNEL]!), as shown in Fig. 3. The execution time
interval [bcet, wcet] contains the time required to enter the kernel, process the
invoked syscall or ISR, execute further kernel functions (e.g., the scheduler), and
exit the kernel. This concentrated timing computation is possible because the
kernel executes atomically (in contrast to the preemptive tasks).
Next, after taking kernel timing into consideration (execDone[KERNEL]?),
we trigger the automata for the functional part of the actual syscall or ISR.
The variable sid in _syscall[sid]! is updated along the syscall edges and
identifies the ID of the invoked syscall. The same approach can be used for
modeling multiple interrupts.
The OS model must contain the internal data structures as well as the Uppaal
templates for the scheduler and for all syscalls. For this paper, we created the
OS model based on the SmartOS [28] implementation.
Data Structures and Tight Bounds. We must declare all OS variables and
arrays with data types of the tightest possible boundaries, according to the
system parameters. Listing 1.2 shows a few examples from our OS model.
A beneficial consequence is a strict verification that does not tolerate any
value out of range. In such cases, the verification immediately fails and aborts.
// 1 - System Parameters
const int NTASKS, NEVENTS, NRESOURCES, MMGR;
// 2 - Type Definitions
typedef struct {
    int[0, NTASKS] qCtr;        // the number of tasks in the ready queue
    ExtTaskId_t readyQ[NTASKS]; // the ready queue containing all tasks
                                // in ready state, sorted by priority
} SCB_t;                        // Scheduler Control Block
typedef int[0, NTASKS - 1] TaskId_t;
// 3 - Declaration of Control Blocks
TCB_t _TCB[NTASKS];     // Task CBs
RCB_t _Res[NRESOURCES]; // Resource CBs
SCB_t _sched;           // Scheduler CB
Fig. 7. Modeled task slices: (a) Task Execution, (b) Task Timeout, (c) Task Real-Time, (d) Task States.
Task Execution Time. This task slice represents the user-space execution
time of (code blocks within) a task. It abstracts away the code functionality,
but allows the modeling of a [bcet, wcet] range. While the specification of the
range itself is shown in Section 3.4, the helper template is shown in Fig. 7(a).
Its structure is similar to the kernel execution time template in Fig. 3. However,
we cannot assure that the execution of code in user mode is atomic, and must
therefore consider preemption: If a _kernelEntry! occurs while a task is in the
Executing location, it goes to Preempted, where the task execution is paused,
i.e., the execution time clock et is paused (et'==0).
Task Timeout. This task slice is responsible for handling timeouts of syscalls
(e.g., sleep), and thus it must trigger timer interrupts. Our version is depicted
in Fig. 7(b) (see footnote 4). The clock c is used to keep track of elapsed time. The loca-
tion Waiting can be left in two different ways: either the timeout expires (edge
with c==timeout), or the task receives the requested resource/event (edge with
_schedule?) before the timeout. If c==timeout, a timer interrupt is generated
(_timerInt!) if the system is not in kernel mode. Otherwise, we directly proceed
to the next location, where we wait for a signal from the scheduler ( _wakeNext?)
indicating that the task can be scheduled again. Finally, we instruct the scheduler
to insert the current task into the ready queue with _schedule!.
4 In our model, all syscalls with a timeout internally use _sleep[id]. Other approaches might require multiple outgoing edges from the initial state.
The use of task slices is an extensible modeling concept: Extra task slices can
be added to enable the verification of other (non-)functional requirements, e.g.,
energy/memory consumption.
The OS model, kernel interface, and task slices are designed with a common
goal: Simplify the modeling of application tasks and make the overall system
verification more efficient. With our concept, task models just need to use the
provided interfaces (channels) and pass the desired parameters.
In summary, a task can be modeled with three simple patterns, as exemplified
in Fig. 8:
(1) syscalls: invocation by triggering the corresponding channel, then waiting for
dispatch[id]? (from the scheduler),
(2) execution of regular user code between execute[id]! and execDone[id]?
(from the Task Execution Time task slice),
(3) specification of real-time blocks between startRealTime! and endRealTime!.
As an example, Fig. 8 models the task source code from Listing 1.3 as a
Uppaal task. The variables p1 and p2 are used to pass data between different
processes, e.g., for syscall parameters.
For (1) and (2), the use of the guard amIRunning(id) is crucial for the correct
behavior of the task. It allows a task to proceed only if it is Running. The
absence of this guard would allow any task to execute, regardless of priorities or
task states.
For (3), this guard is not necessary when starting or ending real-time blocks,
though. If a task reaches the beginning of a real-time block, the response time
5 In our approach, an Uppaal deadlock indicates a modeling mistake.
4.2 OS Requirements
The OS requirements refer to OS properties that must always hold (invariants),
regardless of the number of tasks in the system or of how these tasks interact
with the OS (or with each other through the OS). As described in Section 3.3,
the OS model is composed of data structures and multiple Uppaal templates,
which must be consistent at all times (general requirement). For example, if a
task is in the Waiting location in the task timeout task slice, it must also be in
the Waiting location in the task states task slice. In Uppaal, we can verify this
requirement with the query:
A[] forall (Tasks) TaskTimeout.Waiting imply TaskStates.Waiting
This example shows an important point when extending our concept: When-
ever new task slices are added to verify other (non-)functional requirements of
the application, additional OS requirements must be specified to verify the con-
sistency of the new task slice with pre-existing parts of the OS model.
For a given software stack (i.e., OS and application), we can prove correctness w.r.t. the
OS and composition requirements by verifying all associated queries. However,
we cannot yet claim that the OS model is correct in general (i.e., independent
from the task composition), because we do not know if all possible OS operations
were considered in all possible scenarios during the verification. Therefore, a
complete re-verification of both layers is required in case the application changes.
To avoid the repeated and resource-expensive re-verification of the OS re-
quirements for each task set, we must prove that the OS model is correct in
general. We can then limit the re-verification to the application layer. To achieve
this goal, we need to make sure that all possible OS operations are verified in
all possible scenarios and execution orders. One possible strategy is to create
different task sets to reach different scenarios, similar to test case generation.
However, this strategy requires the prior identification of relevant scenarios, and
the creation of the corresponding task sets. Additionally, it is hard to guarantee
that all scenarios were indeed identified. Therefore, we introduce a new concept
that inherently covers all scenarios: abstract tasks. They unite all possible be-
haviors of concrete tasks, i.e., they can trigger any action at any time. A task set
with N abstract tasks thus represents the behavior of all possible task sets with
N (concrete) tasks. Hence, by definition, all possible scenarios will be reached
by Uppaal's exhaustive state-space exploration.
Abstract Tasks. Real tasks, as exemplified in Listing 1.3, are strictly sequential.
Thus, a (concrete) task model is a predefined sequence of steps, as discussed
in Section 3.4, and shown in Fig. 8. Their key characteristic is that only one
outgoing edge is enabled in any location at any point in time.
The abstract task is depicted in Fig. 9. Unlike a concrete task, it has multiple
outgoing edges enabled, which open all possible options to progress: (1) syscalls
with valid parameters and (2) user code execution (execute[id]!). Thus, the
behavior of any concrete task can also be achieved with the abstract task.
While different actions are performed by taking different edges, the param-
eters are non-deterministically chosen in the select statements for each syscall.
The Uppaal state space exploration mechanisms guarantee that all values of
the select statements are considered for each edge.
Select statements are not necessary for the timing parameters EX_TIME and
SL_TIME. Fixed values have less impact on the state space, and are enough to
fire all edges from the task execution and task timeout (Fig. 7(a) and Fig. 7(b),
respectively). We define the timing parameters in a way that all edges are
eventually fired and the state space remains small enough for a feasible verifica-
tion.
A single set of abstract tasks provides a reliable way of verifying scenarios that
could otherwise only be reached with numerous concrete task sets. To fully verify
the OS model, we must compose the abstract task set so that it triggers all OS
operations in all possible scenarios (covering all corner cases).
Within our model, we can control four system parameters that affect the OS
verification: NTASKS, NEVENTS, NRESOURCES, and MMGR (see footnote 7), cf. Listing 1.2. We use
a short notation to represent the system configuration. For example, 5-3-4-2
represents a configuration with NTASKS = 5 (idle task + 4 others), NEVENTS = 3,
NRESOURCES = 4, and MMGR = 2. The goal is to find the minimal configuration
that reaches all possible scenarios, and thus allows the complete verification of
the OS model with minimal verification effort.
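To make the shorthand concrete, the following TypeScript sketch expands a configuration string into the four system parameters of Listing 1.2. This is a hypothetical helper for illustration only, not part of the authors' tooling.

interface SystemConfig {
  NTASKS: number;     // number of tasks, including the idle task
  NEVENTS: number;    // number of events
  NRESOURCES: number; // number of resources
  MMGR: number;       // upper limit of the resource counter
}

// Expand a shorthand such as "5-3-4-2" into named system parameters.
function parseConfig(shorthand: string): SystemConfig {
  const parts = shorthand.split("-").map(Number);
  if (parts.length !== 4 || parts.some(Number.isNaN)) {
    throw new Error(`invalid configuration: ${shorthand}`);
  }
  const [NTASKS, NEVENTS, NRESOURCES, MMGR] = parts;
  return { NTASKS, NEVENTS, NRESOURCES, MMGR };
}

// "5-3-4-2": NTASKS = 5 (idle task + 4 others), NEVENTS = 3,
// NRESOURCES = 4, MMGR = 2.
console.log(parseConfig("5-3-4-2"));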
6 Unless the OS offers guarantees by design, e.g., if it implements the Highest Locker Protocol (HLP), task deadlock scenarios must not be reachable.
7 Maximum multiple getResource, i.e., the upper limit of the resource counter.
Model Coverage. In order to cover the whole model, the verification must
traverse all edges, and entirely cover the C-like code of update operations.
Edge Coverage. If there is at least one edge in the model that is not traversed
during verification, the model is surely not fully verified; unreachable edges could
also indicate design flaws in the model. Therefore, the first step of the verification
addresses the edge coverage. We add Boolean markers on strategic edges, which
are set to true when the corresponding edge is taken. We then verify whether
all markers are eventually set:
E<> forall (i : int [0, NEDGES-1]) edge[i]==true
For the update operations, we ensure that an edge is traversed with all valid
parameter values via select statements.
The functions demand a more careful analysis. It is necessary to identify all
corner cases, and verify their reachability. For example, to verify the corner
cases of a list insertion, we can use the following queries:
E<> InsertLocation and firstPosInsertion
E<> InsertLocation and lastPosInsertion
E<> InsertLocation and intermediatePosInsertion
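For intuition, the sketch below shows in TypeScript the kind of priority-ordered insertion whose corner cases these queries target; the function and its three branches are our own illustration, not the Uppaal code of the OS model.

// Insert a priority into a descending ready queue; the three branches
// correspond to first-, last-, and intermediate-position insertion.
function insertByPriority(readyQ: number[], prio: number): number[] {
  if (readyQ.length === 0 || prio > readyQ[0]) {
    return [prio, ...readyQ];                  // first-position insertion
  }
  if (prio <= readyQ[readyQ.length - 1]) {
    return [...readyQ, prio];                  // last-position insertion
  }
  const i = readyQ.findIndex(p => prio > p);   // intermediate position
  return [...readyQ.slice(0, i), prio, ...readyQ.slice(i)];
}

console.log(insertByPriority([9, 5, 2], 7)); // [9, 7, 5, 2]

A verification that reaches all three branches, for all valid parameter values, provides exactly the coverage the queries above check for.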
The deadlock scenario reveals that 4-1-1-2 is not sufficient to reach all
composition scenarios, since at least two resources are required to cause it. For
the modeled OS features, all composition scenarios are reachable with 4-1-2-2.
Configuration    State space     A[] true            OS Requirements
                                 Time    Memory      Time      Memory
C-(51-50-2-2)        574,266      1.5       640       76.0        738
4-1-2-2            5,369,534      0.6       470       30.7        470
4-1-2-4           14,963,367      2.5     1,787      124.3      1,788
5-1-2-2           85,077,164     13.2     6,655      644.2      6,656
4-3-2-2          116,606,955     14.7    10,189      689.0     10,189
4-1-4-2          570,284,574     75.8    47,800    3,774.6     47,800
The data structures also play an important role, because they are stored to and read from memory during the
verification. In our OS model, NTASKS, NEVENTS, and NRESOURCES contribute to
the state size, since bigger values increase the size/amount of data structures.
6 Related Work
it using the Z/EVES theorem prover [26], but, unlike our approach, they do not
address timing, resource sharing, or interrupts.
On a less abstract level, closer to the real implementation, seL4 [20] proves the
functional correctness of the C code of the kernel. Furthermore, it guarantees that
the binary code correctly reflects the semantics of the C code. Hyperkernel [27]
formally verifies the functional correctness of syscalls, exceptions and interrupts.
The verification is performed at the LLVM intermediate representation level [32]
using the Z3 SMT solver [9]. CertiKOS [16] is the first work that formally verifies a
concurrent OS kernel. They use the Coq proof assistant [2], a C-like programming
language, and a verified compiler [15]. These approaches focus exclusively on the
functional correctness of the OS kernel.
We have not found a work that can verify timing, resource sharing, task
synchronization, and interrupts in a compositional context. That is what our
work enables, after proving the correctness of the OS model.
7 Conclusions and Future Work
In this paper, we presented a Uppaal modeling approach for verifying com-
positional software, exemplified with an OS model containing a common set of
features present in modern RTOSes. Since the proposed techniques and patterns
are general, they can be used to model any concrete OS. We showed how to
model the OS aiming to simplify the modeling of application tasks (Section 3).
We also introduced separate OS requirements and composition requirements, and
showed how they can be formally specified (Section 4) to decouple the verification
of the OS and the application layer. We then proposed the concept of abstract
tasks (Section 4.3) and reasoned that the OS model can be fully verified with
a minimal set of such tasks, which interact through OS primitives (e.g., events
and shared resources) and thus trigger all OS functions in all possible scenarios
(Section 4.4). Finally, we evaluated the resource consumption of the verification
process, reasoned about the sufficiency of the used minimal configuration, and
analyzed the benefits of the proposed concept (Section 5).
With the OS model proven correct, there is no need to re-verify it when the
upper layers are modified, which saves time and resources on the verification of
concrete task sets. We consider this as particularly beneficial for developing and
maintaining highly dependable systems, where, e.g., the task composition and
functionality may change during updates. Another benefit of our approach is the
potential use for test case generation for the application software.
This work opens a variety of directions for future work. We are currently working
on task slices to verify further (non-)functional requirements. In addition, we
continuously improve the model design for a better trade-off between abstraction
level and verification overhead, including the avoidance of potential state space
explosions. Tools to convert between source code and Uppaal templates shall
reduce the modeling gap, i.e., the discrepancy between the formal model and the
actual implementation. While our models allow the verification of applications
on top of an OS, a limitation is that model correctness does not yet mean im-
plementation correctness. For that, the full path from models to machine code
must be verified.
References
1. FreeRTOS. https://freertos.org/. [Online; accessed 20-January-2023].
2. The Coq proof assistant. https://coq.inria.fr/. [Online; accessed 20-January-
2023].
3. Jean-Raymond Abrial. Modeling in Event-B: system and software engineering.
Cambridge University Press, 2010.
4. Benny Akesson, Mitra Nasri, Geoffrey Nelissen, Sebastian Altmeyer, and Robert I
Davis. A comprehensive survey of industry practice in real-time systems. Real-
Time Systems, 2021.
5. Eman H Alkhammash et al. Modeling guidelines of FreeRTOS in Event-B. In
Shaping the Future of ICT. CRC Press, 2017.
6. Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K. Rus-
tan M. Leino. Boogie: A Modular Reusable Verifier for Object-Oriented Programs.
In Formal Methods for Components and Objects, Berlin, Heidelberg, 2006. Springer
Berlin Heidelberg.
7. Gerd Behrmann, Alexandre David, and Kim G Larsen. A tutorial on Uppaal.
Formal methods for the design of real-time systems, 2004.
8. Aimee Borda, Liliana Pasquale, Vasileios Koutavas, and Bashar Nuseibeh. Compo-
sitional Verification of Self-Adaptive Cyber-Physical Systems. In 2018 IEEE/ACM
13th Int’l Symposium on Software Engineering for Adaptive and Self-Managing
Systems (SEAMS), 2018.
9. Robert Brummayer and Armin Biere. Boolector: An Efficient SMT Solver for Bit-
Vectors and Arrays. In Tools and Algorithms for the Construction and Analysis of
Systems, Berlin, Heidelberg, 2009.
10. Franck Cassez and Kim Larsen. The impressive power of stopwatches. In Interna-
tional Conference on Concurrency Theory. Springer, 2000.
11. Shu Cheng, Jim Woodcock, and Deepak D’Souza. Using formal reasoning on a
model of tasks for FreeRTOS. Formal Aspects of Computing, 27(1), 2015.
12. Holger Giese et al. Towards the Compositional Verification of Real-Time UML
Designs. In 9th European Software Engineering Conference Held Jointly with 11th
ACM SIGSOFT Int’l Symposium on Foundations of Software Engineering, New
York, NY, USA, 2003.
13. Mario Gleirscher, Simon Foster, and Jim Woodcock. New Opportunities for Inte-
grated Formal Methods. ACM Comput. Surv., 52(6), oct 2019.
14. Tomás Grimm, Djones Lettnin, and Michael Hübner. A Survey on Formal Verifi-
cation Techniques for Safety-Critical Systems-on-Chip. Electronics, 7(6), 2018.
15. Ronghui Gu et al. Deep Specifications and Certified Abstraction Layers. ACM
SIGPLAN Notices, 50(1), jan 2015.
16. Ronghui Gu et al. CertiKOS: An Extensible Architecture for Building Certified
Concurrent OS Kernels. In 12th USENIX Symposium on Operating Systems De-
sign and Implementation (OSDI 16), Savannah, GA, November 2016. USENIX
Association.
17. Pujie Han, Zhengjun Zhai, Brian Nielsen, and Ulrik Nyman. A Compositional
Approach for Schedulability Analysis of Distributed Avionics Systems. In 1st Int’l
Workshop on Methods and Tools for Rigorous System Design (MeTRiD@ETAPS),
Greece, EPTCS, 2018.
18. Chris Hawblitzel et al. Ironclad Apps: End-to-End Security via Automated Full-
System Verification. In 11th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 14), Broomfield, CO, October 2014. USENIX Associ-
ation.
19. Joseph Herkert, Jason Borenstein, and Keith Miller. The Boeing 737 MAX: Lessons
for engineering ethics. Science and engineering ethics, 26(6), 2020.
20. Gerwin Klein et al. seL4: Formal Verification of an OS Kernel. In ACM SIGOPS
22nd Symposium on Operating Systems Principles, SOSP ’09, New York, NY, USA,
2009.
21. John C. Knight. Safety Critical Systems: Challenges and Directions. In 24th Int’l
Conference on Software Engineering, ICSE ’02, New York, NY, USA, 2002.
22. Kim G. Larsen, Marius Mikučionis, Marco Muñiz, and Jiří Srba. Urgent partial
order reduction for extended timed automata. In Dang Van Hung and Oleg Sokol-
sky, editors, Automated Technology for Verification and Analysis, pages 179–195,
Cham, 2020. Springer International Publishing.
23. Kim G Larsen, Paul Pettersson, and Wang Yi. UPPAAL in a nutshell. Int'l
Journal on Software Tools for Technology Transfer, 1997.
24. Thierry Lecomte et al. Applying a Formal Method in Industry: A 25-Year Tra-
jectory. In Formal Methods: Foundations and Applications, Cham, 2017. Springer
International Publishing.
25. K. Rustan M. Leino. Dafny: An Automatic Program Verifier for Functional Cor-
rectness. In Logic for Programming, Artificial Intelligence, and Reasoning, Berlin,
Heidelberg, 2010. Springer Berlin Heidelberg.
26. Irwin Meisels and Mark Saaltink. The Z/EVES reference manual (for version 1.5).
Reference manual, ORA Canada, 1997.
27. Luke Nelson et al. Hyperkernel: Push-Button Verification of an OS Kernel. In 26th
Symposium on Operating Systems Principles, SOSP ’17, New York, NY, USA, 2017.
Association for Computing Machinery.
28. Tobias Scheipel, Leandro Batista Ribeiro, Tim Sagaster, and Marcel Baunach.
SmartOS: An OS Architecture for Sustainable Embedded Systems. In Tagungsband
des FG-BS Frühjahrstreffens 2022, Bonn, 2022. Gesellschaft für Informatik e.V.
29. Abhishek Singh, Meenakshi D’Souza, and Arshad Ebrahim. Conformance Testing
of ARINC 653 Compliance for a Safety Critical RTOS Using UPPAAL Model
Checker. New York, NY, USA, 2021.
30. UNECE. UN Regulation No. 156 – Uniform provisions concerning the approval
of vehicles with regards to software update and software updates management
system. [online] https://unece.org/sites/default/files/2021-03/R156e.pdf.
31. Virginie Wiels et al. Formal Verification of Critical Aerospace Software.
Aerospace Lab, May 2012.
32. Jianzhou Zhao et al. Formalizing the LLVM Intermediate Representation for Veri-
fied Program Transformations. In 39th Annual ACM SIGPLAN-SIGACT Sympo-
sium on Principles of Programming Languages, POPL ’12, New York, NY, USA,
2012. Association for Computing Machinery.
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Compositional Automata Learning of Synchronous Systems

Thomas Neele¹ and Matteo Sammartino²,³

¹ Eindhoven University of Technology, Eindhoven, The Netherlands
[email protected]
² Royal Holloway University of London, Egham, UK
³ University College London, London, UK
[email protected]
1 Introduction
In this paper we consider the case when the system to be learned consists
of several concurrent components that interact in a synchronous way; the com-
ponents themselves are not accessible, but their number and respective input
alphabets are known. It is well-known that the composite state-space can grow
exponentially with the number of components. If we use L∗ to learn such a system
as a whole, it will take a number of queries that is proportional to the whole state-
space – many more than if we were able to apply L∗ to the individual components.
Since in practice queries are implemented as tests performed on the system (in
the case of equivalence queries, exponentially many tests are required), learning
the whole system may be impractical if tests take a non-negligible amount of
time, e.g., if each test needs to be repeated to ensure accuracy of results or when
each test requires physical interaction with a system.
In this work we introduce a compositional approach that is capable of learning
models for the individual components, by interacting with an ordinary teacher
for the whole system. This is achieved by translating queries on a single com-
ponent to queries on the whole system and interpreting their results on the
level of a single component. The fundamental challenge is that components are
not independent: they interact synchronously, meaning that sequences of actions
in the composite system are realised by the individual components performing
their actions in a certain relative order. The implications are that: (i) the answer
to some membership queries for a specific component may be unknown if the
correct sequence of interactions with other components has not yet been dis-
covered; and (ii) counter-examples for the global system cannot univocally be
decomposed into counter-examples for individual components, therefore some of
them may result in spurious counter-examples that need to be corrected later.
To tackle these issues, we make the following contributions:
2 Preliminaries
Notation and terminology. We use Σ to denote a finite alphabet of action symbols, and Σ∗ to denote the set of finite sequences of symbols in Σ, which we call traces; we use ϵ to denote the empty trace. Given two traces s1, s2 ∈ Σ∗, we denote their concatenation by s1 · s2; for two sets S1, S2 ⊆ Σ∗, S1 · S2 denotes element-wise concatenation. Given s ∈ Σ∗, we denote by Pref(s) the set of prefixes of s, and by Suff(s) the set of its suffixes; the notation lifts to sets S ⊆ Σ∗ as expected. We say that S ⊆ Σ∗ is prefix-closed (resp. suffix-closed) whenever S = Pref(S) (resp. S = Suff(S)). The projection σ↾Σ′ of σ on an alphabet Σ′ ⊆ Σ is the sequence of symbols in σ that are also contained in Σ′. Finally, given a set S, we write |S| for its cardinality.
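As a small illustration of this notation, the following TypeScript sketch (our own; traces are modeled as arrays of symbols) computes prefixes and projections:

type Sym = string;
type Trace = Sym[];

// Pref(s): all prefixes of s, including the empty trace.
function prefixes(s: Trace): Trace[] {
  return Array.from({ length: s.length + 1 }, (_, i) => s.slice(0, i));
}

// Projection of s on Σ′: keep only the symbols of s contained in Σ′.
function project(s: Trace, sigmaPrime: Set<Sym>): Trace {
  return s.filter(a => sigmaPrime.has(a));
}

console.log(prefixes(["a", "b"]));                          // [], [a], [a, b]
console.log(project(["a", "b", "c"], new Set(["a", "c"]))); // [a, c]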
In this work we represent the state-based behaviour of a system as a labelled
transition system.
Some actions in Σ may not be allowed from a given state. We say that an action a is enabled in s, written s −a→, if there is a t such that s −a→ t. This notation is also extended to traces σ ∈ Σ∗, yielding s −σ→ t and s −σ→. The language of L is the set of traces enabled from the starting state, formally:

L(L) = {σ ∈ Σ∗ | ŝ −σ→}.
From here on, we only consider deterministic LTSs. Note that this does not
reduce the expressivity, in terms of the languages that can be encoded.
Remark 1. Languages of LTSs are always prefix-closed, because every prefix of
an enabled trace is necessarily enabled. Prefix-closed languages are accepted by
a special class of deterministic finite automata (DFA), where all states are final
except for a sink state, from which all transitions are self-loops. Our implemen-
tation (see Section 4) uses these models as the underlying representation of LTSs.
We now introduce a notion of parallel composition of LTSs, which must
synchronise on shared actions.
[Diagram: example LTSs L1 (over {a, c}) and L2 (over {b, c}) and their parallel composition L1 ∥ L2 over pairs of states (si, tj).]
Here a and b are local actions, whereas c is synchronising. Note that, despite L1
being able to perform c from its initial state s0 , there is no c transition from
(s0 , t0 ), because c is not initially enabled in L2 . First L2 will have to perform b
to reach t1, where c is enabled, which will allow L1 ∥ L2 to perform c. ⊓⊔
We sometimes also apply parallel composition to sets of traces: ∥i Si is equivalent
to ∥i Ti, where each Ti is a tree-shaped LTS that accepts exactly Si, i.e.,
L(Ti ) = Si . In such cases, we will explicitly mention the alphabet each Ti is
assigned. This notation furthermore applies to single traces: ∥i σi = ∥i {σi }.
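The formal definition of ∥ is not reproduced above, but the example suggests the standard rule: components synchronise on shared actions and interleave local ones. The following TypeScript sketch implements the product of two deterministic LTSs under that assumption (our reconstruction, not the paper's definition):

type State = string;

interface LTS {
  alphabet: Set<string>;
  initial: State;
  delta: Map<string, State>; // deterministic; keyed by "state|action"
}

const step = (l: LTS, s: State, a: string) => l.delta.get(`${s}|${a}`);

// Synchronous product: shared actions are taken jointly, local actions
// are interleaved; a shared action not enabled on both sides is blocked.
function compose(l1: LTS, l2: LTS): LTS {
  const alphabet = new Set([...l1.alphabet, ...l2.alphabet]);
  const initial = `(${l1.initial},${l2.initial})`;
  const delta = new Map<string, State>();
  const seen = new Set([initial]);
  const work: [State, State][] = [[l1.initial, l2.initial]];
  while (work.length > 0) {
    const [s, t] = work.pop()!;
    for (const a of alphabet) {
      const in1 = l1.alphabet.has(a), in2 = l2.alphabet.has(a);
      const s2 = in1 ? step(l1, s, a) : s; // a local to l2: l1 stays put
      const t2 = in2 ? step(l2, t, a) : t; // a local to l1: l2 stays put
      if ((in1 && s2 === undefined) || (in2 && t2 === undefined)) continue;
      const to = `(${s2},${t2})`;
      delta.set(`(${s},${t})|${a}`, to);
      if (!seen.has(to)) { seen.add(to); work.push([s2!, t2!]); }
    }
  }
  return { alphabet, initial, delta };
}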
2.1 L∗ algorithm
We now recall the basic L∗ algorithm. Although the algorithm targets DFAs, we
will present it in terms of deterministic LTSs, which we use in this paper (these
are a sub-class of DFAs, see Remark 1). The algorithm can be seen as a game in
which a learner poses queries to a teacher about a target language L that only
the teacher knows. The goal of the learner is to learn a minimal deterministic
LTS with language L. In practical scenarios, the teacher is an abstraction of the
target system we wish to learn a model of. The learner can ask two types of
queries: membership queries, which ask whether a given trace belongs to L, and
equivalence queries, which ask whether a hypothesis accepts exactly L; the latter
are answered either positively or with a counter-example.
             ϵ   b
        ϵ    1   0
  S     b    0   0
        a    1   1
        ab   1   0
        ba   0   0
        bb   0   0
S·Σ\S   aa   0   0
        aba  1   1
        abb  0   0

Fig. 1: A closed and consistent observation table (a) and the LTS that can be constructed from it (b).
Let rowT : S ∪ S·Σ → (E → {0, 1}) denote the function rowT(s)(e) = T(s · e) mapping each row of T to its content (we omit the subscript T when clear from the context). The crucial observation is that T approximates the Nerode congruence [28] for L as follows: s1 and s2 are in the same congruence class only if row(s1) = row(s2), for s1, s2 ∈ S. Based on this fact, the learner can construct a hypothesis LTS from the table, in the same way the minimal DFA accepting a given language is built via its Nerode congruence.
In order for the transition relation to be well-defined, the table has to satisfy two conditions: it must be closed, i.e., for every s ∈ S·Σ there is some s′ ∈ S with row(s) = row(s′), and consistent, i.e., whenever row(s1) = row(s2) for s1, s2 ∈ S, then row(s1 · a) = row(s2 · a) for all a ∈ Σ.
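A minimal TypeScript sketch of these two checks (our illustration; traces are encoded as strings and the table is a partial map to observations):

interface ObsTable {
  S: string[];               // prefix-closed set of access traces
  E: string[];               // suffix-closed set of distinguishing traces
  sigma: string[];           // the alphabet Σ
  T: Map<string, 0 | 1>;     // observed entries T(s·e)
}

// row(s): the contents of row s, one entry per e in E.
function row(t: ObsTable, s: string): string {
  return t.E.map(e => t.T.get(s + e)).join(",");
}

// Closed: every one-step extension of S has a matching row in S.
function isClosed(t: ObsTable): boolean {
  const rowsOfS = new Set(t.S.map(s => row(t, s)));
  return t.S.every(s => t.sigma.every(a => rowsOfS.has(row(t, s + a))));
}

// Consistent: rows that agree in S still agree after appending any symbol.
function isConsistent(t: ObsTable): boolean {
  return t.S.every(s1 => t.S.every(s2 =>
    row(t, s1) !== row(t, s2) ||
    t.sigma.every(a => row(t, s1 + a) === row(t, s2 + a))));
}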
The algorithm works in an iterative fashion: starting from the empty table,
where S and E only contain ϵ, the learner extends the table via membership
queries until it is closed and consistent, at which point it builds a hypothesis
and submits it to the teacher in an equivalence query. If a counter-example
is received, it is incorporated in the observation table by adding its prefixes
to S, and the updated table is again checked for closedness and consistency.
The algorithm is guaranteed to eventually produce a hypothesis H such that
L(H) = L, for which an equivalence query will be answered positively, causing
the algorithm to terminate.
[Diagram: the learning framework. n concurrent learners L∗i pose membership queries (wi ∈ L(Mi)?) and equivalence queries (Hi = Mi?) to an adapter; the adapter translates them into queries w ∈ L(M)? and H = M? on the composite system M = M1 ∥ · · · ∥ Mn for the teacher, and maps counter-examples σ back to component counter-examples σi.]
Fig. 3: Running example consisting of two LTSs L1 and L2 and their parallel composition L. The respective alphabets are {a, c}, {b, c} and {a, b, c}.
As sketched above, our adapter answers queries on each of the LTSs Mi , based
on information obtained from queries on M . However, the application of the
parallel operator causes loss of information, as the following example illustrates.
We use the LTSs of Fig. 3 as a running example throughout this section.
We thus need to relax the guarantees on the answers given by the adapter in
the following way:
1. Not all membership queries can be answered; the adapter may return the
answer ‘unknown’.
2. An equivalence query for component i can be answered with a spurious
counter-example σi ∈ L(Hi ) ∩ L(Mi ).
The procedures that implement the adapter are stated in Listing 1. For each
1 ≤ i ≤ n, we have one instance of each of the functions Member i and Equiv i ,
used by the ith learner to pose its queries. Here, we assume that for each compo-
nent i, a copy of the latest hypothesis Hi is stored, as well as a set Pi which con-
tains traces that are certainly in L(Mi ). Membership and equivalence queries on
M will be forwarded to the teacher via the functions Member (σ) and Equiv (H),
respectively.
4 This assumption can be satisfied in practice by using a lexicographical ordering on the conformance test suite the teacher generates to decide equivalence.
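Listing 1 itself is not reproduced here. The following speculative TypeScript sketch conveys the idea behind Member_i for the two-component case, as we read it from the surrounding text: to decide whether σ1 ∈ L(M1), the adapter weaves σ1 with traces known to be in L(M2) (the set P2) into candidate composite traces and forwards them to the teacher; if no woven trace is accepted, it answers 'unknown'. All names are our own, and the actual procedure is more refined.

type Answer = "yes" | "no" | "unknown";

// All traces over Σ1 ∪ Σ2 whose projections on Σ1 and Σ2 are s and t:
// shared symbols must match, local symbols interleave freely.
function weave(s: string[], t: string[],
               a1: Set<string>, a2: Set<string>): string[][] {
  if (s.length === 0 && t.length === 0) return [[]];
  const out: string[][] = [];
  if (s.length > 0 && !a2.has(s[0]))      // local step of component 1
    for (const r of weave(s.slice(1), t, a1, a2)) out.push([s[0], ...r]);
  if (t.length > 0 && !a1.has(t[0]))      // local step of component 2
    for (const r of weave(s, t.slice(1), a1, a2)) out.push([t[0], ...r]);
  if (s.length > 0 && t.length > 0 && s[0] === t[0]
      && a1.has(s[0]) && a2.has(s[0]))    // joint synchronising step
    for (const r of weave(s.slice(1), t.slice(1), a1, a2))
      out.push([s[0], ...r]);
  return out;
}

// Speculative Member_1: search P2 for a witness that composes with
// sigma1 into a trace of the whole system. A positive teacher answer
// implies sigma1 ∈ L(M1), since projections of system traces are
// component traces; otherwise nothing can be concluded.
function member1(sigma1: string[], P2: string[][],
                 a1: Set<string>, a2: Set<string>,
                 teacherMember: (sigma: string[]) => boolean): Answer {
  for (const sigma2 of P2)
    for (const sigma of weave(sigma1, sigma2, a1, a2))
      if (teacherMember(sigma)) return "yes";
  return "unknown"; // the real adapter can also derive "no" answers
}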
Example 7. Again considering our running example (Figure 3), suppose the two
learners call in parallel the functions Equiv 1 (H1 ) and Equiv 2 (H2 ). The provided
hypotheses and their parallel composition are as follows:
[Diagram: hypotheses H1 and H2 and their parallel composition H1 ∥ H2.]
Example 8. Suppose now that the hypotheses and their composition are:
[Diagram: hypotheses H1 and H2 and their parallel composition H1 ∥ H2.]
3.2 L∗ extensions
As explained in the previous section, the capabilities of our adapter are limited
compared to an ordinary teacher. We thus extend L∗ to deal with the answer
‘unknown’ to membership queries and to deal with spurious counter-examples.
now use the new relation ≈. We say an LTS M is consistent with T iff for all
s ∈ Σ ∗ such that T (s) is defined, we have T (s) = 1 iff s ∈ L(M ).
As discussed earlier, Angluin’s original L∗ algorithm relies on the fact that,
for a closed and consistent table, there is a unique minimal DFA (or, in our case,
LTS) that is consistent with T . However, the occurrence of wildcards in the
observation table may allow multiple minimal LTSs that are consistent with T .
Such a minimal consistent LTS can be obtained with a SAT solver, as described
in [19].
Similar to Angluin’s original algorithm, this extension comes with some cor-
rectness theorems. First of all, it terminates, outputting the minimal LTS for the
target language. Furthermore, each hypothesis is consistent with all membership
queries and counter-examples that were provided so far. Lastly, each subsequent
hypothesis has at least as many states as the previous one, but never more than
the minimal LTS for the target language.
Example 9. Refer again to the LTSs of our running example in Figure 3. Consider
the situation after proposing the hypotheses of Example 8 and receiving the
counter-example ccc, which is spurious for the second learner.
In the next iteration, Member 2 can answer some membership queries, such
as cbc, necessary to expand the table of the second learner. This is enabled by
the fact that P1 contains cc from the positive counter-example of Example 7
(line 2 of Listing 1). The resulting updated hypotheses are as follows.
[Diagram: updated hypotheses H1′ and H2′.]
Now the counter-example to composite hypothesis H1′ ∥ H2′ is cacc. The projec-
tion on Σ2 is ccc, which directly contradicts the counter-example received in the
previous iteration. This spurious counter-example is thus repaired by backtrack-
ing in the second learner. The invocation of Equiv 1 (H1′ ) by the first learner does
not return this counter-example, since H1′ ∥ H2′ and H1′ do not agree on cacc, so
the check on line 17 of Listing 1 fails.
Finally, in the next iteration, the respective hypotheses coincide with L1 and
L2 and both learners terminate. ⊓⊔
3.3 Correctness
As a first result, we show that our adapter provides correct information on each
of the components when asking membership queries. This is required to ensure
that information obtained by membership queries does not conflict with counter-
examples. Proofs are omitted for space reasons.
Lemma 1. Every counter-example obtained from Equiv (H) is valid for at least
one learner.
The next lemma shows that even if a spurious counter-example occurs, this
does not induce divergence, since it is always repaired by a corresponding positive
counter-example in finite time.
Remark 2. We cannot claim the stronger result that Hi = Mi for all i, since dif-
ferent component LTSs can result in the same parallel composition. For example,
consider the below LTSs, both with alphabet {a}:
[Diagram: two different single-state LTSs H1 and H2 over alphabet {a}, one with an a-loop and one without.]
3.4 Optimisations
There are a number of optimisations that can dramatically improve the practical
performance of our learning framework. We briefly discuss them here.
First, finding whether there is a trace σ ′ ∈ Π (line 2 of Listing 1) can quickly
become expensive once the sets Pi grow larger. We thus try to limit the size
of each Pi without impacting the amount of information it provides on the
4 Experiments
[Fig. 5: Scatter plots (log scale) comparing the number of queries posed by the monolithic L∗ (y-axis) against the compositional Coal (x-axis).]
On each LTS, we run the classic L∗ algorithm and Coal, and record the
number of queries posed to the teacher (see footnote 5). The result is plotted in Figure 5; note
the log scale. Here, marks that lie on the dashed line indicate a time-out or
out-of-memory for one of the two algorithms.
Coal outperforms the monolithic L∗ algorithm in the number of member-
ship queries for all cases (unless it fails). In more than half of the cases, the
5 The number of queries is the standard performance measure for query learning algorithms; runtime is less reliable, as it depends on the specific teacher implementation.
difference is at least three orders of magnitude; it can even reach six orders of
magnitude. For equivalence queries, the difference is less obvious, but our com-
positional approach scales better for larger systems. This is especially relevant
because, in practical implementations, equivalence queries may require a number
of membership queries that is exponential in the size of the system. Multiparty
communication systems benefit most from compositional learning. The num-
ber of spurious counter-examples that occurs for these models is limited: about
one on average. Only twelve models require more than five spurious counter-
examples; the maximum number required is thirteen. This is encouraging, since
even for this varied set of LTSs the amount of duplicate work performed by
Coal is limited.
Management instances, the increasing runtime of Coal is due to the fact that
two of the components grow as the parameter W increases. The larger number of
states causes a higher runtime of the SAT procedure for constructing a minimal
LTS.
We remark that in our experiments, the teacher has direct access to the LTS
we aim to learn, leading to cheap membership and equivalence queries. Thus, in
this idealised setting, L∗ incurs barely any runtime penalty for the large number
of queries it requires. Using a realistic teacher implementation would quickly
cause time-outs for L∗ , making the results of our experiments less insightful.
5 Related Work
Finding ways of projecting a known concurrent system down into its components
is the subject of several works, e.g., [8,17]. In principle, it would be possible to
learn the system monolithically and use the aforementioned results. However, as
shown in Section 4, this may result in a substantial query blow-up.
Learning approaches targeting various concurrent systems exist in the litera-
ture. As an example of the monolithic approach above, the approach of [6] learns
asynchronously-communicating finite state machines via queries in the form of
message sequence charts. The result is a monolithic DFA that is later broken
down into components via an additional synthesis procedure. This approach thus
does not avoid the exponential blow-up in queries. Another difference with our
work is that we consider synchronous communication.
Another monolithic approach is [18], which provides an extension of L∗ to
pomset automata. These automata are acceptors of partially-ordered multisets,
which model concurrent computations. Accordingly, this relies on an oracle capa-
ble of processing pomset-shaped queries; adapting the approach to an ordinary
sequential oracle – as in our setting – may cause a query blow-up.
A severely restricted variant of our setting is considered in [13], which in-
troduces an approach to learn Systems of Procedural Automata. Here, DFAs
representing procedures are learned independently. The constrained interaction
of such DFAs allows for deterministically translating between component-level
and system-level queries, and for univocally determining the target of a counter-
example. Our setting is more general – arbitrary (not just pair-wise) synchroni-
sations are allowed at any time – hence these abilities are lost.
Two works that do not allow synchronisation at all are [23,25]. In [23] indi-
vidual components are learned without any knowledge of the component number
and their individual alphabets; however, components cannot synchronise (alphabets
are assumed to be disjoint). This is a crucial difference with our approach,
which instead has to deal with unknown query results and spurious counter-
examples precisely due to the presence of synchronising actions. An algorithm
for learning Moore machines with decomposable outputs is proposed in [25]. This
algorithm spawns several copies of L∗ , one per component. This approach is not
applicable to our setting, as we do not assume decomposable output and allow
dependencies between components.
6 Conclusion
We have shown how to learn component systems with synchronous communica-
tion in a compositional way. Our framework uses an adapter and a number of
concurrent learners. Several extensions to L∗ were necessary to circumvent the
fundamental limitations of the adapter. Experiments with our tool Coal show
that our compositional approach offers much better scalability than a standard
monolithic approach.
In future work, we aim to build on our framework in a couple of ways. First,
we want to apply these ideas to all kinds of extensions of L∗ such as TTT [21]
(for reducing the number of queries) and algorithms for learning extended finite
state machines [7]. Our expectation is that the underlying learning algorithm
can be replaced with little effort. Next, we want to eliminate the assumption
that the alphabets of individual components are known a priori. We envisage
this can be achieved by combining our work and [23].
We also would like to explore the integration of learning and model-checking.
A promising direction is learning-based assume-guarantee reasoning, originally
introduced by Cobleigh et al. [9]. This approach assumes that models for
the individual components are available. Using our approach, we may be able
to drop this assumption, and enable a fully black-box compositional verification
approach.
References
1. Abel, A., Reineke, J.: Gray-Box Learning of Serial Compositions of Mealy Machines. In: NFM. pp. 272–287 (2016). https://doi.org/10.1007/978-3-319-40648-0_21
2. Amparore, E., et al.: Presentation of the 9th Edition of the Model Checking Contest. In: TACAS 2019. LNCS, vol. 11429, pp. 50–68 (2019). https://doi.org/10.1007/978-3-030-17502-3_4
3. Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75(2), 87–106 (1987). https://doi.org/10.1016/0890-5401(87)90052-6
4. Angluin, D., Fisman, D.: Learning regular omega languages. Theor. Comput. Sci. 650, 57–72 (2016). https://doi.org/10.1016/j.tcs.2016.07.031
5. Argyros, G., D'Antoni, L.: The Learnability of Symbolic Automata. In: CAV. pp. 427–445 (2018). https://doi.org/10.1007/978-3-319-96145-3_23
6. Bollig, B., Katoen, J., Kern, C., Leucker, M.: Learning Communicating Automata from MSCs. IEEE Trans. Software Eng. 36(3), 390–408 (2010). https://doi.org/10.1109/TSE.2009.89
7. Cassel, S., Howar, F., Jonsson, B., Steffen, B.: Active learning for extended finite state machines. Formal Aspects Comput. 28(2), 233–263 (2016). https://doi.org/10.1007/s00165-016-0355-5
8. Castellani, I., Mukund, M., Thiagarajan, P.S.: Synthesizing Distributed Transition Systems from Global Specifications. In: FSTTCS. LNCS, vol. 1738, pp. 219–231 (1999). https://doi.org/10.1007/3-540-46691-6_17
9. Cobleigh, J.M., Giannakopoulou, D., Pasareanu, C.S.: Learning Assumptions for Compositional Verification. In: TACAS. LNCS, vol. 2619, pp. 331–346. Springer (2003). https://doi.org/10.1007/3-540-36577-X_24
10. De Moura, L., Bjørner, N.: Z3: An Efficient SMT Solver. In: TACAS 2008. LNCS, vol. 4963, pp. 337–340 (2008). https://doi.org/10.1007/978-3-540-78800-3_24
11. Fiterau-Brostean, P., Janssen, R., Vaandrager, F.W.: Combining Model Learning and Model Checking to Analyze TCP Implementations. In: CAV 2016. LNCS, vol. 9780, pp. 454–471 (2016). https://doi.org/10.1007/978-3-319-41540-6_25
12. Fiterau-Brostean, P., Jonsson, B., Merget, R., de Ruiter, J., Sagonas, K., Somorovsky, J.: Analysis of DTLS Implementations Using Protocol State Fuzzing. In: USENIX (2020), https://www.usenix.org/conference/usenixsecurity20/presentation/fiterau-brostean
13. Frohme, M., Steffen, B.: Compositional learning of mutually recursive procedural systems (2021). https://doi.org/10.1007/s10009-021-00634-y
14. Grinchtein, O., Leucker, M.: Learning Finite-State Machines from Inexperienced Teachers. In: ICGI. pp. 344–345 (2006). https://doi.org/10.1007/11872436_30
15. Grinchtein, O., Leucker, M., Piterman, N.: Inferring Network Invariants Automatically. In: IJCAR. pp. 483–497 (2006). https://doi.org/10.1007/11814771_40
16. Groote, J.F., van der Hofstad, R., Raffelsieper, M.: On the random structure of behavioural transition systems. Science of Computer Programming 128, 51–67 (2016). https://doi.org/10.1016/j.scico.2016.02.006
17. Groote, J.F., Moller, F.: Verification of parallel systems via decomposition. In: CONCUR 1992. LNCS, vol. 630, pp. 62–76 (1992). https://doi.org/10.1007/BFb0084783
18. van Heerdt, G., Kappé, T., Rot, J., Silva, A.: Learning Pomset Automata. In: FoSSaCS 2021. LNCS, vol. 12650, pp. 510–530 (2021). https://doi.org/10.1007/978-3-030-71995-1_26
19. Heule, M.J.H., Verwer, S.: Exact DFA Identification Using SAT Solvers. In: ICGI 2010. LNCS, vol. 6339, pp. 66–79 (2010). https://doi.org/10.1007/978-3-642-15488-1_7
20. de la Higuera, C.: Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, USA (2010)
21. Isberner, M., Howar, F., Steffen, B.: The TTT Algorithm: A Redundancy-Free Approach to Active Automata Learning. In: RV 2014. LNCS, vol. 8734, pp. 307–322 (2014). https://doi.org/10.1007/978-3-319-11164-3_26
22. Isberner, M., Howar, F., Steffen, B.: The Open-Source LearnLib: A Framework for Active Automata Learning. In: CAV 2015. LNCS, vol. 9206, pp. 487–495 (2015). https://doi.org/10.1007/978-3-319-21690-4_32
23. Labbaf, F., Groote, J.F., Hojjat, H., Mousavi, M.R.: Compositional Learning for Interleaving Parallel Automata. In: FoSSaCS 2023. LNCS, Springer (2023)
24. Leucker, M., Neider, D.: Learning Minimal Deterministic Automata from Inexperienced Teachers. In: ISoLA. pp. 524–538 (2012). https://doi.org/10.1007/978-3-642-34026-0_39
25. Moerman, J.: Learning Product Automata. In: ICGI. vol. 93, pp. 54–66. PMLR (2018), http://proceedings.mlr.press/v93/moerman19a.html
26. Moerman, J., Sammartino, M., Silva, A., Klin, B., Szynwelski, M.: Learning nominal automata. In: POPL. pp. 613–625 (2017). https://doi.org/10.1145/3009837.3009879
27. Neele, T., Sammartino, M.: Replication package for the paper “Compositional Automata Learning of Synchronous Systems” (2023). https://doi.org/10.5281/zenodo.7503396
28. Nerode, A.: Linear automaton transformations. Proceedings of the American Mathematical Society 9(4), 541–544 (1958)
29. de Ruiter, J., Poll, E.: Protocol State Fuzzing of TLS Implementations. In: USENIX. pp. 193–206 (2015), https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/de-ruiter
30. Schuts, M., Hooman, J., Vaandrager, F.W.: Refactoring of Legacy Software Using Model Learning and Equivalence Checking: An Industrial Experience Report. In: IFM. LNCS, vol. 9681, pp. 311–325 (2016). https://doi.org/10.1007/978-3-319-33693-0_20
31. Shih, A., Darwiche, A., Choi, A.: Verifying Binarized Neural Networks by Angluin-Style Learning. In: SAT. vol. 11628, pp. 354–370 (2019). https://doi.org/10.1007/978-3-030-24258-9_25
32. Zuberek, W.: Petri net models of process synchronization mechanisms. In: SMC 1999. vol. 1, pp. 841–847. IEEE (1999). https://doi.org/10.1109/ICSMC.1999.814201
Concolic Testing of Front-end JavaScript
Zhe Li and Fei Xie
1 Introduction
2 Background
– P are the preconditions that must be met so that the function under test
can be executed.
– C is the function under test, containing the logic to be tested.
– Q are the post assertions of the test case that are expected to be true. (The three parts are illustrated in the sketch below.)
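As an illustration, a unit test with these three parts might look as follows; this is a TypeScript sketch with a made-up function under test, not code from the paper.

// P: preconditions: set up the context the function under test needs.
const cart = { items: [] as { name: string; price: number }[] };
cart.items.push({ name: "book", price: 12 });

// C: the function under test, containing the logic to be tested.
function totalPrice(c: typeof cart): number {
  return c.items.reduce((sum, it) => sum + it.price, 0);
}
const result = totalPrice(cart);

// Q: post assertions that are expected to be true.
console.assert(result === 12, "total of a single 12-unit item must be 12");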
obtain the web pages, parses them and stores page functions and their context
information individually so that test runners can run the functions browser-
less [4]. The test runner sets up the three parts of a test case for each JS function
under test and then executes the test case. The front-end JS testing framework
helps isolate the JS function under test and provides the execution context for
testing the function, which is an ideal entry for our application of the concolic
testing to front-end JS.
[Figure: in-situ concolic testing: a JS execution engine coupled with a symbolic execution engine and a constraint solver.]
3 Approach
3.1 Overview
[Fig. 4: Applying in-situ concolic testing to a JS web function. The function interceptor extracts the function (js_func) and its execution context from the web page through the JS testing framework (test runner, testing libraries, mocking input, HTML renderer); within the execution context, the function's arguments are marked as symbolic variables, and the execution tracer feeds the symbolic execution engine, which produces test cases.]
of tracing within the native execution context and what difference it makes in
Section 3.2.
– First, the function interceptor requests the page frame detail of the web
page where the targeted JS web function resides, utilizing the existing mock-
ing data and the HTML render function. The mocking data and the HTML
render function are usually created manually and included in the unit test
suite.
– Second, from the page frame detail, the function interceptor identifies
the function body in a pure JS form given the function name. To preserve the
Fig. 5: How to avoid unnecessary tracing of the test runner setup by delaying injection of symbolic values and start of tracing.
(indicated by the red box in Figure 5), we choose to inject symbolic values
inside the execution context and start tracing when the test runner actually
executes the encapsulated function, by calling the interface functions the in-
situ concolic testing provides. This way the execution tracer only captures the
execution trace of the encapsulated JS web function. The locations for injecting
symbolic values and starting tracing are indicated in the ”Execution Context
(EC)” box in Figure 4 and the captured execution trace is indicated by the
”Execution Trace” box in the right corner of Figure 4.
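The deferred injection can be pictured as follows. This is a sketch: markSymbolic, startTracing, and js_func are our placeholder names for the interface functions and the intercepted function, which the extended framework provides inside the execution context; they are not Puppeteer APIs.

// Placeholders for the in-situ concolic testing interface and the
// intercepted JS web function, available inside the execution context.
declare function markSymbolic<T>(name: string, seed: T): T;
declare function startTracing(): void;
declare function js_func(args: string): unknown;

function tracedInvocation(): unknown {
  // Test runner setup (mocking input, HTML rendering) has already run,
  // untraced. Only from this point on is the execution captured.
  const args = markSymbolic("args", "seed"); // inject symbolic value
  startTracing();                            // start tracing now
  return js_func(args);                      // only this call is traced
}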
Most Concise Execution Trace Figure 6 shows why our approach can obtain
the most concise execution trace for the JS web function driven by the test
runner of the JS testing framework. Apart from the overhead caused by the test
runner, the extraction of the execution context for the JS web function involves
calling a set of JS helper functions to collect web page information, such as
helper js 1 and JSHandle js 1. If we directly apply symbolic execution within
the test runner where the JS function is intercepted along with the execution
context extraction, the execution tracer will also capture the execution traces
of the test runner and the testing helper functions from the testing libraries
shown as ”Execution Trace 0” in the right-hand side of Figure 6. We modified
the test runner to mark symbolic variables and enable tracing control within the
execution context. Instead of starting tracing when the test runner starts, we
defer the tracing of the execution to when and where the test runner actually
executes the encapsulated function under test in the extracted execution context,
indicated by the ”Execution Trace 1” in the left-hand side of Figure 6. This way
we minimize the extent of execution tracing needed.
4 Implementations
Fig. 8: How we set symbolic variables in the execution context and enable cus-
tomized execution context tracing in Puppeteer
pure functions with respect to their inputs [16]. We refer to them as ”functional
components”. They accept arbitrary inputs (called “props”) and return React
elements describing what should appear on the web page [17]. An individual
component can be reused in different combinations. Therefore, the correctness
of an individual component is important with the respect to the correctness of
their compositions. In our implementation, we only consider components that
have at least one input.
Jest has a test runner, which allows us to run tests from the command line. Jest also provides additional utilities such as mocks, stubs, etc., besides the utilities of test cases, assertions, and test suites. We use Jest's mock data to set up the testing environment for the front-end components defined with React. Figure 10 shows how we leverage and extend Jest, assisted by the React testing library, to apply in-situ concolic testing to React components.

[Fig. 9: Example React components: an account panel with a search field, a button, and an account table (Account ID, Name).]

To encapsulate the JS function in the component with its execution context, we augmented the render function, whose functionality is to render the React com-
mented the render function, whose functionality is to render the React com-
ponent function and props as an individual unit for Jest to execute from the
web page, with the function interceptor. Through the render function, the
function interceptor extracts a complete execution context for the functional
component and intercepts the JS function wrapped in the functional component
indicated by the arrows in Figure 10. To enable customized execution context
tracing, the function interceptor then marks symbolic variables and starts
tracing after the completion of the encapsulation. At last, we configure Jest’s
test runner to run each unit test individually while initiating in-situ concolic exe-
cution so that we can obtain the most concise execution traces for later symbolic
analysis.
Fig. 10: How to apply in-situ concolic testing on React components using Jest
5 Evaluations
We have selected 21 GitHub projects utilizing Puppeteer. We test them using the Puppeteer framework extended with our concolic testing capability. As a result, we discovered four bugs triggered from their web pages, two of which originated from their dependency libraries. The projects were selected based on the following properties:
(1) They use Puppeteer for unit testing of their JS web features;
(2) They have JS functions in web pages and such functions have at least one argument whose type is string or number;
(3) They utilize evaluate in their unit tests.
We have developed a script based on these properties and used the search API provided by GitHub to collect applicable projects [20].
Table 1 summarizes the demographics of the 21 GitHub projects collected by our script. We calculated the statistics using ls-files [7] combined with cloc [8]. LoC/JS is the LoC (lines of code) of all JS files, which includes the JS files of the libraries the project depends on. LoC/HTML is the LoC of HTML files, which indicates the volume of front-end web content. The LoC of unit tests (LoC/unit test) covers the unit test files ending with .test.js. The test ratio is the ratio of LoC/unit test to LoC/JS, indicating the availability of unit tests for the projects. Before evaluation, we configure these projects to use the extended Puppeteer framework instead of the original one.
Result Analysis. We ran each project with our approach for 30 minutes. On average, our implementation generates 200 to 400 test cases for each function. Table 2 summarizes the bugs detected. For polymer, our method generates two types of test cases that trigger two different bugs in the user password validation functionalities of the project: 1) a generated test case induces execution to skip an if branch, which causes the password to be undefined, leading the condition !password || this.password === password to return true where it should have returned false (we have fixed this bug by changing the operator || to &&); 2) test cases containing unicode characters fail password pattern matching using a regular expression without the g flag, i.e., /[!@#$%^&*(),.?":|<>]/.test(value).
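An abridged reconstruction of the first polymer bug (identifiers follow the description above; the class wrapper is ours):

// When password is undefined, !password evaluates to true, so the
// whole disjunction returns true although validation should fail.
class PasswordValidator {
  constructor(stored) { this.password = stored; }
  matches(password) {
    return !password || this.password === password; // fixed by using &&
  }
}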
For InsugarTrading, a test case of a string not containing a comma is generated for
We identified two traits of the projects in which we did not detect bugs: (1) the project does not fit the design of our Puppeteer implementation, i.e., evaluate is not used in the test suite; (2) the applicable JS part is small and well tested.
– Dependency Installation. Collect and install dependencies for the target com-
ponent. Such dependencies can be components or libraries.
For the buy-eth feature shown in Figure 12, a test network error with a response code of 500 was triggered when testing the Ether deposit functionality. Concolic testing generates a test case of an invalid chainId for buyEth(), which is defined in the DepositEtherModal component. It is wrapped by a <Button> tag and can be triggered by onClick(). buyEth() calls into buyEthUrl(), which retrieves a url for the buyEth() function. buyEthUrl() does not check whether the url is valid or null before calling openTab(url) with the returned url, and there is also no validation of the input in the component implementation. Additionally, this process was not wrapped in a try/catch block. We caught this error in our evaluation. We tested 16 component folders and discovered that metamask-extension most likely ignores input checking if inputs are not directly from users. chainId is retrieved from mock data in this case, which is generated by our concolic engine.
Test Runner:
Test("buyEth", () => {
  let tc = render(<DepositEtherModal onClick=mockProp>)
  var chainId = "0x4";
  %MarkSymbolic(chainId);
  %StartTracing()
  tc.buyEth(chainId);
})

DepositEtherModal component (no input checking):
render() { chainId, buyEth }
  return (<div> ...
    <Button onButtonClick=buyEth(chainId)/>
  ... </div>)

action.js (no validation of url; an invalid chainId causes an empty url):
buyEth(chainId) {
  ...
  var url = getBuyEthUrl(chainId);
  var re = openTab(url);
  ...
}
6 Related Work
Our approach is closely related to work on symbolic execution for JS. Most of these efforts aim at back-end/standalone JS programs, primarily target specific bug patterns, and depend on whole-program analysis. Jalangi works on pure JS programs and instruments the source JS code to collect path constraints and data
for replaying [38]. COSETTE is another symbolic execution engine for JS, using an intermediate representation, namely JSIL, translated from JS [36]. ExpoSE applies symbolic execution to standalone JS and uses Jalangi as its symbolic execution engine. ExpoSE's contribution is to address a limitation of Jalangi by adding support for regular expressions in JS [33]. There are
few symbolic analysis frameworks for JS web applications. Oblique injects a symbolic JS library into the page's HTML. When a user loads the page, it conducts a symbolic page load to explore the possible behaviors of a web browser and a web server during the page load process, and it generates a list of pre-fetch URLs for the client side to speed up page loads [30]. It is an extension of the ExpoSE concolic engine. SymJS is a framework for testing client-side JS scripts and mainly focuses on automatically discovering and exploring web events [31]. It modifies the Rhino JS engine for symbolic execution [27,19]. Kudzu targets AJAX applications and focuses on discovering code injection vulnerabilities by implementing a dynamic symbolic interpreter that takes a simplified intermediate language for JS [37].
To the best of our knowledge, there is no publicly available symbolic execution engine targeting JS functions embedded in front-end web pages [32]. Another related approach to JS testing is fuzzing, which typically uses code coverage as feedback for test generation. There are a few fuzzers for JS, e.g., jsfuzz [11] and js-fuzz [10], which are largely based on the fuzzing logic of AFL (American fuzzy lop) [2], re-implemented for JS. We view fuzzing and symbolic/concolic testing as complementary techniques: fuzzing provides broader exploration of JS, while symbolic/concolic testing provides deeper exploration.
7 Conclusions
References
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Democratizing Quality-Based Machine Learning Development through Extended Feature Models*

Giordano d'Aloisio, Antinisca Di Marco, and Giovanni Stilo
1 Introduction
Machine Learning (ML) systems are increasingly used instruments, applied to all application domains and affecting our real life. The development
*
This work has been partially supported by EMELIOT national research project,
which has been funded by the MUR under the PRIN 2020 program (Con-
tract 2020W3A5FY) and by European Union – Horizon 2020 Program under the
scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communi-
ties”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastruc-
ture for Social Mining and Big Data Analytics” (http://www.sobigdata.eu)
2 Related Work
The problem of quality assurance in machine learning systems has gained much relevance in recent years. Many articles highlight the need to define and formalize new standard quality attributes for machine learning systems [30,65,70,50,12,46]. Most of the works in the literature focus either on the identification of the most relevant quality attributes for ML systems or on their formalization in the context of ML system development.
Concerning the identification of quality attributes in ML systems, the authors of [40,72] identify three main components in which quality attributes can be found: Training Data, ML Models and ML Platforms. The quality of Training Data is usually evaluated with properties such as privacy, bias, number of missing values, and expressiveness. By ML Model, the authors mean the trained model used by the system. The quality of this component is usually evaluated by fairness, explainability, interpretability, and security. Finally, the ML Platform is the implementation of the system, which is affected mostly by security, performance, reliability, and availability. Muccini et al. identify in [50] a set of quality properties as stakeholders' constraints and highlight the need to consider them during the Architecture Definition phase. The quality attributes are: data quality, ethics, privacy, fairness, ML models' performance, etc. Martínez-Fernández et al. also highlight in [46] the need to formalize quality properties in ML systems and to update the software quality requirements defined by ISO 25000 [36]. The most relevant properties highlighted by the authors concern ML safety, ML ethics, and ML explainability. In our work, we focus on quality properties that arise during the development of ML systems, such as fairness, explainability, interpretability, and dataset privacy, while we leave other quality properties (e.g., performance) that arise during other phases (e.g., deployment) for future work.
Many solutions have been proposed to formalize and model standard quality assurance processes in ML systems. Amershi et al. were the first to identify a set of common steps that characterize ML system development [5]. In particular, each ML system development is identified by nine stages that go from data collection and cleaning, to model training and evaluation, and finally to the deployment and monitoring of the ML model. Their work has been the foundation of many subsequent papers on quality modelling of ML systems. CRISP-ML (Cross-Industry Standard Process model for Machine Learning) is a process model proposed by Studer et al. [66], extending the better-known CRISP-DM [45] process model to ML systems. They identify a set of common phases for the building of ML systems, namely: Business and Data understanding, Data preparation, Modeling, Evaluation, Deployment, Monitoring and Maintenance.
For each phase, the authors identify a set of functional quality properties to guarantee the quality of such systems. Similarly, the Quality for Artificial Intelligence (Q4AI) consortium proposed a set of guidelines [32] for the quality assurance of ML systems for specific domains: generative systems, operational data in process systems, voice user interface systems, autonomous driving, and AI OCR. For each domain, the authors identify a set of properties and metrics to ensure quality. Concerning the modelling of quality requirements, Azimi et al. proposed a layered model for the quality assurance of machine learning systems in the context of the Internet of Things (IoT) [7]. The model is made of two layers: Source Data and ML Function/Model. For the Source Data, a set of quality attributes is defined: completeness, consistency, conformity, accuracy, integrity, and timeliness. Machine learning models are instead classified into predictors, estimators, and adapters, and a set of quality attributes is defined for each of them: accuracy, correctness, completeness, effectiveness, and optimality. Each system is then influenced by a subset of quality characteristics based on the type of ML model and the required data. Ishikawa proposed, instead, a framework for the quality evaluation of an ML system [35]. The framework defines these components for ML applications: dataset, algorithm, ML component, and system, and, for each of them, proposes an argumentation approach to assess quality.
Finally, Siebert et al. [64] proposed a formal modelling definition for quality re-
quirements in ML systems. They start from the process definition in [45] and
build a meta-model for the description of quality requirements. The meta-model
is made of the following classes: Entity (which can be defined at various levels
of abstraction, such as the whole system or a specific component of the system),
Property (also expressed at different levels of abstraction), Evaluation and Mea-
sure related to the property. Starting from this meta-model, the authors build
a tree model to evaluate the quality of the different components of the system.
From this analysis, we can conclude that there is a strong research motivation for formalizing and defining new quality attributes for ML systems. Many attempts have been proposed to address these issues, and several quality properties, metrics and definitions for ML systems can now be extracted from the literature. However, a framework that actually guides the data scientist through the development of ML systems satisfying quality properties is still missing. In this paper, we aim to address these concerns by proposing MANILA, a novel approach which democratizes the quality-based development of ML systems by means of a low-code platform. In particular, we model a general workflow for the quality-based development of ML systems as an SPL through the ExtFM formalism. Next, we demonstrate how it is possible to generate an actual implementation of such a workflow from a low-code experiment configuration and how this workflow is actually able to find the best methods to satisfy a given quality requirement. Recalling the ML development process of [5], MANILA focuses on the model training and model evaluation development steps by guiding the data scientist in selecting the ML system (i.e., ML algorithm and quality-enhancing method) that best satisfies a given quality attribute.
4 Motivating Scenario
– if the quality method works on the training set, it has to be applied to the dataset before training the ML algorithm;
– if the quality method works on the ML algorithm before training, then it has to be applied to the ML algorithm before the training phase;
– if the method works on the trained ML algorithm (i.e., f in the code), then it has to be applied after the training of the ML algorithm (the sketch below illustrates these three cases).
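A minimal sketch of these three cases, with hypothetical helper names (apply_to_data, apply_to_algorithm, apply_to_model) that are illustrative rather than MANILA's actual API:

# Apply a quality method at the stage it targets, then compute metrics.
def run_experiment(dataset, ml_algorithm, quality_method, metrics):
    if quality_method.stage == "pre-processing":
        # works on the training set: transform the data before training
        dataset = quality_method.apply_to_data(dataset)
        model = ml_algorithm.fit(dataset)
    elif quality_method.stage == "in-processing":
        # works on the ML algorithm: modify it before the training phase
        model = quality_method.apply_to_algorithm(ml_algorithm).fit(dataset)
    else:
        # post-processing: works on the trained ML algorithm (f in the code)
        model = quality_method.apply_to_model(ml_algorithm.fit(dataset))
    return {m.name: m.compute(model, dataset) for m in metrics}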
Finally, the data scientist computes the selected metrics for the specific pair of ML and QA methods. After repeating the process for all the selected methods, she chooses a report technique (e.g., table or chart), evaluates the results collected in the report, and trains on the entire dataset the best-performing ML algorithm, applying the quality method that best achieves the QA. If the data scientist has a threshold to achieve, then she can verify whether at least one of the combinations of ML and quality methods satisfies the constraint. If so, one of the suitable pairs is selected. Otherwise, she has to relax the threshold and repeat the process.
The workflow described in Algorithm 1 can be generalized as a process of
common steps describing any experiment in the considered domain. Figure 1
sketches such a generalization. First, the data scientist selects all the features of the experiment, i.e., the dataset, the ML Methods, the methods assuring a specific QA, and the related metrics. We call such a step Features Selection. Next, she runs the quality methods using the general approach described in Algorithm 1 and evaluates the results (namely, Experiment Execution). If the results are
satisfying (i.e., they satisfy the quality constraints), then the method with the best QA is returned. Otherwise, the data scientist has to repeat the process.

Fig. 1: Generalization of the experiment workflow: a Features Selection step (including dataset selection) followed by Experiment Execution, repeated until the results are satisfying.
The described workflow is the foundation of MANILA, which aims to formalise and democratise it by providing an SPL- and ExtFM-based low-code framework that supports the data scientist in the development of quality ML systems.
5 MANILA Approach
Fig. 2: Overview of MANILA: feature selection, experiment generation, and experiment execution produce, for each selected quality attribute, a quality report and the best quality method; a quality trade-off step across attributes is planned.
For each selected QA, the experiment execution returns:
1. a quality report reporting, for each quality method and ML algorithm, the related metrics;
2. the ML algorithm with the applied quality-enhancing method that performs best for the given QA, trained and ready for production.
In the future, MANILA will analyse the quality reports of each selected QA in
order to find the best trade-off among them (for instance, by means of Pareto-
front functions). The architecture of MANILA makes it easy to extend. In fact,
adding a new method or metric to MANILA just translates to adding a new
feature to the ExtFM and adding the proper code implementing it.
Near each step, we report the tools involved in its implementation. The source
code of the implemented artefacts is available on Zenodo [23] and GitHub [22].
In the following, we detail the ExtFM and each process step.
the positive value (used to compute fairness metrics). The Dataset could also
have one or more sensitive variables that identify sensitive groups subject to
unfairness [47]. The sensitive variables have a set of attributes to specify their
name and the privileged and unprivileged groups [47]. Finally, there is a feature
to specify if the Dataset has only positive attributes. This feature has been in-
cluded to define a cross-tree constraint with a scaler technique that requires only
positive attributes (see table 1). All these features are modelled as abstract since
they do not have a concrete implementation in the final experiment. The next
feature is a Scaler algorithm, which is not mandatory and can be included in the
experiment to scale and normalize the data before training the ML model [54].
Different scaler algorithms from the scikit-learn library [55] are listed as concrete
children of this feature. Next, there is the macro-feature representing the ML
Task to perform. This feature has not been modelled as mandatory since there
are two fairness methods (i.e. Gerry Fair and Meta Fair [39,17]) that embed a
fair classification algorithm and so, if they are selected, the ML Task cannot be
specified. However, we included a cross-tree constraint requiring the selection of
ML Task if any of these two methods are selected (¬ Gerry Fair ∧ ¬ Meta Fair
⇒ ML Task). An ML Task could be Supervised or Unsupervised. A Supervised
task could be a Classification task or a Regression task and has an attribute
to specify the size of the training set. These two abstract features are then de-
tailed by a set of concrete implementations of ML methods selected from the
scikit-learn library [55]. The Unsupervised learning task could be a Clustering
or an Aggregation task. At this stage of the work, these two features have not
been detailed and will be explored in future work. Next is the macro feature
representing the system’s Quality Attributes. This feature is detailed by the four
quality attributes described in section 3. Effectiveness is not included in these
features since it is an implicit quality of the ML methods and does not require
adding other components (i.e. algorithms) to the experiment. At the time of writing, the Fairness quality attribute has been detailed, while the other properties will be elaborated in future work. In particular, Fairness methods can be Pre-Processing
(i.e. strategies that try to mitigate the bias on the dataset used to train the
ML model [47,37,27]), In-Processing (i.e. methods that modify the behaviour
of the ML model to improve fairness [47,3]), and Post-Processing (i.e. methods
that re-calibrate an already trained ML model to remove bias [47,56]). These
three features are detailed by several concrete features representing fairness-
enhancing methods. In selecting such algorithms, we chose methods with a
solid implementation, i.e., algorithms integrated into libraries such as AIF360
[8] or Fairlearn [11] or algorithms with a stable source code such as DEMV
[26] or Blackbox [56]. All these quality features have been implemented with an
Or-group relationship. Next comes the macro feature representing the Metrics to use
in the experiment. Metrics are divided among Classification Metrics, Regression
Metrics and Fairness Metrics. Each metric category has a set of concrete metrics
selected from the scikit-learn library [55] and the AIF360 library [8]. Based on
the ML Task and the Quality Attributes selected, the data scientist must select
the proper metrics to assess Correctness and the other Quality Attributes. This
selection is supported by the cross-tree constraints defined in our model. These constraints are useful to guide the data scientist through
selecting proper fairness-enhancing methods or metrics based on the Dataset’s
characteristics (i.e., label type or the number of sensitive variables) or the ML
Task.
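As an illustration, the cross-tree constraint ¬ Gerry Fair ∧ ¬ Meta Fair ⇒ ML Task introduced above can be checked on a set of selected features as follows (a sketch of ours with abbreviated feature names, not MANILA's code):

# Implication a => b encoded as (not a) or b.
def satisfies_ml_task_constraint(selected):
    no_embedded_classifier = ("GerryFair" not in selected
                              and "MetaFair" not in selected)
    return (not no_embedded_classifier) or ("MLTask" in selected)

assert satisfies_ml_task_constraint({"GerryFair"})       # embedded fair classifier
assert satisfies_ml_task_constraint({"MLTask", "DEMV"})  # explicit ML Task
assert not satisfies_ml_task_constraint({"DEMV"})        # invalid configuration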
From the depicted ExtFM, the data scientist can define her experiment by spec-
ifying the needed features inside a configuration file. A configuration file is an
XML file describing the set of selected features and the possible attribute val-
ues. The constraints among features defined in the ExtFM will guide the data
scientist in the selection by not allowing the selection of features that are in con-
trast with already selected ones. The editor used to implement the ExtFM [68]
provides a GUI for the specification of configuration files, making this process
accessible to non-technical users.
Figure 4 depicts how the features selection and attribute specification pro-
cesses are done in MANILA. In particular, figure 4a details how the features of
the Dataset are selected inside the configuration. Note how features in contrast
with already selected ones are automatically disabled by the system (e.g., the
Binary feature is disabled since the MultiClass feature is selected). This au-
tomatic cut of the ExtFM guides the data scientist in defining configurations
that always lead to valid (i.e., executable) experiments. Figure 4b details how
attributes can be specified during the definition of the configuration. In partic-
ular, the rightmost column in figure 4b displays the attribute value specified by
the data scientist (e.g., the name of the label is y, and the positive value is 2).
During the experiment generation step, a process will automatically check if all
the required attributes (e.g., label name) have been defined. Otherwise, it will
ask the data scientist to fill them in.
<feature automatic="undefined" manual="selected" name="MultiClass"/>
Listing 1.1: Portion of configuration file
Listing 1.1 shows a portion of the configuration file derived from the fea-
ture selection process. In particular, it can be seen how the Dataset and the
Label features have been automatically selected by the system (features with
name="Dataset" and name="Label" and automatic="selected"), the Multi-
Class feature has been manually selected by the data scientist (feature with
name="MultiClass" and manual="selected"), and the Binary feature was not
selected (feature with name="Binary" and both automatic and manual un-
selected). In addition, the name and the value of two Label attributes (i.e.,
Positive value equal to 2 and Name equal to contr_use) are reported.
The structure of the configuration file makes it easy to parse with a proper script. In MANILA, we implemented a Python parser that reads the configura-
tion file given as input and generates a set of scripts implementing the defined
experiment. The parser can be invoked using the Python interpreter with the
following command shown in listing 1.2.
$ python generator.py -n <CONFIGURATION FILE PATH>
Listing 1.2: Python parser invocation
In particular, the parser first checks whether all the required attributes (e.g., the label's name) are set. If some of them are not set, it asks the data scientist to fill them in before continuing the parsing. Otherwise, it selects all the features with automatic="selected" or manual="selected" and uses them to fill a Jinja2 template [53]. The generated quality-evaluation experiment follows the same structure as Algorithm 1. It is embedded inside a Python function that takes as input the dataset to use (Listing 1.3). An example of a generated file can be accessed on the GitHub [22] or Zenodo [23] repository.
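A minimal sketch of how such a generator can be structured, assuming a hypothetical template file templates/experiment.py.jinja and an existing gen folder (the actual generator.py may differ):

import sys
import xml.etree.ElementTree as ET
from jinja2 import Environment, FileSystemLoader

def selected_features(config_path):
    # A feature belongs to the experiment if it was selected either
    # automatically (by constraint propagation) or manually.
    root = ET.parse(config_path).getroot()
    return [f.get("name") for f in root.iter("feature")
            if "selected" in (f.get("automatic"), f.get("manual"))]

def generate(config_path):
    features = selected_features(config_path)
    env = Environment(loader=FileSystemLoader("templates"))
    script = env.get_template("experiment.py.jinja").render(features=features)
    with open("gen/experiment.py", "w") as out:
        out.write(script)

if __name__ == "__main__":
    generate(sys.argv[sys.argv.index("-n") + 1])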
def experiment(data):
    # quality evaluation experiment
Listing 1.3: Quality-testing experiment signature
In addition to the main file, MANILA also generates a set of Python files needed to execute the experiment and an environment.yml file containing the specification of the conda [1] environment needed to perform the experiment. All the files are generated inside a folder named gen.
The generated experiment can be invoked directly through the Python interpreter using the command given in Listing 1.4. Otherwise, it can be called through a REST API or any other interface, such as a desktop application or a Scientific Workflow Management System like KNIME [44,10]. This generality of our experimental workflow makes it very flexible and suitable for many use cases.
The experiment applies each ML algorithm with each quality method and
returns a report using the adequate selected metrics along with the method
achieving the best QA. It is worth noting how each quality method is evaluated
individually on the selected ML algorithm, and for each QA, a corresponding
report is returned by the system. Figure 5 reports an example of how the quality
evaluation process is done in MANILA. In this example, the data scientist has
selected three ML algorithms and wants to assure Fairness and Explainability.
She has selected n methods to assure Fairness and m methods to assure Ex-
plainability. In addition, she has selected j metrics for Fairness and k metrics
for Explainability. Then, the testing process performs two parallel sets of exper-
iments. In the first, it applies the n fairness methods to each ML algorithm
accordingly and computes the j fairness metrics. In the second, it applies the m
Explainability methods to the ML algorithms and computes the k Explainabil-
ity metrics. Finally, the process returns two reports synthesising the obtained
results for Fairness and Explainability along with the ML algorithms with the
best Fairness and Explainability, respectively. If the data scientist chooses to see
the results in tabular form (i.e., selects the Tabular feature in the ExtFM), then
the results are saved in a CSV file. Otherwise, the charts displaying the results
are saved as PNG files. The ML algorithm returned by the experiment is instead
saved as a pickle file [2]. We have chosen this format since it is a standard format
to store serialized objects in Python and can be easily imported in other scripts.
Finally, it is worth noting how the generated experiment workflow is written
in Python and can be customised to address particular stakeholders’ needs or
evaluate other quality methods.
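Since the returned algorithm is stored as a pickle file, reusing it in another script reduces to deserializing it; a minimal sketch, with an assumed file name:

import pickle

# "gen/best_model.pickle" is an assumed name for the serialized algorithm.
with open("gen/best_model.pickle", "rb") as f:
    model = pickle.load(f)
# The deserialized object behaves like any trained estimator,
# e.g., model.predict(X) on new samples.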
6 Proof of Concept
To prove the ability of MANILA to support the quality-based development of ML systems, we implemented with MANILA a fair classification system to predict the frequency of contraceptive use by women, using a well-known dataset in the fairness literature [42]. This use case is reasonable since fairness has acquired much importance in recent years, partly because of the Sustainable Development Goals of the UN [51]. The first step in the quality development process is feature selection. From the given configuration, MANILA generates all the Python files needed to run the quality-assessment experiment. In particular, the generated experiment trains and tests all the selected ML algorithms (i.e., Logistic Regression, Support Vector Classifier, and Gradient Boosting Classifier), applying all the selected fairness methods properly (i.e., DEMV, Exponentiated Gradient, and Grid Search). Finally, it computes the selected metrics on the trained ML algorithms and returns a report of the metrics along with the fully trained ML algorithm with the best fairness. All the generated files are available
on Zenodo [23] and GitHub [22]. The generated experiment was executed directly from the Python interpreter, and the obtained results are reported in Table 2, which lists the fairness-enhancing methods, the ML algorithms, and all the computed metrics. The table has been automatically ordered based on the given aggregation function (i.e., the rightmost column, HMean). From the results, we can see that the Support Vector Classifier (i.e., svc in the table) and the DEMV fairness method achieve the best Fairness and Effectiveness trade-off, since they have the highest HMean value (highlighted in green in Table 2). Hence, the ML algorithm returned by the experiment is the Support Vector Classifier, trained with the full dataset after the application of the DEMV algorithm.
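Assuming HMean denotes the harmonic mean of the (suitably oriented) metric values, the aggregation can be computed as follows; the numbers are illustrative, not those of Table 2:

from statistics import harmonic_mean

metric_values = [0.71, 0.88]          # e.g., an effectiveness and a fairness score
hmean = harmonic_mean(metric_values)  # penalizes unbalanced trade-offs
print(round(hmean, 3))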
7 Threats to Validity
Although the QAs considered in MANILA are the most relevant and most cited in the literature, there could be other QAs, highly affecting the environment or end users of the ML system, that are not prominently addressed by existing papers. In addition, the proposed experimental workflow is based on the considered QAs; there could be other QAs, not considered at the time of this paper, that should be evaluated differently.
References
1. Conda website, https://docs.conda.io/
2. Pickle documentation, https://docs.python.org/3/library/pickle.html
3. Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., Wallach, H.: A Re-
ductions Approach to Fair Classification. In: Proceedings of the 35th Interna-
tional Conference on Machine Learning. pp. 60–69. PMLR (Jul 2018), https:
//proceedings.mlr.press/v80/agarwal18a.html, ISSN: 2640-3498
4. Aly, M.: Survey on multiclass classification methods. Neural Netw 19(1-9), 2
(2005)
5. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan,
N., Nushi, B., Zimmermann, T.: Software Engineering for Machine Learning: A
Case Study. In: 2019 IEEE/ACM 41st International Conference on Software Engi-
neering: Software Engineering in Practice (ICSE-SEIP). pp. 291–300. IEEE, Mon-
treal, QC, Canada (May 2019). https://doi.org/10.1109/ICSE-SEIP.2019.00042,
https://ieeexplore.ieee.org/document/8804457/
6. Apel, S., Batory, D., Kästner, C., Saake, G.: Feature-oriented software product
lines. Springer (2016)
7. Azimi, S., Pahl, C.: A layered quality framework for machine learning-driven data
and information models. In: ICEIS (1). pp. 579–587 (2020)
8. Bellamy, R.K., Dey, K., Hind, M., Hoffman, S.C., Houde, S., Kannan, K., Lohia,
P., Martino, J., Mehta, S., Mojsilović, A., et al.: Ai fairness 360: An extensible
toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research
and Development 63(4/5), 4–1 (2019)
9. Benavides, D., Segura, S., Ruiz-Cortés, A.: Automated analysis of fea-
ture models 20 years later: A literature review. Information Systems 35(6),
615–636 (Sep 2010). https://doi.org/10.1016/j.is.2010.01.001, https://www.
sciencedirect.com/science/article/pii/S0306437910000025
10. Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl,
T., Ohl, P., Thiel, K., Wiswedel, B.: Knime - the konstanz informa-
tion miner: Version 2.0 and beyond. SIGKDD Explor. Newsl. 11(1), 26–31
(Nov 2009). https://doi.org/10.1145/1656274.1656280
11. Bird, S., Dudík, M., Edgar, R., Horn, B., Lutz, R., Milan, V., Sameki,
M., Wallach, H., Walker, K.: Fairlearn: A toolkit for assessing and
improving fairness in AI. Tech. Rep. MSR-TR-2020-32, Microsoft
(May 2020), https://www.microsoft.com/en-us/research/publication/
fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/
12. Bosch, J., Olsson, H.H., Crnkovic, I.: Engineering AI Systems: A Research
Agenda (2021). https://doi.org/10.4018/978-1-7998-5101-1.ch001, https:
//www.igi-global.com/chapter/engineering-ai-systems/www.igi-global.
com/chapter/engineering-ai-systems/266130, ISBN: 9781799851011 Pages:
1-19 Publisher: IGI Global
13. Braiek, H.B., Khomh, F.: On testing machine learning pro-
grams. Journal of Systems and Software 164, 110542 (2020).
https://doi.org/10.1016/j.jss.2020.110542, https://www.
sciencedirect.com/science/article/pii/S0164121220300248
14. Buckland, M., Gey, F.: The relationship between recall and precision. Journal of
the American society for information science 45(1), 12–19 (1994), publisher: Wiley
Online Library
Extended Feature Models for Quality-Based ML Development 107
15. Carvalho, D.V., Pereira, E.M., Cardoso, J.S.: Machine learning interpretability: A
survey on methods and metrics. Electronics 8(8), 832 (2019)
16. Caton, S., Haas, C.: Fairness in machine learning: A survey (2020)
17. Celis, L.E., Huang, L., Keswani, V., Vishnoi, N.K.: Classification with fairness
constraints: A meta-algorithm with provable guarantees. In: Proceedings of the
conference on fairness, accountability, and transparency. pp. 319–328 (2019)
18. Chakraborty, J., Majumder, S., Yu, Z., Menzies, T.: Fairway: A way to build
fair ml software. In: Proceedings of the 28th ACM Joint Meeting on European
Software Engineering Conference and Symposium on the Foundations of Software
Engineering. pp. 654–665 (2020)
19. Chen, L., Ali Babar, M., Nuseibeh, B.: Characterizing architec-
turally significant requirements. IEEE Software 30(2), 38–45 (2013).
https://doi.org/10.1109/MS.2012.174
20. Chen, Z., Zhang, J.M., Hort, M., Sarro, F., Harman, M.: Fairness Testing: A Com-
prehensive Survey and Analysis of Trends (Aug 2022), http://arxiv.org/abs/
2207.10223, arXiv:2207.10223 [cs]
21. Clifton, C.: Privacy Metrics. In: LIU, L., ÖZSU, M.T. (eds.) Encyclope-
dia of Database Systems, pp. 2137–2139. Springer US, Boston, MA (2009).
https://doi.org/10.1007/978-0-387-39940-9_272, https://doi.org/10.1007/
978-0-387-39940-9_272
22. d’Aloisio, G., Marco, A.D., Stilo, G.: Manila github repository (Jan 2023), https:
//github.com/giordanoDaloisio/manila
23. d’Aloisio, G., Marco, A.D., Stilo, G.: Manila zenodo repository (Jan 2023).
https://doi.org/10.5281/zenodo.7525759, https://doi.org/10.5281/zenodo.
7525759
24. Di Sipio, C., Di Rocco, J., Di Ruscio, D., Nguyen, D.P.T.: A Low-Code Tool
Supporting the Development of Recommender Systems. In: Fifteenth ACM Con-
ference on Recommender Systems. pp. 741–744. ACM, Amsterdam Netherlands
(Sep 2021). https://doi.org/10.1145/3460231.3478885, https://dl.acm.org/doi/
10.1145/3460231.3478885
25. Domingos, P., Pazzani, M.: On the Optimality of the Simple Bayesian
Classifier under Zero-One Loss. Machine Learning 29(2), 103–130 (Nov
1997). https://doi.org/10.1023/A:1007413511361, https://doi.org/10.1023/A:
1007413511361
26. d’Aloisio, G., D’Angelo, A., Di Marco, A., Stilo, G.: Debiaser for Multiple Variables
to enhance fairness in classification tasks. Information Processing & Management
60(2), 103226 (Mar 2023). https://doi.org/10.1016/j.ipm.2022.103226, https://
www.sciencedirect.com/science/article/pii/S0306457322003272
27. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubrama-
nian, S.: Certifying and Removing Disparate Impact. In: Proceedings of
the 21th ACM SIGKDD International Conference on Knowledge Discov-
ery and Data Mining. pp. 259–268. ACM, Sydney NSW Australia (Aug
2015). https://doi.org/10.1145/2783258.2783311, https://dl.acm.org/doi/10.
1145/2783258.2783311
28. Friedman, J.H.: Stochastic gradient boosting. Computational statistics & data
analysis 38(4), 367–378 (2002), publisher: Elsevier
29. Galindo, J.A., Benavides, D., Trinidad, P., Gutiérrez-Fernández, A.M., Ruiz-
Cortés, A.: Automated analysis of feature models: Quo vadis? Computing 101(5),
387–433 (May 2019). https://doi.org/10.1007/s00607-018-0646-1, http://link.
springer.com/10.1007/s00607-018-0646-1
108 G. d’Aloisio et al.
46. Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendow-
icz, A., Vollmer, A.M., Wagner, S.: Software Engineering for AI-Based Systems:
A Survey. ACM Transactions on Software Engineering and Methodology 31(2),
37e:1–37e:59 (Apr 2022). https://doi.org/10.1145/3487043, https://doi.org/10.
1145/3487043
47. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A Survey on
Bias and Fairness in Machine Learning. ACM Computing Surveys 54(6), 1–35
(Jul 2021). https://doi.org/10.1145/3457607, https://dl.acm.org/doi/10.1145/
3457607
48. Menard, S.: Applied logistic regression analysis, vol. 106. Sage (2002)
49. Molnar, C.: Interpretable machine learning. Lulu. com (2020)
50. Muccini, H., Vaidhyanathan, K.: Software Architecture for ML-based Systems:
What Exists and What Lies Ahead. In: 2021 IEEE/ACM 1st Workshop on AI
Engineering - Software Engineering for AI (WAIN). pp. 121–128 (May 2021).
https://doi.org/10.1109/WAIN52551.2021.00026
51. Nations, U.: THE 17 GOALS | Sustainable Development, https://sdgs.un.org/
goals
52. Noble, W.S.: What is a support vector machine? Nature biotechnology 24(12),
1565–1567 (2006), publisher: Nature Publishing Group
53. PalletsProject: Jinja website, https://jinja.palletsprojects.com/
54. Patro, S., Sahu, K.K.: Normalization: A preprocessing stage. arXiv preprint
arXiv:1503.06462 (2015)
55. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine
learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
56. Putzel, P., Lee, S.: Blackbox Post-Processing for Multiclass Fairness.
arXiv:2201.04461 [cs] (Jan 2022), http://arxiv.org/abs/2201.04461, arXiv:
2201.04461
57. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-validation. Encyclopedia of database
systems 5, 532–538 (2009)
58. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-Validation, pp. 1–7. Springer New York,
New York, NY (2016). https://doi.org/10.1007/978-1-4899-7993-3_565-2
59. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-Validation. In: Encyclopedia of
Database Systems, pp. 1–7. Springer New York, New York, NY (2016).
https://doi.org/10.1007/978-1-4899-7993-3_565-2
60. Rosenfield, G., Fitzpatrick-Lins, K.: A coefficient of agreement as a measure of the-
matic classification accuracy. Photogrammetric Engineering and Remote Sensing
52(2), 223–227 (1986), http://pubs.er.usgs.gov/publication/70014667
61. Rönkkö, M., Heikkinen, J., Kotovirta, V., Chandrasekar, V.: Automated prepro-
cessing of environmental data. Future Generation Computer Systems 45, 13–
24 (2015). https://doi.org/10.1016/j.future.2014.10.011, https://
www.sciencedirect.com/science/article/pii/S0167739X14002040
62. Sahay, A., Indamutsa, A., Di Ruscio, D., Pierantonio, A.: Supporting the under-
standing and comparison of low-code development platforms. In: 2020 46th Eu-
romicro Conference on Software Engineering and Advanced Applications (SEAA).
pp. 171–178. IEEE (2020)
63. Saleiro, P., Kuester, B., Hinkson, L., London, J., Stevens, A., Anisfeld, A., Rodolfa,
K.T., Ghani, R.: Aequitas: A bias and fairness audit toolkit. arXiv preprint
arXiv:1811.05577 (2018)
110 G. d’Aloisio et al.
64. Siebert, J., Joeckel, L., Heidrich, J., Trendowicz, A., Nakamichi, K., Ohashi, K.,
Namba, I., Yamamoto, R., Aoyama, M.: Construction of a quality model for ma-
chine learning systems. Software Quality Journal pp. 1–29 (2021)
65. de Souza Nascimento, E., Ahmed, I., Oliveira, E., Palheta, M.P., Steinmacher, I.,
Conte, T.: Understanding development process of machine learning systems: Chal-
lenges and solutions. In: 2019 ACM/IEEE International Symposium on Empirical
Software Engineering and Measurement (ESEM). pp. 1–6. IEEE (2019)
66. Studer, S., Bui, T.B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., Müller,
K.R.: Towards crisp-ml (q): a machine learning process model with quality assur-
ance methodology. Machine Learning and Knowledge Extraction 3(2), 392–413
(2021)
67. Taha, A.A., Hanbury, A.: Metrics for evaluating 3D medical image seg-
mentation: analysis, selection, and tool. BMC Medical Imaging 15(1), 29
(Aug 2015). https://doi.org/10.1186/s12880-015-0068-x, https://doi.org/10.
1186/s12880-015-0068-x
68. Thüm, T., Kästner, C., Benduhn, F., Meinicke, J., Saake, G., Leich, T.: Featureide:
An extensible framework for feature-oriented software development. Science of
Computer Programming 79, 70–85 (2014)
69. Tramer, F., Atlidakis, V., Geambasu, R., Hsu, D., Hubaux, J.P., Humbert, M.,
Juels, A., Lin, H.: Fairtest: Discovering unwarranted associations in data-driven
applications. In: 2017 IEEE European Symposium on Security and Privacy (Eu-
roS&P). pp. 401–416. IEEE (2017)
70. Villamizar, H., Escovedo, T., Kalinowski, M.: Requirements engineering for ma-
chine learning: A systematic mapping study. In: SEAA. pp. 29–36 (2021)
71. Xu, R., Baracaldo, N., Joshi, J.: Privacy-Preserving Machine Learning: Methods,
Challenges and Directions. arXiv:2108.04417 [cs] (Sep 2021), http://arxiv.org/
abs/2108.04417, arXiv: 2108.04417
72. Zhang, J.M., Harman, M., Ma, L., Liu, Y.: Machine learning testing: Survey, land-
scapes and horizons. IEEE Transactions on Software Engineering (2020)
73. Zhou, J., Gandomi, A.H., Chen, F., Holzinger, A.: Evaluating the Quality of Ma-
chine Learning Explanations: A Survey on Methods and Metrics. Electronics 10(5),
593 (Jan 2021). https://doi.org/10.3390/electronics10050593, https://www.mdpi.
com/2079-9292/10/5/593, number: 5 Publisher: Multidisciplinary Digital Publish-
ing Institute
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Efficient Bounded Exhaustive Input Generation
from Program APIs
1 Introduction
Automated test generation approaches aim at assisting developers in crucial
software testing tasks [2,22], like automatically generating test cases or suites
[6,18,10], and automatically finding and reporting failures [23,19,12,20,4,13].
Many of these approaches involve random components that forgo a systematic exploration of the space of behaviors but improve test generation efficiency [23,19,10]. While these approaches have been useful in finding a large
number of bugs in software, they might miss exploring certain faulty software
behaviors due to their random nature. Alternative approaches aim at system-
atically exploring a very large number of executions of the software under test
(SUT), with the goal of providing stronger guarantees about the absence of
bugs [20,4,12,14,6,18]. Some of these approaches are based on bounded exhaus-
tive generation (BEG) [20,4], which consists of generating all feasible inputs that
can be constructed using bounded data domains. Common targets of BEG ap-
proaches have been implementations of complex, dynamic data structures with
rich structural constraints (e.g., linked lists, trees, etc). The most widely-used
and efficient BEG approaches for testing software [20,4] require the user to pro-
vide a formal specification of the constraints that the inputs must satisfy –often
a representation invariant of the input (repOK)–, and bounds on data domains
[20,4] –often called scopes. Thus, specification-based BEG approaches yield all
inputs within the provided scopes that satisfy repOK.
Writing appropriate formal specifications for BEG is a challenging and time-consuming task. The specifications must precisely capture the intended con-
straints of the inputs. Overconstrained specifications lead to missing the gen-
eration of valid inputs, which might make the subsequent testing stage miss
the exploration of faulty behaviors of the SUT. Underconstrained specifications
may lead to the generation of invalid inputs, which might produce false alarms
while testing the SUT. Furthermore, sometimes the user needs to take into ac-
count the way the generation approach operates, and write the specifications in
a very specific way for the approach to achieve good performance [4] (see Section
4). Finally, such precise formal specifications are seldom available in software,
hindering the usability of specification-based BEG approaches.
Several studies show that BEG approaches are effective in revealing software
failures [20,16,4,33]. Furthermore, the small scope hypothesis [3], which states
that most software faults can be revealed by executing the SUT on “small inputs”,
suggests that BEG approaches should discover most (if not all) faults in the
SUT, if large enough scopes are used. The challenge that BEG approaches face
is how to efficiently explore a huge search space that often grows exponentially
with respect to the scope. The search space often includes a very large number
of invalid (not satisfying repOK) and isomorphic inputs [15,28]. Thus, pruning
parts of the search space involving invalid and redundant inputs is key to make
BEG approaches scale up in practice [4].
In this paper, we propose a new approach for BEG, called BEAPI, that works
by making calls to API methods of the SUT. Similarly to API-based test gener-
ation approaches [23,19,10], BEAPI generates sequences of calls to methods from
the API (i.e., test sequences). The execution of each test sequence yielded by
BEAPI generates an input in the resulting BEG set of objects. As usual in BEG,
BEAPI requires the user to provide scopes for generation, which for BEAPI in-
cludes a maximum test sequence length. Brute force BEG from a user-provided
scope would attempt to generate all feasible test sequences of methods form the
API with up to a maximum sequence length. This is an intrinsically combinato-
rial process, that exhausts computational resources before completion even for
very small scopes (see Section 4). We propose several pruning techniques that are
crucial for the efficiency of BEAPI, and allow it to scale up to significantly larger
scopes. First, BEAPI executes test sequences and discards those that correspond
to violations of API usage rules (e.g., throwing exceptions that indicate incorrect
API usage, such as IllegalArgumentException in Java [17,23]). Thus, as op-
posed to specification-based BEG approaches, BEAPI does not require a repOK
that precisely describes valid inputs. Instead, BEAPI requires minimal specification effort in most cases (including most of our case studies in Section 4),
which consists of making API methods throw exceptions on invalid inputs (in
the “defensive programming” style popularized by Liskov [17]). Second, BEAPI
implements state matching [15,28,36] to discard test sequences that produce in-
puts already created by previously explored sequences. Third, BEAPI employs
only a subset of the API methods to create test sequences: a set of methods
automatically identified as builders [27]. Before test generation, BEAPI executes
an automated builders identification approach [27] to find a smaller subset of the
API that is sufficient to yield the resulting BEG set of inputs. Another advan-
tage of BEAPI with respect to specification-based approaches is that it produces
test sequences to create the corresponding inputs using methods from the API,
making it easier to create tests from BEAPI’s output [5].
We experimentally assess BEAPI, and show that its efficiency and scalability
are comparable to those of the fastest BEG approach (Korat), without the need
for repOKs. We also show that BEAPI can be of help in finding flaws in repOKs,
by comparing the sets of inputs generated by BEAPI using the API against the
sets of inputs generated by Korat from a repOK. Using this procedure, we found
several flaws in repOKs employed in the experimental assessment of related tools,
thus providing evidence on the difficulty of writing repOKs for BEG.
2 A Motivating Example
To illustrate the difficulties of writing formal specifications for BEG, consider
Apache’s NodeCachingLinkedList’s (NCL) representation invariant shown in
Figure 1 (taken from the ROOPS benchmark5 ). NCLs are composed of a main
circular, doubly-linked list, used for data storage, and a cache of previously used
nodes implemented as a singly linked list. Nodes removed from the main list
are moved to the cache, where they are saved for future usage. When a node is
required for an insertion operation, a cache node (if one exists) is reused (instead
of allocating a new node). As usual, repOK returns true iff the input structure
satisfies the intended NCL properties [17]. Lines 1 to 20 check that the main list
is a circular doubly-linked list with a dummy head; lines 21 to 33 check that the
cache is a null terminated singly linked list (and the consistency of size fields
is verified in the process). This repOK is written in the way recommended by
the authors of Korat [4]. It returns false as soon as it finds a violation of an
intended property in the current input. Otherwise, it returns true at the end.
This allows Korat to prune large portions of the search space, and improves its
5 https://code.google.com/p/roops/
performance [4]. repOK suffers from underspecification: it does not state that the
sentinel node and all cache nodes must have null values (lines 3-4 and 28-29,
respectively). Mistakes like these are very common when writing specifications
(see Section 4.3), and difficult to discover by manual inspection of repOK. These
errors can have serious consequences for BEG. Executing Korat with repOK and
a scope of up to 8 nodes produces 54.5 million NCL structures, while the actual
number of valid NCL instances is 2.8 million. Clearly, this is a problem for Ko-
rat’s performance, and for the subsequent testing of the SUT. In addition, the
invalid instances generated might trigger false alarms in the SUT in many cases.
We discovered these errors in repOK with the help of BEAPI: we automatically
contrasted the structures generated using BEAPI and the NCL’s API, with those
generated using Korat with repOK, for the same scope.
This example shows that writing sound and precise repOKs for BEG is difficult and time-consuming. Fine-tuning repOKs to improve the performance of BEG
(e.g., for Korat) is even harder. The main advantage of BEAPI is that it requires
minimal specification effort to perform BEG. If API methods used for generation
are correct, all generated structures are valid by construction. The programmer
only needs to make sure that API methods throw exceptions when API usage
rules are violated, in a defensive programming style [17]. In most cases, this requires checking very simple conditions on the inputs. In our example, the method to add an element to an NCL throws an IllegalArgumentException when it is called with the null element (the implementation of the method takes care that the remaining NCL properties hold).

max.objects=3
int.range=0:2
# strings=str1,str2,str3
# omit.fields=NodeCachingLinkedList.DEFAULT_MAXIMUM_CACHE_SIZE

Fig. 2: BEAPI's scope configuration for NCL
BEAPI, but in other cases omitting fields may have an impact. The configura-
tion in Figure 2 is enough for BEAPI to generate NCLs with a maximum of 3
nodes, containing integers from 0 to 2 as values, which allowed us to mimic the
structures generated by Korat for the same scope.
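The defensive checks BEAPI relies on are typically one-liners at method entry. A minimal sketch in the spirit of the add method described above (our illustration, not Apache's actual NodeCachingLinkedList code):

import java.util.LinkedList;

public class Ncl<E> {
    private final LinkedList<E> main = new LinkedList<>();

    public void add(E value) {
        // Reject invalid API usage up front; BEAPI interprets this
        // exception as an API misuse and discards the test sequence.
        if (value == null)
            throw new IllegalArgumentException("null elements not allowed");
        main.add(value); // remaining NCL invariants hold by construction
    }
}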
1 int[] linearize(O root, Heap<O, E> heap, int scope, Regex omitFields) {
2 Map ids = new Map(); // maps nodes into their unique ids
3 return lin(root, heap, scope, ids, omitFields);
4 }
5 int[] lin(O root, Heap<O, E> heap, int scope, Map ids, Regex omitFields) {
6 if (ids.containsKey(root))
7 return singletonSequence(ids.get(root));
8 if (ids.size() == scope)
9 throw new ScopeExceededException();
10 int id = ids.size() + 1;
11 ids.put(root, id);
12 int[] seq = singletonSequence(id);
13 Edge[] fields = sortByField({ <root, f, o> in E }, omitFields);
14 foreach (<root, f, o> in fields) {
15 if (isPrimitive(o))
16 seq.add(uniqueRepresentation(o));
17 else
18 seq.append(lin(o, heap, scope, ids, omitFields));
19 }
20 return seq;
21 }
Fig. 3: Linearization with support for omitting fields (for the original version see [36]).
linearize starts a depth-first traversal
of the heap from the root, by invoking lin in line 3. To canonicalize the heap,
lin assigns different identifiers to the different objects it visits. Map ids stores
the mapping between objects and unique object identifiers. When an object is
visited for the first time, it is assigned a new unique identifier (lines 10-11), and
a singleton sequence with the identifier is created to represent the object (line
12). Then, the object’s fields, sorted in a predefined order (e.g., by name), are
traversed and the linearization of each field value is constructed, and the result
is appended to the sequence representing the current object (lines 13-19). A field
storing a primitive value is represented by a singleton sequence with the primitive
value (lines 15-16). If a field references an object, a recursive call to lin converts
the object into a sequence, which will be appended to the result (line 18). At the
end of the loop, seq contains the canonical representation of the whole rooted
heap starting at root, and is returned by lin (line 20). When an already visited
object is traversed by a recursive call, the object must have an identifier already
assigned in ids (line 6), and lin returns the singleton sequence with the object’s
unique identifier (line 7). When more than scope objects are reachable from the rooted heap, lin throws an exception to report that the scope has been exceeded (lines 8-9). The exception will be employed later on by BEAPI to discard test
sequences that create objects larger than allowed by the scope. linearize also
takes as a parameter a regular expression omitFields, that matches the names of
the fields that must be omitted during canonicalization (see Section 3.1). To omit
such fields, we implemented sortByField (line 13) in such a way that it does
not return the edges corresponding to fields whose names match omitFields.
This in turn avoids saving the values of omitted fields in the sequence yielded by
linearize. Finally, notice that linearization allows for efficient comparison of
objects (rooted heaps): two objects are equal if and only if their corresponding
sequences yielded by linearize are equal.
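This equality check is what enables state matching: a structure is kept only if its canonical sequence has not been seen before. A minimal sketch (ours; the identifier canonicalStrs follows the prose):

import java.util.*;

public class StateMatching {
    static final Set<List<Integer>> canonicalStrs = new HashSet<>();

    // Returns true iff this canonical sequence is new.
    static boolean isNewStructure(List<Integer> linearization) {
        return canonicalStrs.add(linearization);
    }

    public static void main(String[] args) {
        // Linearizations of isomorphic heaps are equal, so the second
        // structure below is discarded.
        System.out.println(isNewStructure(List.of(1, 5, 2, 7))); // true
        System.out.println(isNewStructure(List.of(1, 5, 2, 7))); // false
    }
}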
ically created from configuration options int.range, strings, etc., see Fig. 2);
and a regular expression matching fields to be omitted in the canonicalization
of structures, omitFields. Notice that methods from more than one class could
be passed in methods if one wants to generate objects for several classes in the
same execution of BEAPI, e.g., when methods from one class take objects from
another class as parameters. BEAPI’s map currSeqs stores, for each type, the
list of test sequences that are known to generate structures of the type. currSeqs
starts with all the primitive typed sequences in primitives (lines 2-3). At each
iteration of the main loop (lines 5-34), BEAPI creates new sequences for each
available method m (line 8), by exhaustively exploring all the possibilities for
creating test sequences using m and inputs generated in previous iterations and
stored in currSeqs (lines 9-30). The newly created test sequences that generate
new structures in the current iteration are saved in map newSeqs (initialized
empty in line 6); all the generated sequences are then added to currSeqs at the
end of the iteration (line 33). If no new structures are produced at the current
iteration (newStrs is false in line 32), BEAPI’s main loop terminates and the list
of all sequences in currSeqs is returned (line 35).
Let us now discuss the details of the for loop in lines 9-30. First, all sequences
that can be used to construct inputs for m are retrieved in seqsT1, ..., seqsTn.
BEAPI explores each tuple (s1, ..., sn) of feasible inputs for m. Then, it executes
createNewSeq (line 13), which constructs a new test sequence newSeq by per-
forming the sequential composition of test sequences s1, ..., sn and routine m, and
replacing m’s formal parameters by the variables that create the required objects
in s1, ..., sn. newSeq is then executed (line 14) and it either produces a failure
(failure is set to true), raises an exception that represents an invalid usage of
the API (exception is set to true), or its execution is successful and it creates
new objects o1, ..., on, or. In case of a failure, an exception is thrown and newSeq
is presented to the user as a witness of the failure (line 15). If a different kind of
exception is thrown, BEAPI assumes it corresponds to an API misuse (see below),
discards the test sequence (line 16) and continues with the next candidate se-
quence. Otherwise, the execution of newSeq builds new objects o1, ..., on, or (or
values of primitive types) that are canonicalized by makeCanonical (line 17), by
executing linearize from Figure 3 on each structure. If any of the structures
produced by newSeq exceeds the scope, makeCanonical sets outOfScope to true,
BEAPI discards newSeq and continues with the next one (line 18). If none of the
above happens, makeCanonical returns canonical versions of o1, ..., on, or in
variables c1, ..., cn, cr, respectively. Afterwards, BEAPI performs state match-
ing by checking that the canonical structure c1 is of reference type and that
it has not been created by any previous test sequence (line 19). Notice that
canonicalStrs stores all of the already visited structures. If c1 is a new struc-
ture, it is added to canonicalStrs (line 27), and the sequence that creates c1 ,
newSeq, is added to the set of test sequences producing structures of type T1
(newSeqs in line 27). Also, newStrs is set to true to indicate that at least one
new object has been created in the current iteration (line 22). This process is
repeated for canonical objects c2, ..., cn, cr (lines 24-29).
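To make the shape of this loop concrete, the following is a condensed, hypothetical Java sketch of the generation loop described above; Sequence, Outcome and all helper methods are stand-ins for machinery this excerpt does not reproduce, not BEAPI's actual code.

    import java.lang.reflect.Method;
    import java.util.*;

    abstract class BeapiLoopSketch {
        static class Sequence { }                         // a test sequence (stub)
        static class Outcome {                            // result of executing a sequence
            boolean failure, apiMisuse, outOfScope;
            List<Object> created = new ArrayList<>();     // objects o1, ..., on, or
        }

        abstract List<List<Sequence>> inputTuples(Method m, Map<Class<?>, List<Sequence>> currSeqs);
        abstract Sequence createNewSeq(Method m, List<Sequence> inputs);
        abstract Outcome executeAndCanonicalize(Sequence s, int scope);
        abstract String canonical(Object o);              // linearization-based canonical form

        List<Sequence> generate(List<Method> api, Map<Class<?>, List<Sequence>> primitives, int scope) {
            Map<Class<?>, List<Sequence>> currSeqs = new HashMap<>(primitives);
            Set<String> canonicalStrs = new HashSet<>();  // all visited structures
            boolean newStrs = true;
            while (newStrs) {                             // main loop
                newStrs = false;
                Map<Class<?>, List<Sequence>> newSeqs = new HashMap<>();
                for (Method m : api) {
                    for (List<Sequence> tuple : inputTuples(m, currSeqs)) {
                        Sequence newSeq = createNewSeq(m, tuple);
                        Outcome out = executeAndCanonicalize(newSeq, scope);
                        if (out.failure) throw new AssertionError("failure witness: " + newSeq);
                        if (out.apiMisuse || out.outOfScope) continue;  // discard candidate
                        for (Object o : out.created) {
                            // state matching: keep newSeq only if it builds an unseen structure
                            if (canonicalStrs.add(canonical(o))) {
                                newSeqs.computeIfAbsent(o.getClass(), k -> new ArrayList<>()).add(newSeq);
                                newStrs = true;
                            }
                        }
                    }
                }
                // add the sequences generated in this iteration to currSeqs
                newSeqs.forEach((t, s) -> currSeqs.computeIfAbsent(t, k -> new ArrayList<>()).addAll(s));
            }
            List<Sequence> all = new ArrayList<>();
            currSeqs.values().forEach(all::addAll);
            return all;                                   // all sequences in currSeqs
        }
    }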
BEAPI distinguishes failures from bad API usage based on the type of the ex-
ception (similarly to previous API based test generation techniques [23]). For ex-
ample, IllegalArgumentException and IllegalStateException correspond
to API misuses, and the remaining exceptions are considered failures by default.
BEAPI’s implementation allows the user to select the exceptions that correspond
to failures and those that do not, by setting the corresponding configuration pa-
rameters. As mentioned in Section 2, BEAPI assumes that API methods throw
exceptions when they fail to execute on invalid inputs. We argue that this is a
common practice, called defensive programming [17], that should be followed by
all programmers, as it results in more robust code and improves software testing
in general [2] (besides helping automated test generation tools). We also argued
in Section 2 that the specification effort required for defensive programming is
much less than that of writing precise (and efficient) repOKs for BEG; we
confirmed this by manually inspecting the source code of our case studies. On the other
hand, note that BEAPI can employ formal specifications to reveal bugs in the
API, e.g., by executing repOK and checking that it returns true on every generated
object of the corresponding type (as in Randoop [23]). However, the specifica-
tions used for bug finding do not need to be very precise (e.g., the underspecified
NCL repOK from Section 2 is fine for bug finding), or written in a particular way
(as required by Korat). Other kinds of specifications that are weaker and simpler
to write can also be used by BEAPI to reveal bugs, like violations of language
specific contracts (e.g., equals is an equivalence relation in Java), metamorphic
properties [7], user-provided assertions (assert), etc.
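As an illustration of such weak oracles, the sketch below runs a repOK check and a trivial equals contract check over every generated object; the HasRepOK interface and all names are assumptions made for the example, not BEAPI's API.

    import java.util.List;

    final class OracleChecks {
        interface HasRepOK { boolean repOK(); }

        // Run weak oracles over the objects produced by the generated test sequences.
        static <T extends HasRepOK> void check(List<T> objects) {
            for (T o : objects) {
                if (!o.repOK())
                    throw new AssertionError("repOK violated by generated object: " + o);
                if (!o.equals(o))            // equals must be reflexive (language contract)
                    throw new AssertionError("equals is not reflexive for: " + o);
            }
        }
    }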
Another advantage of BEAPI is that, for each generated object, it yields
a test sequence that can be executed to create the object. This is in contrast
with specification based approaches (that generate a set of objects from repOK).
Finding a sequence of invocations to API methods that create a specific structure
is a difficult problem on its own, that can be rather costly computationally [5], or
require significant effort to perform manually. Thus, often objects generated by
specification based approaches are “hardwired” when used for testing a SUT (e.g.,
by using Java reflection), making tests very hard to understand and maintain,
as they depend on the low-level implementation details of the structures [5].
4 Evaluation
In this section, we experimentally assess BEAPI against related approaches. The
evaluation is organized around the following research questions:
RQ1 Can BEG be performed efficiently using API routines?
RQ2 How much do the proposed optimizations impact the performance of BEG
from the API?
RQ3 Can BEAPI help in finding discrepancies between repOK specifications and
the API’s object generation ability?
As case studies, we employ data structures implementations from four bench-
marks: three employed in the assessment of existing testing tools (Korat [4],
Kiasan [9], FAJITA [1]), and ROOPS. These benchmarks cover diverse implemen-
tations of complex data structures, which are a good target for BEG. We choose
these as case studies because the implementations come equipped with repOKs,
written by the authors of the benchmarks. The experiments were run on a work-
station with an Intel Core i7-8700 CPU (3.2 GHz) and 16 GB of RAM. We set a
timeout of 60 minutes for each individual run. To replicate the experiments, we
refer the reader to the paper’s artifact [25].
to create more test sequences in each successive iteration, which makes its per-
formance suffer more in such cases. As expected, the way repOKs are written
has a significant impact on Korat’s performance. For example, for binomial heaps
(BinHeap) Korat reaches scope 8 with Roops’ repOK, scope 10 with FAJITA’s
repOK, and scope 11 with Korat’s repOK (all equivalent in terms of generated
structures). In most cases, repOKs from the Korat benchmark result in better
performance, as these are fine-tuned for usage with Korat. Case studies with er-
rors in repOKs are grayed out in the table, and discussed further in Section 4.3.
Notice that errors in repOKs can severely affect Korat’s performance.
methods) from java.util, and NCL from Apache Collections (20 methods). As with
most real-world implementations, these data structures do not come equipped
with repOKs, hence we only employed them in this RQ.
The brute force approach (NoOPT) performs poorly even for the easiest case
studies and very small scopes, which are often insufficient for generating
high-quality test suites. State matching is the most im-
pactful optimization, greatly improving by itself the performance and scalability
all around (compare NoOPT and SM results). As expected, builders identifica-
tion is much more relevant in cases where the number of methods in the API
is large (more than 10), and especially in the real-world data structures (with
20 or more API methods). SM/BLD is more than an order of magnitude faster
than SM in AVL and RBT, and it reaches one more scope in NCL and LList. The
remaining classes of ROOPS have just a few methods, and the impact of using
builders is relatively small. The conclusions drawn from ROOPS apply to the other
three benchmarks (we omit their results here for space reasons, visit the paper’s
website for a complete report [26]). In the real world data structures, using pre-
computed builders allowed SM/BLD to scale to significantly larger scopes in all
cases but TreeMap and TreeSet, where it nevertheless improves running times considerably.
Overall, the proposed optimizations have a crucial impact on BEAPI’s perfor-
mance and scalability, and both should be enabled to obtain good results.
On the cost of builders identification. Due to space reasons we report builders
identification times in the paper’s website [26]. For the conclusions of this sec-
tion, it is sufficient to say that scope 5 was employed for builders identification
in all cases, and that the maximum runtime of the approach was 65 seconds
in the four benchmarks (ROOPS’ SLL, 11 methods), and 132 seconds in the real
world data structures (TreeMap, 32 methods). We manually checked that the
identified methods included a set of sufficient builders in all cases. Notice that
BEG is often performed for increasingly larger scopes, and the identified builders
can be reused across executions. Thus, builder identification times are amortized
across executions, which makes it difficult to attribute them to individual runs;
for this reason, we did not include builder identification times in BEAPI running
times in any of the
experiments. Notice that, for the larger scopes, which arguably are the most im-
portant, builders identification time is negligible in relation to generation times.
[Table 3 row — FAJITA AVL: height computation is wrong (leaves are assigned the wrong value); classified as error]
When SA ⊂ SR, it can be the case that the API generates a subset of the valid
structures, that repOK suffers from underspecification (missing constraints), or
both. In this case, the structures in
SR that do not belong to SA are witnesses of the problem, and the user has
to manually analyze them to find out where the error is. Here, we report the
(manually confirmed) underspecification errors in repOKs that are witnessed by
the aforementioned structures. In contrast, when SR ⊂ SA, it can be the case
that the API generates a superset of the valid structures, that repOK suffers
from overspecification (repOK is too strong), or both. The structures in SA that
do not belong to SR might point to the root of the error, and again they
have to be manually analyzed by the user. We report the (manually confirmed)
overspecification errors in repOKs that are witnessed by these structures. Finally,
it can be the case that there are structures in SR that do not belong to SA, and
there are structures (distinct from the former ones) in SA that do not belong
to SR. These might be due to faults in the API, flaws in the repOK, or both.
We report the manually confirmed flaws in repOKs witnessed by such structures
simply as errors (repOK describes a different set of structures than the one it
should). Notice that differences in the scope definitions for the approaches might
make sets SA and SR differ. This was only the case in the RBT and FibHeap
structures, where BEAPI generated a smaller set of structures for the same scope
than Korat due to balance constraints (as explained in Section 4.1). However,
these “false positives” can be easily revealed, since all the structures generated
by Korat were always included in the structures generated by BEAPI if a larger
scope was used for the latter approach. Using this insight we manually discarded
the “false positives” due to scope differences in RBT and FibHeap.
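The comparison just described can be summarized by the following hypothetical sketch, where SA is the set of canonical structures generated from the API and SR the set of structures satisfying repOK (both represented as canonical strings, an assumption made for the example):

    import java.util.Set;

    final class DiscrepancyClassifier {
        static String classify(Set<String> sa, Set<String> sr) {
            boolean onlyInSr = !sa.containsAll(sr);  // witnesses of missing constraints
            boolean onlyInSa = !sr.containsAll(sa);  // witnesses of a too-strong repOK
            if (onlyInSr && !onlyInSa) return "underspecification (or the API generates a subset)";
            if (onlyInSa && !onlyInSr) return "overspecification (or the API generates a superset)";
            if (onlyInSa && onlyInSr)  return "error: repOK describes a different set of structures";
            return "SA and SR agree for this scope";
        }
    }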
The results of this experiment are summarized in Table 3. We found flaws
in 9 out of 26 repOKs using the approach described above. The high number of
flaws discovered shows that problems in repOKs are hard to find manually,
and that BEAPI can be of great help for this task.
5 Related Work
BEG approaches have been shown effective in achieving high code coverage and
finding faults, as reported in various research papers [20,16,4,33]. Our goal here
is not to assess yet again the effectiveness of BEG suites, but to introduce an
approach that is straightforward to use in today’s software because it does not
require the manual work of writing formal specifications of the properties of the
inputs (e.g., repOKs). Different languages have been proposed to formally de-
scribe structural constraints for BEG, including Alloy’s relational logic (in the
so-called declarative style), employed by the TestEra tool [20]; and source code
in an imperative programming language (in the so-called operational style), as
used by Korat [4]. The declarative style has the advantage of being more concise
and simpler for people familiar with it; however, such familiarity is not common
among developers. The operational style can be more verbose, but since specifi-
cations and source code are written in the same language, developers tend to
prefer this style. UDITA [11] and HyTeK [29] propose to employ
a mix of the operational and the declarative styles to write the specifications,
as parts of the constraints are often easier to write in one style or the other.
With precise specifications both approaches can be used for BEG. Still, to use
these approaches developers have to be familiar with both specification styles,
and take the time and effort required to write the specifications. Model checkers
like Java Pathfinder [34] (JPF) can also perform BEG, but the user has to manu-
ally provide a “driver” for the generation: a program that the model checker can
use to generate the structures that will be fed to the SUT afterwards. Writing
a BEG driver often involves invoking API routines in combination with JPF’s
nondeterministic operators, hence the developer must become familiar with such
operators and put in some manual effort to use this approach. Furthermore, JPF
runs over a customized virtual machine in place of Java’s standard JVM, so there
is a significant overhead in running JPF compared to the use of the standard
JVM (employed by BEAPI). The results of a previous study [32] show that JPF
is significantly slower than Korat for BEG. Therein, Korat has been shown to
be the fastest and most scalable BEG approach at the time of publication [32].
This in part can be explained by its smart pruning of the search space of invalid
structures and the elimination of isomorphic structures. In contrast, BEAPI does
not require a repOK and works by making calls to the API.
An alternative kind of BEG consists of generating all inputs to cover all feasi-
ble (bounded) program paths, instead of generating all feasible bounded inputs.
This is the approach of systematic dynamic test generation, a variant of symbolic
execution [14]. This approach is implemented by many tools [13,12,24,8], and has
been successfully used to produce test suites with high code coverage, reveal real
program faults, and for proving memory safety of programs. Kiasan [9] and FA-
JITA [1] are also white-box test case generation approaches that require formal
specifications and aim for coverage of the SUT.
Linearization has been employed to eliminate isomorphic structures in tradi-
tional model checkers [15,28], and also in software model checkers [35]. A previous
study experimented with state matching in JPF and proposed several approaches
for pruning the search space for program inputs using linearization, for both con-
crete and symbolic execution [35]. As stated before, concrete execution in JPF
requires the user to provide a driver. The symbolic approach attempts to find
inputs to cover paths of the SUT; we perform BEG instead. Linearization has
also been employed for test suite minimization [36].
6 Conclusions
Software quality assurance can be greatly improved thanks to modern software
analysis techniques, among which automated test generation techniques play
an outstanding role [6,18,10,23,19,12,20,4,13]. Random and search-based ap-
proaches have shown great success in automatically generating test suites with
very good coverage and mutation metrics, but their random nature does not
allow these techniques to precisely characterize the families of software behav-
iors that the generated tests cover. Systematic techniques such as those based
on model checking, symbolic execution or bounded exhaustive generation, cover
a precise set of behaviors, and thus can provide specific correctness guarantees.
In this paper, we presented BEAPI, a technique that aims at facilitating
the application of a systematic technique, bounded exhaustive input genera-
tion, by producing structures solely from a component’s API, without the need
for a formal specification of the properties of the structures. BEAPI can generate
bounded exhaustive suites from components with implicit invariants, reducing
the burden of providing formal specifications and of tailoring those specifications
for improved generation. Thanks to a number of optimizations, including an auto-
mated identification of builder routines and a canonicalization/state matching
mechanism, BEAPI can generate bounded exhaustive suites with a performance
comparable to that of the fastest specification-based technique, Korat [4]. We
have also identified the characteristics of a component that may make it more
suitable for specification-based generation or for API-based generation.
Finally, we have shown how specification based approaches and BEAPI can
complement each other, depicting how BEAPI can be used to assess repOK imple-
mentations. Using this approach, we found a number of subtle errors in repOK
specifications taken from the literature. Thus, techniques that require repOK
specifications (e.g., [30]), as well as techniques that require bounded-exhaustive
suites (e.g., [21]) can benefit from our presented generation technique.
References
1. Abad, P., Aguirre, N., Bengolea, V.S., Ciolek, D.A., Frias, M.F., Galeotti, J.P.,
Maibaum, T., Moscato, M.M., Rosner, N., Vissani, I.: Improving test generation
under rich contracts by tight bounds and incremental SAT solving. In: Sixth IEEE
International Conference on Software Testing, Verification and Validation, ICST
2013, Luxembourg, Luxembourg, March 18-22, 2013. pp. 21–30. IEEE Computer
Society (2013)
2. Ammann, P., Offutt, J.: Introduction to Software Testing. Cambridge University
Press, Cambridge (2016)
3. Andoni, A., Daniliuc, D., Khurshid, S., Marinov, D.: Evaluating the "small scope
hypothesis". Tech. rep., MIT CSAIL (10 2002)
4. Boyapati, C., Khurshid, S., Marinov, D.: Korat: automated testing based on java
predicates. In: Frankl, P.G. (ed.) Proceedings of the International Symposium on
Software Testing and Analysis, ISSTA 2002, Roma, Italy, July 22-24, 2002. pp.
123–133. ACM (2002)
5. Braione, P., Denaro, G., Mattavelli, A., Pezzè, M.: Combining symbolic execution
and search-based testing for programs with complex heap inputs. In: Bultan, T.,
Sen, K. (eds.) Proceedings of the 26th ACM SIGSOFT International Symposium
on Software Testing and Analysis, Santa Barbara, CA, USA, July 10 - 14, 2017.
pp. 90–101. ACM (2017)
6. Cadar, C., Dunbar, D., Engler, D.R.: KLEE: unassisted and automatic generation
of high-coverage tests for complex systems programs. In: Draves, R., van Renesse,
R. (eds.) 8th USENIX Symposium on Operating Systems Design and Implementa-
tion, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings.
pp. 209–224. USENIX Association (2008)
7. Chen, T.Y., Kuo, F.C., Liu, H., Poon, P.L., Towey, D., Tse, T.H., Zhou, Z.Q.:
Metamorphic testing: A review of challenges and opportunities. ACM Comput.
Surv. 51(1) (Jan 2018)
8. Christakis, M., Godefroid, P.: Proving memory safety of the ANI windows image
parser using compositional exhaustive testing. In: D’Souza, D., Lal, A., Larsen,
K.G. (eds.) Verification, Model Checking, and Abstract Interpretation - 16th In-
ternational Conference, VMCAI 2015, Mumbai, India, January 12-14, 2015. Pro-
ceedings. Lecture Notes in Computer Science, vol. 8931, pp. 373–392. Springer
(2015)
9. Deng, X., Robby, Hatcliff, J.: Kiasan: A verification and test-case generation frame-
work for java based on symbolic execution. In: Leveraging Applications of Formal
Methods, Second International Symposium, ISoLA 2006, Paphos, Cyprus, 15-19
November 2006. p. 137. IEEE Computer Society (2006)
10. Fraser, G., Arcuri, A.: Evosuite: automatic test suite generation for object-oriented
software. In: Gyimóthy, T., Zeller, A. (eds.) SIGSOFT/FSE’11 19th ACM SIG-
SOFT Symposium on the Foundations of Software Engineering (FSE-19) and
ESEC’11: 13th European Software Engineering Conference (ESEC-13), Szeged,
Hungary, September 5-9, 2011. pp. 416–419. ACM (2011)
11. Gligoric, M., Gvero, T., Jagannath, V., Khurshid, S., Kuncak, V., Marinov, D.:
Test generation through programming in UDITA. In: Kramer, J., Bishop, J., De-
vanbu, P.T., Uchitel, S. (eds.) Proceedings of the 32nd ACM/IEEE International
Conference on Software Engineering - Volume 1, ICSE 2010, Cape Town, South
Africa, 1-8 May 2010. pp. 225–234. ACM (2010)
12. Godefroid, P., Klarlund, N., Sen, K.: DART: directed automated random testing.
In: Sarkar, V., Hall, M.W. (eds.) Proceedings of the ACM SIGPLAN 2005 Confer-
ence on Programming Language Design and Implementation, Chicago, IL, USA,
June 12-15, 2005. pp. 213–223. ACM (2005)
13. Godefroid, P., Levin, M.Y., Molnar, D.A.: SAGE: whitebox fuzzing for security
testing. Commun. ACM 55(3), 40–44 (2012)
14. Godefroid, P., Sen, K.: Combining model checking and testing. In: Clarke, E.M.,
Henzinger, T.A., Veith, H., Bloem, R. (eds.) Handbook of Model Checking, pp.
613–649. Springer (2018)
15. Iosif, R.: Symmetry reduction criteria for software model checking. In: Bosnacki,
D., Leue, S. (eds.) Model Checking of Software, 9th International SPIN Workshop,
Grenoble, France, April 11-13, 2002, Proceedings. Lecture Notes in Computer Sci-
ence, vol. 2318, pp. 22–41. Springer (2002)
16. Khurshid, S., Marinov, D.: Checking java implementation of a naming architecture
using testera. Electron. Notes Theor. Comput. Sci. 55(3), 322–342 (2001)
17. Liskov, B., Guttag, J.: Program Development in Java: Abstraction, Specification,
and Object-Oriented Design. Addison-Wesley Longman Publishing Co., Inc., USA,
1st edn. (2000)
18. Luckow, K.S., Pasareanu, C.S.: Symbolic pathfinder v7. ACM SIGSOFT Softw.
Eng. Notes 39(1), 1–5 (2014)
19. Ma, L., Artho, C., Zhang, C., Sato, H., Gmeiner, J., Ramler, R.: GRT: an au-
tomated test generator using orchestrated program analysis. In: Cohen, M.B.,
Grunske, L., Whalen, M. (eds.) 30th IEEE/ACM International Conference on Au-
tomated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015.
pp. 842–847. IEEE Computer Society (2015)
20. Marinov, D., Khurshid, S.: Testera: A novel framework for automated testing of
java programs. In: 16th IEEE International Conference on Automated Software
Engineering (ASE 2001), 26-29 November 2001, Coronado Island, San Diego, CA,
USA. p. 22. IEEE Computer Society (2001)
21. Molina, F., Ponzio, P., Aguirre, N., Frias, M.: EvoSpex: An evolutionary algorithm
for learning postconditions. In: Proceedings of the 43rd ACM/IEEE International
Conference on Software Engineering ICSE 2021, Virtual (originally Madrid, Spain),
23-29 May 2021 (2021)
22. Myers, G.J., Sandler, C., Badgett, T.: The Art of Software Testing. Wiley Pub-
lishing, 3rd edn. (2011)
23. Pacheco, C., Lahiri, S.K., Ernst, M.D., Ball, T.: Feedback-directed random test
generation. In: 29th International Conference on Software Engineering (ICSE
2007), Minneapolis, MN, USA, May 20-26, 2007. pp. 75–84. IEEE Computer So-
ciety (2007)
24. Pham, L.H., Le, Q.L., Phan, Q., Sun, J.: Concolic testing heap-manipulating pro-
grams. In: ter Beek, M.H., McIver, A., Oliveira, J.N. (eds.) Formal Methods - The
Next 30 Years - Third World Congress, FM 2019, Porto, Portugal, October 7-11,
2019, Proceedings. Lecture Notes in Computer Science, vol. 11800, pp. 442–461.
Springer (2019)
25. Politano, M., Bengolea, V., Molina, F., Aguirre, N., Frias, M.F., Ponzio, P.: Effi-
cient Bounded Exhaustive Input Generation from Program APIs paper’s artifact.
https://doi.org/10.5281/zenodo.7574758
26. Politano, M., Bengolea, V., Molina, F., Aguirre, N., Frias, M.F., Ponzio, P.: Effi-
cient Bounded Exhaustive Input Generation from Program APIs paper’s website.
https://sites.google.com/view/bounded-exhaustive-api
27. Ponzio, P., Bengolea, V.S., Politano, M., Aguirre, N., Frias, M.F.: Automatically
identifying sufficient object builders from module apis. In: Hähnle, R., van der
Aalst, W.M.P. (eds.) Fundamental Approaches to Software Engineering - 22nd
International Conference, FASE 2019, Held as Part of the European Joint Confer-
ences on Theory and Practice of Software, ETAPS 2019, Prague, Czech Republic,
April 6-11, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11424, pp.
427–444. Springer (2019)
28. Robby, Dwyer, M.B., Hatcliff, J., Iosif, R.: Space-reduction strategies for model
checking dynamic software. Electron. Notes Theor. Comput. Sci. 89(3), 499–517
(2003)
29. Rosner, N., Bengolea, V., Ponzio, P., Khalek, S.A., Aguirre, N., Frias, M.F., Khur-
shid, S.: Bounded exhaustive test input generation from hybrid invariants. SIG-
PLAN Not. 49(10), 655–674 (Oct 2014)
30. Rosner, N., Geldenhuys, J., Aguirre, N., Visser, W., Frias, M.F.: BLISS: improved
symbolic execution by bounded lazy initialization with SAT support. IEEE Trans.
Software Eng. 41(7), 639–660 (2015)
31. Rosner, N., Pombo, C.G.L., Aguirre, N., Jaoua, A., Mili, A., Frias, M.F.: Parallel
bounded verification of alloy models by transcoping. In: Cohen, E., Rybalchenko,
A. (eds.) Verified Software: Theories, Tools, Experiments - 5th International Con-
ference, VSTTE 2013, Menlo Park, CA, USA, May 17-19, 2013, Revised Selected
Papers. Lecture Notes in Computer Science, vol. 8164, pp. 88–107. Springer (2013)
32. Siddiqui, J.H., Khurshid, S.: An empirical study of structural constraint solving
techniques. In: Breitman, K.K., Cavalcanti, A. (eds.) Formal Methods and Soft-
ware Engineering, 11th International Conference on Formal Engineering Methods,
ICFEM 2009, Rio de Janeiro, Brazil, December 9-12, 2009. Proceedings. Lecture
Notes in Computer Science, vol. 5885, pp. 88–106. Springer (2009)
33. Sullivan, K.J., Yang, J., Coppit, D., Khurshid, S., Jackson, D.: Software assurance
by bounded exhaustive testing. In: Avrunin, G.S., Rothermel, G. (eds.) Proceedings
of the ACM/SIGSOFT International Symposium on Software Testing and Analy-
sis, ISSTA 2004, Boston, Massachusetts, USA, July 11-14, 2004. pp. 133–142. ACM
(2004)
34. Visser, W., Mehlitz, P.C.: Model checking programs with java pathfinder. In: Gode-
froid, P. (ed.) Model Checking Software, 12th International SPIN Workshop, San
Francisco, CA, USA, August 22-24, 2005, Proceedings. Lecture Notes in Computer
Science, vol. 3639, p. 27. Springer (2005)
35. Visser, W., Pasareanu, C.S., Pelánek, R.: Test input generation for java contain-
ers using state matching. In: Pollock, L.L., Pezzè, M. (eds.) Proceedings of the
ACM/SIGSOFT International Symposium on Software Testing and Analysis, IS-
STA 2006, Portland, Maine, USA, July 17-20, 2006. pp. 37–48. ACM (2006)
36. Xie, T., Marinov, D., Notkin, D.: Rostra: A framework for detecting redundant
object-oriented unit tests. In: 19th IEEE International Conference on Automated
Software Engineering (ASE 2004), 20-25 September 2004, Linz, Austria. pp. 196–
205. IEEE Computer Society (2004)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Feature-Guided Analysis of Neural Networks
Divya Gopinath1, Luca Lungeanu4, Ravi Mangal2, Corina Păsăreanu1,2, Siqi Xie3, and Huanfeng Yu3
1 KBR, NASA Ames, Moffett Field CA 94035, USA ([email protected])
2 Carnegie Mellon University, Pittsburgh PA 15213, USA
3 Boeing Research & Technology, Santa Clara CA, USA
4 Lynbrook High School, San Jose CA 95129, USA
1 Introduction
The remarkable computational capabilities unlocked by neural networks have
led to the emergence of a rapidly growing class of neural-network based software
applications. Unlike traditional software applications whose logic is driven from
input-output specifications, neural networks are inherently opaque, as their logic
is learned from examples of input-output pairs. The lack of high-level abstractions
makes it challenging to interpret the logical reasoning employed by a neural
network and hinders the use of standard software engineering practices such as
automated testing, debugging, requirements analysis, and formal verification that
have been established for producing high-quality software.
In this work, we aim to address this challenge by proposing a feature-guided
approach to neural network engineering. Our proposed approach is illustrated
in Figure 1. We draw from the insight that, in a neural network, early layers
typically extract the important features of the inputs and the dense layers close
to the output contain logic in terms of these features to make decisions [12].
The approach therefore first extracts high-level, human-understandable feature
representations from the trained neural network, which allows us to formally link
domain-specific, human-understandable features to the internal logic of a trained
model. This in turn enables us to reason about the model through the lens of
the features and to drive the above-mentioned software engineering activities.
3 Feature-Guided Analyses
The extracted feature representations as formal, checkable rules enable multiple
analyses, as listed below.
– Data analysis and testing. We can assess the quality of the training and
testing data in terms of coverage of different features. We can leverage the
extracted feature representations to automatically retrieve new data that has
the necessary features, by checking that the (unlabeled) data satisfies the
corresponding rules. We can also use the extracted rules to label new data
with their corresponding features, enabling further data-coverage analysis.
– Debugging and explanations of network behavior. We can leverage
the feature rules to uncover the high-level, human-understandable reasons
for a neural network model making correct and incorrect predictions. In the
latter case we can repair the model, which involves automatically selecting
and training based on inputs with features that caused incorrect predictions.
– Formulation and analysis of requirements. Extracted feature represen-
tations are the key to enabling verification of models with respect to high-level
safety requirements (pre ⇒ post). Specifically, the constraint pre in the
requirement, expressed over features, can be translated into a constraint pre′
expressed over activation values, by substituting the features with their cor-
responding representations. The modified requirement pre′ ⇒ post can then
be checked automatically using off-the-shelf verification tools [10], as illustrated below.
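For instance, using the "center-line present" rule reported later in Table 1, a hypothetical requirement "center-line present ⇒ |CTE| ≤ ε" (with ε an assumed tolerance on the cross-track error) would be translated into (N3,9 <= 1.39 ∧ N3,9 > −0.98 ∧ N3,1 > −0.99 ∧ N3,4 > −0.99) ⇒ |CTE| ≤ ε, a constraint expressed purely over activation values and outputs that a tool such as Marabou [10] can check.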
4 Case Studies
We use two case studies to present initial empirical evidence in support of our
ideas. In particular, we show that Algorithm 2.1 with decision tree learning is
successful in extracting feature representations. We also demonstrate how these
representations can be used for analyzing the behavior of neural networks.
Table 1: Rules for TaxiNet. d: annotated dataset; #d: total number of instances
for that feature value in d; Rd: recall (%) on d; Pv, Rv: precision (%) and recall
(%) on the validation set. Rules with the highest Rd are shown; each rule implies
its feature value.

Feature (value)       | Rule                                                                | #d  | Rd   | Pv    | Rv
Center-line (present) | N3,9 <= 1.39 ∧ N3,9 > −0.98 ∧ N3,1 > −0.99 ∧ N3,4 > −0.99           | 202 | 92   | 93    | 100
Center-line (absent)  | N3,9 > 1.39 ∧ N3,6 <= −0.81                                         | 25  | 40   | 100   | 12
Shadow (present)      | N2,45 <= −0.75 ∧ N1,50 > −0.91 ∧ N1,9 <= −0.95                      | 30  | 86   | 100   | 69.23
Shadow (absent)       | N2,4 <= −0.73 ∧ N2,3 > 0.06 ∧ N2,9 > −0.98                          | 200 | 94.5 | 97    | 100
Skid (dark)           | N2,8 <= −0.98 ∧ N2,10 <= 0.32                                       | 40  | 52.5 | 94.44 | 43.5
Skid (no)             | N1,28 <= −0.93 ∧ N2,58 > −0.88                                      | 5   | 60   | 0     | 0
Skid (light)          | N2,8 > −0.997 ∧ N2,48 > −0.991 ∧ N2,42 <= −0.342 ∧ N2,25 <= 1.82    | 182 | 97.8 | 93.4  | 95
Position (right)      | N2,2 > −0.99 ∧ N2,24 <= −0.3 ∧ N2,9 <= −1.19                        | 101 | 90   | 92.3  | 95.09
Position (left)       | N1,26 > −0.55 ∧ N1,20 <= −0.29 ∧ N1,52 <= −0.96                     | 109 | 91   | 100   | 75.22
Position (on)         | N3,6 > −0.17 ∧ N3,6 <= 0.45 ∧ N3,3 > −0.38 ∧ N3,7 <= −0.55 ∧ N3,0 <= 2.56 ∧ N3,5 <= −0.95 | 11 | 45 | 13.5 | 45.45
Heading (away)        | N1,5 > 3.29 ∧ N1,90 <= −0.87 ∧ N1,81 <= −0.76                       | 120 | 65   | 62.2  | 90.6
Heading (towards)     | N1,5 <= 3.29 ∧ N1,37 > −0.84 ∧ N1,50 <= 8.22 ∧ N1,53 <= −0.39 ∧ N1,64 > −0.98 ∧ N1,45 <= −0.26 ∧ N1,34 <= 12.21 | 102 | 83 | 73.9 | 16.5
image impacting the rule (highlighted in red) is the center-line, indicating that
indeed the rules identify the feature. On the other hand, in the absence of the
center-line, it is unclear what information is used by the model (and the image
leads to error). The heatmaps for the shadow and skid also correctly highlight
the part of the image with the shadow of the nose and the skid marks. We used
such visualization techniques to further validate the rules.
Labeling New Data The rules extracted from a small set of manually
annotated data can be leveraged to annotate a much larger data set. We used the
rules for center-line (present/absent) to label all of the test data (2000 images).
We chose the rule with highest Rd for the experiments. However, more rules could
be chosen to increase coverage. 1822 of the images satisfied the rule for "center-
line present" and 79 images for "center-line absent". We visually checked some
of the images to estimate the accuracy of the labelling. We similarly annotated
more images for the shadow and skid features. These new labels enable further
data-coverage analysis over the train and test datasets.
Feature-Guided Analysis We performed preliminary experiments to demonstrate
the potential of feature-guided analyses. We first calculated the model accuracy
(MAE) on subsets of the data labelled with the feature present and absent,
respectively. We also determined the % of inputs in the respective subsets violating
the correctness property. The results are summarized in Table 2.

Table 2: Feature-Guided Analysis Results
Rule                  | MAE CTE | MAE HE | errors
"center-line present" | 0.36    | 1.63   | 45%
"center-line absent"  | 0.62    | 2.68   | 75%
"shadow present"      | 0.66    | 2.23   | 42%
"shadow absent"       | 0.34    | 1.55   | 7%
"dark skid"           | 0.43    | 1.84   | 52%
"light or no skid"    | 0.33    | 1.49   | 42%
These results can be used by developers to better understand and debug the
model behavior. For instance, the model accuracy computed for the subsets with
"shadow present" and "dark skid", respectively, is poor and also a high % of the
respective inputs violate the correctness property. This information can be used
by developers to retrieve more images with shadows and dark skids, to retrain
the model and improve its performance. The extracted rules can be leveraged to
automate the retrieval.
Furthermore, we observe that in the absence of the center-line feature, the
model has difficulty in making correct predictions. This is not surprising, as the
presence of the center-line can be considered as a (rudimentary) input requirement
for the center-line tracking application. Indeed, in the absence of the center-line it
is hard to envision how the network can correctly estimate the airplane’s position
from the image. The network may use other cues on the runway, leading to errors. We
can thus consider the presence of the center-line feature as part of the ODD
(operational design domain) for the application. The rules for the center-line feature can be deployed as a
run-time monitor to either pass inputs satisfying the rules for "present" or reject
those that satisfy the rules for "absent", ensuring that the model operates in the
safe zone as defined by the ODD, and at the same time increasing its accuracy.
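A hypothetical sketch of such a monitor, using the center-line rules from Table 1 (the layer-3 activation array n3 and the indexing convention are our assumptions):

    final class CenterLineMonitor {
        // n3[i] denotes activation N3,i of the third dense layer for the current input.
        static boolean present(double[] n3) {
            return n3[9] <= 1.39 && n3[9] > -0.98 && n3[1] > -0.99 && n3[4] > -0.99;
        }
        static boolean absent(double[] n3) {
            return n3[9] > 1.39 && n3[6] <= -0.81;
        }
        // Pass an input to the model only if it falls inside the ODD.
        static boolean inOdd(double[] n3) {
            return present(n3) && !absent(n3);
        }
    }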
We also experimented with generating rules to explain correct and incorrect
behavior in terms of combinations of features such as: (center-line present) ∧
considering more than one rule for each feature value (here, we only consider
pure rules with the highest recall Rd on dataset d used for feature extraction).
5 Related Work
6 Conclusion
effective techniques for checking them. We also plan to apply Marabou [10] for the
verification of safety properties expressed in terms of high-level features. Finally,
we plan to investigate neuro-symbolic techniques to develop high-assurance
neural network models.
References
1. nuimages, https://www.nuscenes.org/nuimages
2. Yolov4-tiny, https://github.com/WongKinYiu/PyTorch_YOLOv4
3. Ashmore, R., Calinescu, R., Paterson, C.: Assuring the machine learning lifecycle:
Desiderata, methods, and challenges. ACM Comput. Surv. 54(5), 111:1–111:39
(2021). https://doi.org/10.1145/3453444
4. Beland, S., Chang, I., Chen, A., Moser, M., Paunicka, J.L., Stuart, D., Vian, J.,
Westover, C., Yu, H.: Towards assurance evaluation of autonomous systems. In:
IEEE/ACM International Conference On Computer Aided Design, ICCAD 2020,
San Diego, CA, USA, November 2-5, 2020. pp. 84:1–84:6. IEEE (2020).
https://doi.org/10.1145/3400302.3415785
5. Bondarenko, A., Aleksejeva, L., Jumutc, V., Borisov, A.: Classification
tree extraction from trained artificial neural networks. Procedia Computer
Science 104, 556–563 (2017). https://doi.org/10.1016/j.procs.2017.01.172,
https://www.sciencedirect.com/science/article/pii/S1877050917301734
(ICTE 2016, Riga Technical University, Latvia)
6. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A.,
Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous
driving. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition. pp. 11621–11631 (2020)
7. Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad-cam++:
Generalized gradient-based visual explanations for deep convolutional networks. In:
2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp.
839–847. IEEE (2018)
8. Fahmy, H.M., Pastore, F., Briand, L.C.: HUDD: A tool to debug dnns
for safety analysis. In: 44th 2022 IEEE/ACM International Conference on
Software Engineering: Companion Proceedings, ICSE Companion 2022, Pitts-
burgh, PA, USA, May 22-24, 2022. pp. 100–104. IEEE (2022).
https://doi.org/10.1109/ICSE-Companion55297.2022.9793750
9. Frew, E., McGee, T., Kim, Z., Xiao, X., Jackson, S., Morimoto, M., Rathinam, S.,
Padial, J., Sengupta, R.: Vision-based road-following using a small autonomous air-
craft. In: 2004 IEEE Aerospace Conference Proceedings (IEEE Cat. No.04TH8720).
vol. 5, pp. 3006–3015 Vol.5 (2004). https://doi.org/10.1109/AERO.2004.1368106
10. Katz, G., Huang, D.A., Ibeling, D., Julian, K., Lazarus, C., Lim, R., Shah, P.,
Thakoor, S., Wu, H., Zeljić, A., et al.: The marabou framework for verification and
analysis of deep neural networks. In: International Conference on Computer Aided
Verification. pp. 443–452. Springer (2019)
11. Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., sayres, R.:
Interpretability beyond feature attribution: Quantitative testing with concept
activation vectors (TCAV). In: Dy, J., Krause, A. (eds.) Proceedings of the 35th
International Conference on Machine Learning. Proceedings of Machine Learning
Research, vol. 80, pp. 2668–2677. PMLR (10–15 Jul 2018), https://proceedings.
mlr.press/v80/kim18d.html
12. Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., Carter, S.: Zoom in:
An introduction to circuits. Distill (2020). https://doi.org/10.23915/distill.00024.001,
https://distill.pub/2020/circuits/zoom-in
13. Schwalbe, G.: Concept embedding analysis: A review (2022).
https://doi.org/10.48550/ARXIV.2203.13909, https://arxiv.org/abs/2203.13909
14. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Scaled-yolov4: Scaling cross stage
partial network. In: Proceedings of the IEEE/cvf conference on computer vision
and pattern recognition. pp. 13029–13038 (2021)
15. Yeh, C.K., Kim, B., Ravikumar, P.: Human-centered concept explanations for
neural networks. In: Neuro-Symbolic Artificial Intelligence: The State of the Art,
pp. 337–352. IOS Press (2021)
16. Zhang, M., Zhang, Y., Zhang, L., Liu, C., Khurshid, S.: Deeproad: Gan-based meta-
morphic testing and input validation framework for autonomous driving systems.
In: Huchard, M., Kästner, C., Fraser, G. (eds.) Proceedings of the 33rd ACM/IEEE
International Conference on Automated Software Engineering, ASE 2018, Montpel-
lier, France, September 3-7, 2018. pp. 132–142. ACM (2018).
https://doi.org/10.1145/3238147.3238187
17. Zhou, B., Sun, Y., Bau, D., Torralba, A.: Interpretable basis decomposition for
visual explanation. In: Proceedings of the European Conference on Computer Vision
(ECCV). pp. 119–134 (2018)
18. Zohdinasab, T., Riccio, V., Gambi, A., Tonella, P.: Efficient and effective feature
space exploration for testing deep learning systems. ACM Trans. Softw. Eng.
Methodol. (Jun 2022). https://doi.org/10.1145/3544792 (just accepted)
JavaBIP meets VerCors: Towards the Safety of
Concurrent Software Systems in Java⋆
⋆ L. Safina and S. Bliudze were partially supported by ANR Investissements d’avenir (ANR-16-IDEX-0004 ULNE) and project NARCO (ANR-21-CE48-0011). P. van den Bos, M. Huisman and R. Rubbens were supported by the NWO VICI 639.023.710 Mercedes project.
1 Introduction
Modern software systems are inherently concurrent: they consist of multiple
components that run simultaneously and share access to resources. Component
interaction leads to resource contention, and if not coordinated properly, can
compromise safety-critical operations. The concurrent nature of such interactions
is the root cause of the sheer complexity of the resulting software [9]. Model-
based coordination frameworks such as Reo [5] and BIP [6] address this issue by
providing models with a formally defined behaviour and verification tools.
JavaBIP [10] is an open-source Java implementation of the BIP coordina-
tion mechanism. It separates the application model into component behaviour,
modelled as Finite State Machines (FSMs), and glue, which defines the possi-
ble stateless interactions among components in terms of synchronisation con-
straints. The overall behaviour of an application is to be enforced at run time
by the framework’s engine. Unlike BIP, JavaBIP does not provide automatic
code generation from the provided model; instead it realises the coordination
of existing software components in an exogenous manner, relying on component
annotations that provide an abstract view of the software under development.
To model component behaviour, methods of a JavaBIP program are anno-
tated with FSM transitions. These annotated methods model the actions of the
program components. Computations are assumed to be terminating and non-
blocking. Furthermore, side-effects are assumed to be either represented by the
change of the FSM state, or to be irrelevant for the system behaviour. Any
correctness argument for the system depends on these assumptions. A limita-
tion of JavaBIP is that it does not guarantee that these assumptions hold. This
paper proposes a joint extension of JavaBIP and VerCors [11] providing such
guarantees about the implementation statically and at run time.
VerCors [11] is a state-of-the-art deductive verification tool for concurrent
programs that uses permission-based separation logic [3]. This logic is an exten-
sion of Hoare logic that allows specifying properties using contract annotations.
These contract annotations include permissions, pre- and postconditions and
loop invariants. VerCors automatically verifies programs with contract annota-
tions. To verify JavaBIP models, we (i) extend JavaBIP annotations with ver-
ification annotations, and (ii) adapt VerCors to support JavaBIP annotations.
VerCors was chosen for integration with JavaBIP because it supports multi-
threaded Java, which makes it straightforward to express JavaBIP concepts in
its internal representation. To analyze JavaBIP models, VerCors transforms the
model with verification annotations into contract annotations, leveraging their
structure as specified by the FSM annotations and the glue.
For some programs VerCors requires extra contract annotations. This is gen-
erally the case when while statements or recursive methods are used. To
enable properties to be analysed before all necessary annotations have been
added, we extend the JavaBIP engine with support for run-time verification. During
a run of the program, the verification annotations are checked for that specific
program execution at particular points of interest, such as when a JavaBIP com-
ponent executes a transition. The run-time verification support is set up in such
a way that it ignores any verification annotations that were already statically
verified, reducing the overhead of run-time verification.
This paper presents the use of deductive and run-time verification to prove
assumptions of JavaBIP models. We make the following contributions:
– We extend regular JavaBIP annotations with pre- and postconditions for
transitions and invariants for components and states. This allows checking
design assumptions, which are otherwise left implicit and unchecked.
– We extend VerCors to deductively verify a JavaBIP model, taking into ac-
count its FSM and glue structure.
– We add support for run-time verification to the JavaBIP engine.
– We link VerCors and the JavaBIP engine such that deductively proven an-
notations need not be monitored at run-time.
– Finally, we demonstrate our approach on a variant of the Casino case study
from the VerifyThis Collaborative Long Term Challenge.
Tool binaries and case study sources are available through the artefact [7].
2 Related Work
There are several approaches to analyse behaviours of abstract models in the lit-
erature. Bliudze et al. propose an approach allowing verification of infinite state
BIP models in the presence of data transfer between components [8]. Abdellatif
et al. used the BIP framework to verify Ethereum smart contracts written in
Solidity [1]. Mavridou et al. introduce the VeriSolid framework, which generates
Solidity code from verified models [13]. André et al. describe a workflow to anal-
yse Kmelia models [4]. They also describe the COSTOTest tool, which runs tests
that interact with the model. Thus, these approaches do not consider verifica-
tion of model implementation, which is what we do with Verified JavaBIP. Only
COSTOTest establishes a connection between the model and implementation,
but it does not guarantee memory safety or correctness.
There is also previous work on combining deductive and runtime verification.
The following discussion is not exhaustive. Generally, these works do not support
concurrent Java and JavaBIP. Nimmer et al. infer invariants with Daikon and
check them with ESC/Java [14]. However, they do not check against an abstract
model, and the results are not used to optimize execution. Bodden et al. and
Stulova et al. optimize run-time checks using static analysis [12,16]. However,
Stulova et al. do not support state machines, and Bodden et al. do not support
data in state machines. The STARVOORS tool by Ahrendt et al. is comparable
to Verified JavaBIP [2]. Some minor differences include the type of state machine
used, and how Hoare triples are expressed. The major difference is that it is not
trivial to support concurrency in STARVOORS. VerCors and Verified JavaBIP
use separation logic, which makes concurrency support straightforward.
3 JavaBIP and Verification Annotations
JavaBIP annotations capture the FSM specification and describe the behaviour
of a component. They are attached to classes, methods or method parameters,
and were first introduced by Bliudze et al. [10]. Listing 1 shows an example of
JavaBIP annotations. @ComponentType indicates a class is a JavaBIP component
and specifies its initial state. In the example this is the WAITING state. @Port
declares a transition label. Method annotations include @Transition, @Guard
and @Data. @Transition consists of a port name, start and end states, and
optionally a guard. The example transition goes from WAITING to PINGED when
the PING port is triggered. The transition has no guard so it may always be taken.
@Guard declares a method which indicates if a transition is enabled. @Data either
declares a getter method as outgoing data, or a method parameter as incoming
data. Note that the example does not specify when ports are activated. This is
specified separately from the JavaBIP component as glue [10].
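Since Listing 1 itself is not reproduced in this excerpt, the following is a hypothetical reconstruction in its spirit, based on the description above; the package names and annotation attributes follow public JavaBIP releases and are assumptions here, not the paper's exact listing.

    import org.javabip.annotations.*;   // assumed package; older releases use org.bip.annotations
    import org.javabip.api.PortType;

    @Ports({ @Port(name = "ping", type = PortType.enforceable) }) // declares the PING label
    @ComponentType(initial = "WAITING", name = "pingComponent")   // initial state WAITING
    public class PingComponent {

        // Goes from WAITING to PINGED when the PING port is triggered;
        // no guard is given, so the transition may always be taken.
        @Transition(name = "ping", source = "WAITING", target = "PINGED")
        public void ping() { /* component action */ }

        // A guard method indicating whether some transition is enabled.
        @Guard(name = "ready")
        public boolean ready() { return true; }

        // Outgoing data: a getter exposed to other components.
        @Data(name = "counter")
        public int counter() { return 0; }
    }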
We added component invariants and state predicates to Verified JavaBIP as
class annotations. @Invariant(expr) indicates expr must hold after each com-
ponent state change. @StatePredicate(state, expr) indicates expr must hold
in state state. Pre- and postconditions were also added to the @Transition an-
notation. They have to hold before and after execution of the transition. @Pure
indicates that a method is side-effect-free, and is used with @Guard and @Data.
Annotation arguments should follow the grammar of Java expressions. We do
not support lambda expressions, method references, switch expressions, new,
instanceOf, and wildcard arguments. In addition, as VerCors does not yet sup-
port Java features such as generics and inheritance, models that use these cannot
be verified. These limitations might be lifted in the future.
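A sketch of how the verification annotations described above might look on the same component; the attribute names (expr, pre, post) are assumptions inferred from the description, not necessarily the exact Verified JavaBIP syntax.

    @Invariant("count >= 0")                               // holds after each state change
    @StatePredicate(state = "PINGED", expr = "count > 0")  // holds whenever in state PINGED
    @Ports({ @Port(name = "ping", type = PortType.enforceable) })
    @ComponentType(initial = "WAITING", name = "verifiedPingComponent")
    public class VerifiedPingComponent {
        private int count = 0;

        @Transition(name = "ping", source = "WAITING", target = "PINGED",
                    pre = "count >= 0", post = "count > 0")
        public void ping() { count++; }

        @Pure                                              // side-effect-free, used with @Guard
        @Guard(name = "ready")
        public boolean ready() { return count >= 0; }
    }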
4 Architecture of Verified JavaBIP
The architecture of Verified JavaBIP is shown in Figure 1. The user starts with a
JavaBIP model, optionally with verification annotations. The user then has two
choices: verify the model with VerCors, or execute it with the JavaBIP Engine.
We extended VerCors to transform the JavaBIP model into the VerCors in-
ternal representation, Common Object Language (COL). An example of this
transformation is given in Listing 2. If verification succeeds, the JavaBIP model
is memory safe, has no data races, and the components respect the properties
specified in the verification annotations. In this case, no extra run-time veri-
fication is needed. If verification fails, there are either memory safety issues,
components violate properties, or the prover timed out. In the first case, the
user needs to change the program or annotations and retry verification with
VerCors. This is necessary because memory safety properties cannot be checked
with the JavaBIP engine, and therefore safe execution of the JavaBIP model is
not guaranteed. In the second and third case, VerCors produces a verification
report with the verification result for each property.
We extended the JavaBIP engine with run-time verification support. If a
verification report is included with the JavaBIP model, the JavaBIP engine uses
it to only verify at run-time the verification annotations that were not verified
deductively. If no verification report is included, the JavaBIP engine verifies all
verification annotations at run time.
References
1. Abdellatif, T., Brousmiche, K.L.: Formal verification of smart contracts based on
users and blockchain behaviors models. In: 9th IFIP International Conference on
New Technologies, Mobility and Security (NTMS), pp. 1–5. IEEE (Feb 2018).
https://doi.org/10.1109/NTMS.2018.8328737
2. Ahrendt, W., Chimento, J.M., Pace, G.J., Schneider, G.: Verifying data- and
control-oriented properties combining static and runtime verification: theory and
tools. Form. Methods Syst. Des. 51(1), 200–265 (Aug 2017). https://doi.org/10.
1007/s10703-017-0274-y
3. Amighi, A., Hurlin, C., Huisman, M., Haack, C.: Permission-based separation logic
for multithreaded Java programs. Logical Methods in Computer Science 11(1) (Feb
2015). https://doi.org/10.2168/LMCS-11(1:2)2015
4. André, P., Attiogbé, C., Mottu, J.M.: Combining techniques to verify service-
based components (Sep 2022), https://www.scitepress.org/Link.aspx?doi=10.5220/0006212106450656
[Online; accessed 26 Sep 2022]
5. Arbab, F.: Reo: A channel-based coordination model for component composition.
Mathematical Structures in Computer Science 14(3), 329–366 (2004). https://doi.
org/10.1017/S0960129504004153
6. Basu, A., Bozga, M., Sifakis, J.: Modeling heterogeneous real-time components
in BIP. In: 4th IEEE Int. Conf. on Software Engineering and Formal Methods
(SEFM06). pp. 3–12 (Sep 2006). https://doi.org/10.1109/SEFM.2006.27, invited
talk
7. Bliudze, S., van den Bos, P., Huisman, M., Rubbens, R., Safina, L.: Artefact of:
JavaBIP meets VerCors: Towards the Safety of Concurrent Software Systems in
Java (2023). https://doi.org/10.4121/21763274
8. Bliudze, S., Cimatti, A., Jaber, M., Mover, S., Roveri, M., Saab, W., Wang,
Q.: Formal verification of infinite-state BIP models. In: Finkbeiner, B., Pu, G.,
Zhang, L. (eds.) Automated Technology for Verification and Analysis. pp. 326–
343. Springer International Publishing, Cham (2015). https://doi.org/10.1007/
978-3-319-24953-7_25
9. Bliudze, S., Katsaros, P., Bensalem, S., Wirsing, M.: On methods and tools for
rigorous system design. Int. J. Softw. Tools Technol. Transf. 23(5), 679–684 (2021).
https://doi.org/10.1007/s10009-021-00632-0
10. Bliudze, S., Mavridou, A., Szymanek, R., Zolotukhina, A.: Exogenous coordination
of concurrent software components with JavaBIP. Software: Practice and Experi-
ence 47(11), 1801–1836 (Apr 2017). https://doi.org/10.1002/spe.2495
11. Blom, S., Darabi, S., Huisman, M., Oortwijn, W.: The VerCors tool set: Verification
of parallel and concurrent software. In: IFM. Lecture Notes in Computer Science,
vol. 10510, pp. 102–110. Springer (2017), https://link.springer.com/chapter/10.
1007/978-3-319-66845-1_7
12. Bodden, E., Lam, P., Hendren, L.: Partially Evaluating Finite-State Runtime Mon-
itors Ahead of Time. ACM Trans. Program. Lang. Syst. 34(2), 1–52 (Jun 2012).
https://doi.org/10.1145/2220365.2220366
13. Mavridou, A., Laszka, A., Stachtiari, E., Dubey, A.: VeriSolid: Correct-by-design
smart contracts for Ethereum. In: Financial Cryptography and Data Security,
pp. 446–465. Springer, Cham, Switzerland (Sep 2019). https://doi.org/10.1007/
978-3-030-32101-7_27
14. Nimmer, J.W., Ernst, M.D.: Static verification of dynamically detected pro-
gram invariants: Integrating Daikon and ESC/Java. Electronic Notes in Theoret-
ical Computer Science 55(2), 255–276 (2001). https://doi.org/https://doi.org/10.
150 S. Bliudze et al.
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Model-based Player Experience Testing with
Emotion Pattern Verification
1 Introduction
tests are designed to address issues that can lead to degrading the human perfor-
mance during the game-play [10], whereas PX can target the emotional experi-
ence of a player which eventually influences the success or failure of a game in the
market [1]. This has led to the emergence of Games User Research (GUR) as an
approach to gain insights into PX which is tied to human-computer-interaction,
human factors, psychology and game development [14].
Validating a game design relies either on trained PX testers or on acquiring
information directly from players with methods such as interviews, questionnaires
and physiological measurements [40,37], which are labour-intensive, costly and
do not necessarily represent all user profiles and their emotions. Moreover, such
tests need to be repeated after every design change to ensure the PX is still
aligned with the design intention. Thus, GUR researchers have turned to de-
veloping AI-based PX testing methods. In particular, agent-based testing has
attracted attention because it opens new avenues for automated PX testing by
imitating players while keeping the cost of labour and of re-applying tests low.
There exist appraisal theories of emotions that address the elicitation of
emotions and their impact on emotional responses. They indicate that emotions
are elicited by the appraisal evaluation of events and situations [33]. The Ortony,
Clore, and Collins (OCC) theory [43] is one of several widely known appraisal
theories in cognitive science and is also commonly used in modeling emotional
agents [15,9,47,42,12]. Despite the influence of emotions on forming the experience
of players [39,13], this approach has not been employed in PX testing [6].
In our automated PX testing approach, we opt for a model-driven approach
to modeling emotions. Theoretical models of human cognition, used for decades
in cognitive psychology, provide a more coherent outlook on cognitive processes.
In contrast, applying a data-driven (machine learning) approach is greatly con-
strained by the availability of experimental data. Inferring a cognitive process
from limited experimental data is an ill-posed problem [5] because such a process
is subjective. Individuals can evaluate the same event differently due to age, gen-
der, education, cultural traits, etc. For example, when a romantic relationship
ends, some individuals feel sadness, others anger, and some even experience relief
[48]. However, according to appraisal theories of emotions, common patterns can
be found in the emergence of the same emotion. These patterns are given as a
structure of emotions by the aforementioned OCC theory. Thus, a model-driven
approach derived from a well-grounded theory of emotions such as OCC is sensible
when access to sufficient data is not possible.
In this paper, we present an agent-based player experience testing framework
that allows emotional requirements to be expressed as patterns and verified on
executed test suites generated by a model-based testing (MBT) approach. The
framework uses a computational model of emotions based on OCC theory [21]
to generate the emotional experience of agent players. Compared to [21], this
paper contributes the expression of emotion pattern requirements and the
generation of covering test suites for verifying patterns on a game level. We show
that such a framework allows game designers to verify emotion pattern requirements
and gain insight into the emotions the game induces, over time and over space.
Revealing such patterns requires a test suite that can trigger enough diversity
in the game behavior and, as a result, in the evoked emotions. This is where the
model-based testing approach, with its fast test-suite generation, can contribute.
In this paper, we employ an extended finite state machine (EFSM) model [18]
that captures all possible game-play behaviors, serving as a subset of human
behaviors at some level of abstraction. We use a search-based (SB) algorithm
for testing, more precisely the multi-objective search algorithm (MOSA) [44], and
linear temporal logic (LTL) model checking (MC) [8,11] as two model-based
test suite generation techniques, and we investigate the ability of each generated
test suite to reveal variations of emotion, e.g., the absence of an emotion in a
corridor. We apply a test-case distance metric to measure the test suites' diversity
and the distance between the SB and MC test suites. Results on our 3D game
case study show that SB and MC, due to their different test generation techniques,
produce distinctive test cases which identify different variations of emotions
over space and time that cannot be identified by just one of the test suites.
The remainder of this paper is organized as follows. Section 2 explains the
computational model of emotions and the model-based testing approach. Section
3 presents the PX framework architecture. Section 4 describes our methodology
for expressing PX requirements using emotion patterns, test-suite diversity mea-
surement, and the overall PX testing algorithm. Section 5 presents an exploratory
case study to demonstrate emotion pattern verification using model-based
testing, along with an investigation of the results of the SB and MC test suite
generation techniques. Mutation testing is also addressed in this section to evaluate
the strength of the proposed approach. Section 6 gives an overview of related
work. Finally, Section 7 proposes future work and concludes the paper.
2 Preliminaries
This section summarizes the OCC computational model of emotions [21] and
model-based testing as the key components of our PX framework.
environmental dynamism such as hazards. The event tick represents the passage of time.
The emotion model of an agent is defined as a 7-tuple transition system
M = ⟨G, S, s0, E, δ, Des, Thres⟩, where:
– G is a set of goals that the agent wants to achieve; each is a pair ⟨id, x⟩ of
a unique id and a significance degree x.
– S is the set of M's possible states; each is a pair ⟨K, Emo⟩:
• K is a set of propositions the agent believes to be true. It includes, for
each goal g, a proposition status(g, p) indicating if g has been achieved
or failed, and a proposition P(g, v) with v ∈ [0..1], stating the agent's
current belief on the likelihood of reaching this goal.
• Emo is the agent's emotional state, represented by a set of active emo-
tions; each is a tuple ⟨ety, w, g, t0⟩, where ety is the emotion type, w is the
intensity of the emotion with respect to a goal g, and t0 is the triggering time.
– s0 ∈ S is the agent's initial state.
– E specifies the types of events the agent can experience.
– δ is M's state transition function, to be elaborated later.
– Des is an appraisal function; Des(K, e, g) expresses the desirability, as a
numeric value, of an event e with respect to achieving a goal g, judged when
the agent believes K. OCC theory has more appraisal factors [43], but only
desirability matters for the aforementioned types of emotion [21].
– Thres gives the thresholds for activating every emotion type.
The transition function δ updates the agent's state ⟨K, Emo⟩, triggered
by an incoming event e ∈ E, as follows:

$$\langle K, Emo \rangle \xrightarrow{\;e\;} \langle K', \underbrace{newEmo(K, e, G) \oplus decayed(Emo)}_{\text{updated emotion } Emo'} \rangle$$
where w is the intensity of the emotion ety towards the goal g ∈ G and t is the
current system time. Upon an incoming event, the above function is called to check
for the occurrence of new emotions as well as the re-stimulation of existing emotions in Emo.
where w0 is the initial intensity of ety for the goal g at time t0; this is stored in
emhistory. decay_ety is a decay function, defined as an inverse exponential
function over the peak intensity w0 reached at time t0.
[2]. Transitions t in M take the form $n \xrightarrow{l/g/\alpha} n'$, where n and n′ are the source and
destination abstract states of the transition, l is a label, g is a predicate over V
that guards the transition, and α is a function that updates the variables in V.
Figure 1 shows an example of a small level in a game called Lab Recruits
(https://github.com/iv4xr-project/labrecruits), which is also the case study of this
paper. A Lab Recruits level is a maze with a set of rooms and interactive objects,
such as doors and buttons. A level might also contain fire hazards. The player's
goal is to reach the object gf0. Access to it is guarded by door3, so reaching it
involves opening the door using a button, which in turn is in a different room,
guarded by another door, and so on. Ferdous et al. [18] employ combined
search-based and model-based testing for functional bug detection in this game
using an EFSM model (Figure 1). In the model, all interactable objects are EFSM
states: doors (3), buttons (4), and the goal object gf0. For each door_i, the states
d_i p and d_i m are introduced to model the two sides of the door. The model
has three context variables representing the state of each door (open/closed). A
solid-edged transition in the model is unguarded, modelling the agent's trip from
one object to another without walking through a door. A dotted transition models
traversing through a door when the door is open. A dashed self-loop transition
models pressing a button; it toggles the status of the doors connected to the
pressed button. Notice that the model captures the logical behavior of the game.
It abstracts away the physical shape of the level, which would otherwise make
the model more complicated and prone to changes during development. Given
such a model, abstract test cases are constructed as sequences of consecutive
transitions in the model. This paper extends the EFSM model-based testing
approach [18] to player experience testing.
Fig. 1: A game level in the Lab Recruits game and its EFSM model [18].
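To make the EFSM ingredients concrete, the following minimal Java sketch encodes guarded transitions over context variables; the record layout, the state names (d1m, d1p, b1) and the variable door1 are illustrative assumptions in the spirit of Figure 1, not EvoMBT's actual model encoding.

```java
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Minimal sketch of the EFSM elements described above: labels, guards,
 *  and updates over context variables. All names here are ours. */
record EfsmTransition(
        String source, String target, String label,
        Predicate<Map<String, Boolean>> guard,     // e.g., "door1 is open"
        Consumer<Map<String, Boolean>> update) {   // e.g., toggle a button's doors

    /** A dotted transition: traverse door1 only when it is open. */
    static EfsmTransition traverseDoor1() {
        return new EfsmTransition("d1m", "d1p", "traverse",
                ctx -> ctx.getOrDefault("door1", false),
                ctx -> { });                        // traversal changes no variable
    }

    /** A dashed self-loop: pressing button b1 toggles door1. */
    static EfsmTransition pressButton1() {
        return new EfsmTransition("b1", "b1", "press",
                ctx -> true,                        // unguarded
                ctx -> ctx.merge("door1", true, (old, x) -> !old));
    }
}
```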
3 PX Testing Framework
The proposed automated PX testing framework aims to aid game designers in
the PX assessment of their games by providing information on the time and place
of emerging emotions and their patterns, which ultimately determine the
general experience of the player. E.g., if these patterns do not fulfill the design
intentions, game properties can be altered and the testing process can be repeated.
Figure 2 shows the general architecture of the framework. There are four main
components: a Model-based Testing component for generating tests, a Model
of Emotions component that implements the computational model of emotions from
Section 2.1, an Aplib basic test agent [45] for controlling the in-game player-
character, and the PX Testing Tool as an interface between a game designer and
the framework. The designer needs to provide these inputs (1 in Figure 2):
– An EFSM that abstractly models the functional behavior of the game.
– A selection of game events that impact the player's emotions (e.g.,
defeating an enemy, acquiring gold).
– Characteristics that the designer wants the agent to have in order to resemble
a certain type of player, such as: the player's goals and their priorities, the
player's initial mood and beliefs before playing the game, and the desirability
of incoming events for the player. E.g., a player might experience a high level
of positive emotions on defeating an enemy, while for another player who
prefers to avoid conflicts, acquiring a gold bar could be more desirable.
Given the EFSM model, the Model-based Testing component (2 in Figure
2) generates a test suite consisting of abstract test cases to be executed on the
game under test (GUT). The test generation approach is explained in Section
4.1. Due to the abstraction of the model, emotion traces cannot be obtained
from pure on-model executions. They require the execution of the test cases on
the GUT. An adapter is needed to convert the abstract test cases into actual
instructions for the GUT. The Aplib basic test agent performs this conversion.
Attaching the Model of Emotions to the basic test agent creates an emotional
test agent (3 in Figure 2), which is able to simulate emotions based on incoming
events. Via a plugin, the emotional test agent is connected to the GUT. Each
test case of the test suite is then given to the agent for execution. The agent
computes its emotional state upon observing events and records it in a trace file.
Finally, when the whole test suite has been executed, the PX Testing Tool analyzes
the traces to verify the given emotional requirements and provides heat-maps and
timeline graphs of emotions for the given level (4 in Figure 2).
4 Methodology
This section describes the framework's model-based test generation techniques
and our approach to measuring a test suite's diversity. Then, our approach for
expressing emotion pattern requirements and verifying them is explained.
Search-Based Test Generation Search-based testing (SBT) formulates a testing
problem as an optimization problem in which a search algorithm is used to find
an optimized solution, in the form of a test suite, that satisfies a given test
adequacy criterion encoded as a fitness function [36]. Meta-heuristic algorithms
such as genetic algorithms [23] and tabu search [22,26] are commonly used for this. Our
framework uses the open-source library EvoMBT [18], which comes with several
state-of-the-art search algorithms, e.g., MOSA [44]. We use it to produce a test
suite satisfying, e.g., the criterion in Def. 1 to represent players' potential behavior
in the game; the test cases are then executed to simulate the players' emotional experience.
To apply MOSA, an individual encoding, search operators and a fitness func-
tion need to be defined. An individual I is represented as a sequence of EFSM
transitions. Standard crossover and mutation are used as the search operators.
MOSA treats each coverage target as an independent optimisation objective. For
each transition t, the fitness function measures how much of an individual I is
actually executable on the model and how close it is to covering t, as in Def. 1.
MOSA then evolves a population that minimizes the distances to all the targets.
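The following sketch illustrates one way such a per-transition objective could look; the simulator interface and the exact penalty shape are our assumptions for illustration, not EvoMBT's actual API.

```java
import java.util.List;

/** A minimal sketch of a transition-coverage objective in the spirit of the
 *  MOSA setup described above; interfaces and penalty shape are assumptions. */
final class TransitionFitness {

    interface Transition { }

    /** Hypothetical EFSM simulator: fires transitions and measures distances. */
    interface EfsmSimulator {
        void reset();
        boolean canFire(Transition t);  // guard satisfiable in the current state?
        void fire(Transition t);
        int distanceTo(Transition t);   // remaining hops to the target's source
    }

    /** Lower is better; 0 means the individual covers the target transition. */
    static double fitness(List<Transition> individual, Transition target, EfsmSimulator sim) {
        if (individual.isEmpty()) return Double.MAX_VALUE;
        sim.reset();
        int executable = 0;
        for (Transition tr : individual) {
            if (!sim.canFire(tr)) break;        // guard fails: rest of the sequence is dead
            sim.fire(tr);
            executable++;
            if (tr.equals(target)) return 0.0;  // target covered
        }
        // Combine how much of I is executable with how close it gets to the target.
        double nonExecutable = 1.0 - (double) executable / individual.size();
        return nonExecutable + sim.distanceTo(target);
    }
}
```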
LTL Model Checking Test Generation Model checking is the second tech-
nique we use for test generation. It was originally introduced for automated
software verification: it takes a finite-state model of a program as input and
checks whether given specifications hold in the model [8]. Such a model checker
can also generate tests: for each transition t from a state n1 to a state n2, we
formulate the property
$$\varphi_t = \neg g \,\mathbf{U}\, (n_1 \wedge \mathbf{X}\,(n_2 \wedge \neg g \,\mathbf{U}\, g))$$
where g is the goal state, like gf0 in Figure 1. The model checking algorithm
checks whether ¬φt is valid on the EFSM model using depth-first traversal [29].
If it is not, a counterexample is produced that visits t and terminates in g. An
extra iteration is added to find the shortest covering test case.
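As a worked instance (the state names b0 and d0p below are hypothetical, chosen in the spirit of Figure 1 rather than taken from the actual model): to cover a transition t from b0 to d0p, the trap formula instantiates to

$$\varphi_t = \neg\, gf0 \,\mathbf{U}\, \big(b0 \wedge \mathbf{X}\,(d0p \wedge \neg\, gf0 \,\mathbf{U}\, gf0)\big)$$

and any counterexample to ¬φt is an execution that reaches b0, immediately moves to d0p (thus covering t), and eventually reaches gf0, i.e., exactly a covering test case.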
Diversity is an approach to measure the degree of variety of the control and data
flow in software or a game [41]. We use it to measure the diversity of the
test suites obtained from the generators in Section 4.1. A test suite's diversity
degree is the average distance between every pair of distinct test cases, which
can be measured with, e.g., the Jaro distance metric. For a test case tc, let tc and
|tc| be its string representation and its length, respectively. The Jaro distance
between two test cases tc_i and tc_j is calculated as follows:
$$Dis_{Jaro}(tc_i, tc_j) = \begin{cases} 1 & \text{if } m = 0 \\ 1 - \frac{1}{3}\left(\frac{m}{|tc_i|} + \frac{m}{|tc_j|} + \frac{m - t}{m}\right) & \text{if } m \neq 0 \end{cases} \quad (3)$$
where m is the number of matching symbols in the two strings whose positional
distance is less than ⌊|tc_i|/2⌋, assuming tc_i is the longer string, and t is half of
the number of transpositions. Then, the diversity of a test suite TS is the sum of the
distances between every pair of distinct test cases, divided by the number of such pairs:
$$Div_{avg}(TS) = \frac{\sum_{i=1}^{|TS|} \sum_{j=i+1}^{|TS|} Dis_{Jaro}(tc_i, tc_j)}{|TS| \cdot (|TS| - 1)/2} \quad (4)$$
This metric is used in Section 5 to measure the distance between the test suites
generated by the two approaches (Section 4.1) provided by our framework, along
with their complementary effects on revealing different emotion patterns.
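The following self-contained Java sketch implements Equations (3) and (4) as stated above, using the paper's matching window ⌊|tc_i|/2⌋ rather than the more common ⌊max/2⌋ − 1 variant; the class and method names are ours.

```java
import java.util.List;

/** Sketch of the Jaro-based diversity measure of Eqs. (3)-(4); names are
 *  illustrative, not taken from the paper's tooling. */
final class Diversity {

    /** Jaro distance between two test cases given as strings (Eq. 3). */
    static double jaro(String a, String b) {
        if (a.length() < b.length()) { String tmp = a; a = b; b = tmp; } // a is longer
        int window = a.length() / 2;              // matching window: floor(|tc_i| / 2)
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int m = 0;                                // number of matching symbols
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window), hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true; bMatched[j] = true; m++; break;
                }
            }
        }
        if (m == 0) return 1.0;                   // no matching symbols: maximal distance
        int halfTranspositions = 0, j = 0;        // compare the matched subsequences
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[j]) j++;
            if (a.charAt(i) != b.charAt(j)) halfTranspositions++;
            j++;
        }
        double t = halfTranspositions / 2.0;      // t = half the transpositions
        double sim = (((double) m / a.length()) + ((double) m / b.length()) + ((m - t) / m)) / 3.0;
        return 1.0 - sim;
    }

    /** Average pairwise diversity of a test suite (Eq. 4). */
    static double divAvg(List<String> suite) {
        double sum = 0; int n = suite.size();
        for (int i = 0; i < n; i++)
            for (int k = i + 1; k < n; k++)
                sum += jaro(suite.get(i), suite.get(k));
        return sum / (n * (n - 1) / 2.0);
    }
}
```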
5 Case Study
This section presents an exploratory case study conducted to investigate the use
of our model-based PX testing framework (artifact: https://doi.org/10.5281/zenodo.7506758)
for verifying emotion requirements in a game level, and to investigate the differences
between the search-based and the model-checking-generated test suites in revealing
emotion patterns. Finally, we run mutation testing to evaluate the strength of our framework.
Figure 3 shows a test level called Wave-the-flag in Lab Recruits, a configurable
3D game designed for AI researchers to define their own testing problems.
It is a medium-sized level, consisting of a 1182 m² navigable virtual floor,
8 rooms, 12 buttons, and 11 doors. Its EFSM model consists of 35 states
and 159 transitions. The player starts in the room marked green at the top,
and must find a 'goal flag' gf0, marked red in the bottom room, to finish the
level. Doors and buttons form a puzzle in the game. A human player needs to
discover the connections between buttons and doors to open a path through
the maze and reach the goal flag in a timely manner.

Fig. 3: Wave-the-flag level.

The player can earn points by opening doors and loses health when passing fire
flames. For the test agent, the latter is also observable as an event called Ouch. If the
player runs out of health, it loses the game. The player also has no prior knowledge
of the positions of doors, buttons and the goal flag, nor of which buttons open
which doors. Since there are multiple paths to the target, depending on the path
the player chooses to explore, it might reach the goal without health loss, at one
end of the spectrum, or end up dead at the other. The EFSM model (not shown)
of the Wave-the-flag level is constructed similarly to the running example in Section
2.2. To add excitement, Wave-the-flag also contains fire flames. However, these
flames are not included in the EFSM model, because the placement and amount
of these objects are expected to change frequently during development. Keeping
this information in the EFSM model would force the designer to constantly
update the model after each change to the flames. Thus, similar to the running
example, the EFSM model contains only doors, buttons and goal flags.
In addition to the EFSM model, we need to characterize a player to do
PX testing (1 in Figure 2). Table 1 shows the basic characteristics of a player,
defined with a set of parameters, used to configure the emotion model of the agent
before execution. The level designer determines the values of these parameters.
After executing the model, we asked the designer to check the plausibility of
these values by inspecting the resulting emotional heat-maps: the designer checked
a randomly selected number of test cases with their generated emotional heat-maps
to confirm that the occurrences of emotions are reasonable. Thus, the values used
in the following experiment were confirmed reasonable by the designer. Moreover, the
likelihood of reaching the goal gf0 is set to 0.5 in the initial state to model a
player who initially feels unbiased towards the prospect of finishing the level.
Thus, the agent feels an equal level w of hope and fear at the beginning.
Table 1: Configuration of the player characterization. G is the agent's goal set; it has one
goal for this level, namely reaching the goal-flag gf0. s0 is the emotion model's initial
state. A set of relevant events (E) needs to be defined by the designer: the DoorOpen
event, triggered when a new door is opened, is perceived as increasing the likelihood
of reaching gf0 by v1 in the model; the Ouch event, which signals a fire burn, is perceived
as decreasing the likelihood of reaching gf0 by v2; the GoalInSight event, triggered the
first time the agent observes the goal gf0 in its vicinity, is modelled as making the
agent believe that the likelihood of reaching the goal becomes certain (1); and finally the
GoalAccomplished event is triggered when the goal gf0 is accomplished. Des reflects
the desirability/undesirability of each event with respect to the goal, and Thres gives the
emotions' activation thresholds. x, vi, and yi are constants determined by the designer.
Parameter  Value
G          g = ⟨gf0, x⟩ ∈ G
s0         likelihood(gf0, 0.5) ∈ K0,
           Emo0 = {⟨Hope, gf0, w, 0⟩, ⟨Fear, gf0, w, 0⟩}
E          E = {DoorOpen, Ouch, GoalInSight, GoalAccomplished};
           on DoorOpen: likelihood(gf0, +v1);
           on fire burn (Ouch): likelihood(gf0, −v2);
           on GoalInSight: likelihood(gf0, 1)
Des        Des(K, DoorOpen, gf0) = +y1,
           Des(K, Ouch, gf0) = −y2,
           Des(K, GoalInSight, gf0) = +y3
Thres      0
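For concreteness, the following sketch instantiates Table 1 in Java; the class layout and the concrete constants standing in for x, w, the v_i and the y_i are our assumptions, since the paper leaves their values to the designer.

```java
import java.util.Map;

/** Sketch of the player characterization of Table 1; names and constants
 *  are illustrative, not the paper's code. */
final class PlayerCharacterization {
    record Emotion(String type, String goal, double intensity, double time) {}

    static final String GOAL = "gf0";
    static final double X = 1.0;                 // significance degree x (assumed)
    static final double W = 0.5;                 // initial hope/fear intensity w (assumed)

    // Initial state s0: unbiased likelihood, equal hope and fear.
    static final double INITIAL_LIKELIHOOD = 0.5;
    static final Emotion[] EMO0 = {
            new Emotion("Hope", GOAL, W, 0), new Emotion("Fear", GOAL, W, 0) };

    // Effect of each event on the believed likelihood of reaching gf0.
    static final Map<String, Double> LIKELIHOOD_DELTA = Map.of(
            "DoorOpen", +0.1,                    // +v1 (assumed value)
            "Ouch", -0.1);                       // -v2 (assumed value)
    // GoalInSight sets the likelihood to 1 rather than shifting it.

    // Desirability of events w.r.t. gf0 (the yi constants, assumed values).
    static final Map<String, Double> DES = Map.of(
            "DoorOpen", +0.5, "Ouch", -0.5, "GoalInSight", +1.0);

    static final double THRESHOLD = 0.0;         // activation threshold for all emotions
}
```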
on average; its longest test case has on average 74.25 transitions. Finally, the last
row in Table 2 indicates the difference between the SB and MC test suites. The
distance between the two test suites is measured for every generated TS_SB using
Equation 5, which yields an average distance of 0.214 (σ = 0.024) between the
test cases of the two suites. Later, we investigate whether such a difference can
lead to differences in emotion patterns at the execution level.
Table 2: Characteristics of the LTL-model-checking-based and search-based test suites with
respect to the same coverage criterion.

Test suite              Size    Div_avg(TS_i)   Shortest tc   Longest tc
TS_MC                   74      0.113           23            45
TS_SB (avg)             54.6    0.192           17.7          74.25
Div_avg(TS_MC, TS_SB)   128.6   0.214
emotional intensity in heat-maps. The heat-maps of hope, joy and satisfaction
for these test suites show quite similar spatial information (only hope and joy
are shown in Figure 4). However, TS_MC generally shows a higher level of hope
during game-play (Figures 4a and 4b). So, if the designer verifies the level with
respect to the presence and spatial distribution of intensified hope throughout the
level, the test cases produced by TS_MC can expose these attributes better. This
can be explained by the model checker being set up to find the shortest test cases;
some of them can then open the next door sooner, raising hope before its intensity
decays too much. The maps also show a difference in the spatial coverage of
TS_SB and TS_MC (marked green in Figures 4a and 4b). The transition that
traverses the corridor is present in TS_MC, but when the corresponding abstract
test case is transformed into an executable test case for the Aplib test agent,
the conversion also incorporates optimization. The agent thus finds a more
optimized execution by skipping the transition that actually passes through the
corridor towards the room if the next transition traverses back along the same
corridor. The corridor is, however, covered by TS_SB.
Fig. 4: Heat-map visualization of positive emotions for SBT and MC test suites.
The most striking differences between TS_SB and TS_MC are revealed in their
negative emotions' heat-maps (Figure 5). Most places that are marked black
as distress-free by the executed TS_MC (Figure 5a) are actually highly distressful
positions for some test cases of TS_SB. The presence of distress might be the
intended player experience, whereas its absence in certain places might actually
be undesirable. Upon closer inspection of individual test cases, it turns out that
the test cases of TS_SB that pass, e.g., the red regions in Figures 5a and 5b always
show distress in the marked corridor, whereas one test case in TS_MC manages
to find a 'sneak route' that passes the corridor without distress and finishes the
level successfully. Thus, if the designer is looking for the possibility of an absence
of distress in the sneak corridor, inspecting TS_SB alone would not suffice. The
heat-maps of disappointment reveal another difference. While TS_MC only finds one
location where the agent dies and feels disappointed, TS_SB manages to find 3
more locations in the level with the disappointment state (Figure 5c).
The main reason behind these differences is that a sequence of transitions, not
just a single transition, results in the agent experiencing an emotion.
Furthermore, emotion intensity has a residual behavior, which means a sequence
of transitions might result in an emotion that still remains in the agent's
emotional state after some time. Thus, providing state coverage or the
Fig. 5: Heat-map visualization of negative emotions for SBT and MC test suites.
at least once throughout the execution, but the agent never reaches the goal with sat-
isfaction. Sat patterns that fail to be witnessed, or UnSat patterns that are
witnessed, assist the designer in altering their game level and running the agent
through the level again. For example, here, the failure of Sat(J¬S) is an indication
that the designer needs to put some challenging objects, like fire or enemies, in the
vicinity of the goal gf0.
Table 3: Emotion pattern checks with TS_MC and TS_SB. H = hope, F = fear, J = joy,
D = distress, S = satisfaction, P = disappointment, and ¬X = absence of emotion X.

Emotion patterns             TS_MC            TS_SB
Sat(¬DS)                     ✓                ✓
UnSat(¬FS)                   ✓                ✓
Sat(J¬S)                     ✗                ✗
UnSat(JD)                    ✓                ✓
Sat(JFS)                     ✓                ✓
Sat(DHP)                     ✓                ✓
Sat(DHS)                     ✓                ✓
Sat(DH¬DS)                   ✗                ✓
Sat(FDHFJ)                   ✗                ✓
Sat(HFDDDHFJ)                ✗                ✓
Sat(FDDHFP)                  ✗                ✗
Emotion patterns, length=2   101/144 (70.2%)  101/144 (70.2%)
Emotion patterns, length=3   88/150 (58.6%)   88/150 (58.6%)
Emotion patterns, length=4   71/164 (43.2%)   72/164 (43.9%)
Emotion patterns, length=5   60/177 (33.8%)   61/177 (34.4%)
Table 3 also shows similar ratios of witnessed pairwise emotion combinations
over the various Sat(p) patterns p of length 2–5 for TS_SB and TS_MC,
indicating that both test suites perform well in detecting Sat-type emotion
patterns. However, three of the patterns in Table 3 are covered by TS_SB but
missed by TS_MC. Thus, the two suites are complementary, which makes it
reasonable to use both for verifying emotion pattern requirements.
Table 4: Mutation operators, the generated mutants (operator and position), and the
emotion pattern results (✓ = witnessed, ✗ = not witnessed) per level variant.

Code    Description
RF      Remove fire
RW2WF   Relocate fire between walls
RMRF    Relocate fire in middle of a room
AMRF    Add fire in middle of a room
AW2WF   Add fire between walls

Mutants: AW2WF (72, 34); RW2WF (24, 51); RW2WF (48, 17); RW2WF (0, 34);
RW2WF (48, 0); AMRF (24, 51); RMRF (24, 0); RF (48, 17); RF (0, 51); RF (0, 51).

Emotion patterns   Results
Sat(¬DS)           ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
UnSat(¬FS)         ✓ ✗ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Sat(J¬S)           ✓ ✓ ✓ ✗ ✓ ✓ ✗ ✗ ✓ ✗ ✗
UnSat(JD)          ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✗ ✓ ✓
Sat(JFS)           ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Sat(DHP)           ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Sat(DHS)           ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Sat(DH¬DS)         ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Sat(FDHFJ)         ✓ ✗ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Sat(HFDDDHFJ)      ✓ ✗ ✓ ✓ ✓ ✓ ✓ ✗ ✓ ✓ ✓
Sat(FDDHFP)        ✓ ✗ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✓
6 Related Work
A number of studies have been conducted on automated play testing to reduce
the cost of repetitive and labor-intensive functional testing tasks in video games
[35,54]. In particular, agent-based testing has been a subject of recent research,
letting agents play and explore the game space on behalf of human players for
testing purposes. Ariyurek et al. [7] introduce Reinforcement Learning (RL) and
Monte Carlo Tree Search (MCTS) agents to detect bugs in video games automatically.
Stahlke et al. [51] present a basis for a framework to model players' memory
and goal-oriented decision-making, simulating human navigational behavior to
identify level design issues. The framework creates an AI agent that uses a
path-finding heuristic to navigate a level, optimized by given player characteristics
such as level of experience and play-style. Zhao et al. [55] intend to
create agents with human-like behavior for balancing games based on skill and
play-styles. These parameters are measured using introduced metrics that help
train the agents in four different case studies to test game balance and to
imitate players with different play-styles. Gordillo et al. [24] address the game
state coverage problem in play-testing by introducing a curiosity-driven rein-
forcement learning agent for a 3D game. The test agent utilizes proximal policy
optimization (PPO) with a curiosity factor reflected in the RL reward function
via the visit frequency of a game state. Pushing the agent towards exploratory
behaviour provides a better chance of exploring unseen states for bugs.
Among game model-based testing approaches, Iftikhar et al. [30] apply model-based
testing to the Mario Brothers game for functional testing. The study uses a UML
state machine as a game model for test case generation, which manages to reveal
faults. Ferdous et al. [18] employ combined search-based and model-based testing
for automated play-testing using an EFSM. Search algorithms are compared with
regard to model coverage and bug detection. Note that while an EFSM provides
paths through a game, it cannot reveal the experience of a player who navigates a path.
Despite some research on modeling human players and their behavior in agents
for automated functional play testing, there is little research on the automation of
PX evaluation. Holmgård et al. [28] propose to create procedural personas, or player
characteristics, for test agents to help game designers develop game content and
desirable level designs for different players. The research proposes to create per-
sonas in test agents using MCTS with evolutionary computation for node selec-
tion. The results on the MiniDungeons 2 game show how different personas bring
about different behavior in response to game content, which can be seen as differ-
ent play-styles. Lee et al. [34] investigate a data-driven cognitive model of human
performance in moving-target acquisition to estimate game difficulty for players
with different skill levels. There is limited research on emotion prediction and its
use for the automation of PX evaluation. Gholizadeh Ansari et al. [21] introduce
an emotional agent using a formal model of OCC emotions and propose the
potential use of such an agent for PX assessment. However, the approach lacks
automated path planning and reasoning, and hence it cannot perform automated
game-play. Automatic coverage of game states and the collection of all emerging
emotions are thus not supported; these are addressed in this paper.
References
1. Agarwal, A., Meyer, A.: Beyond usability: evaluating emotional response as an
integral part of the user experience. In: CHI’09 Extended Abstracts on Human
Factors in Computing Systems, pp. 2919–2930. ACM New York, NY, USA (2009)
2. Alagar, V., Periyasamy, K.: Extended finite state machine. In: Specification of
software systems, pp. 105–128. Springer (2011)
3. Alves, R., Valente, P., Nunes, N.J.: The state of user experience evaluation practice.
In: Proceedings of the 8th Nordic Conference on Human-Computer Interaction:
Fun, Fast, Foundational. pp. 93–102 (2014)
4. Ammann, P.E., Black, P.E., Majurski, W.: Using model checking to generate tests
from specifications. In: Proceedings second international conference on formal en-
gineering methods (Cat. No. 98EX241). pp. 46–54. IEEE (1998)
5. Anderson, J.R.: Arguments concerning representations for mental imagery. Psy-
chological review 85(4), 249 (1978)
6. Ansari, S.G.: Toward automated assessment of user experience in extended reality.
In: 2020 IEEE 13th international conference on software testing, validation and
verification (ICST). pp. 430–432. IEEE (2020)
7. Ariyurek, S., Betin-Can, A., Surer, E.: Automated video game testing using syn-
thetic and humanlike agents. IEEE Transactions on Games 13(1), 50–67 (2019)
8. Baier, C., Katoen, J.P.: Principles of model checking. MIT press (2008)
9. Bartneck, C.: Integrating the OCC model of emotions in embodied characters. In:
Proceedings of the Workshop on Virtual Conversational Characters: Applications,
Methods, and Research Challenges (2002)
10. Bevan, N.: What is the difference between the purpose of usability and user experi-
ence evaluation methods. In: Proceedings of the Workshop UXEM. vol. 9, pp. 1–4.
Citeseer (2009)
11. Callahan, J., Schneider, F., Easterbrook, S., et al.: Automated software testing
using model-checking. In: Proceedings 1996 SPIN workshop. vol. 353. Citeseer
(1996)
12. Demeure, V., Niewiadomski, R., Pelachaud, C.: How is believability of a virtual
agent related to warmth, competence, personification, and embodiment? Presence
20(5), 431–448 (2011)
13. Desmet, P., Hekkert, P.: Framework of product experience. International journal
of design 1(1) (2007)
14. Drachen, A., Mirza-Babaei, P., Nacke, L.E.: Games user research. Oxford Univer-
sity Press (2018)
15. Elliott, C.D.: The affective reasoner: a process model of emotions in a multiagent
system. Ph.D. thesis, Northwestern University (1992)
16. Ellsworth, P.C., Smith, C.A.: From appraisal to emotion: Differences among un-
pleasant feelings. Motivation and emotion 12(3), 271–302 (1988)
17. Fang, X., Chan, S., Brzezinski, J., Nair, C.: Development of an instrument to
measure enjoyment of computer game play. INTL. Journal of human–computer
interaction 26(9), 868–886 (2010)
18. Ferdous, R., Kifetew, F., Prandi, D., Prasetya, I., Shirzadehhajimahmood, S., Susi,
A.: Search-based automated play testing of computer games: A model-based ap-
proach. In: International Symposium on Search Based Software Engineering. pp.
56–71. Springer (2021)
19. Fraser, G., Arcuri, A.: Evosuite: automatic test suite generation for object-oriented
software. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th
European conference on Foundations of software engineering. pp. 416–419 (2011)
170 S. G. Ansari et al.
20. Gargantini, A., Heitmeyer, C.: Using model checking to generate tests from re-
quirements specifications. In: Software Engineering—ESEC/FSE’99. pp. 146–162.
Springer (1999)
21. Gholizadeh Ansari, S., Prasetya, I.S.W.B., Dastani, M., Dignum, F., Keller, G.: An
appraisal transition system for event-driven emotions in agent-based player expe-
rience testing. In: Engineering Multi-Agent Systems: 9th International Workshop,
EMAS 2021, Virtual Event, May 3–4, 2021, Revised Selected Papers. pp. 156–174.
Springer Nature (2021)
22. Glover, F.: Tabu search—part i. ORSA Journal on computing 1(3), 190–206 (1989)
23. Goldberg, D.E.: Genetic algorithms. Pearson Education India (2006)
24. Gordillo, C., Bergdahl, J., Tollmar, K., Gisslén, L.: Improving playtesting coverage
via curiosity driven reinforcement learning agents. arXiv preprint arXiv:2103.13798
(2021)
25. Guckelsberger, C., Salge, C., Gow, J., Cairns, P.: Predicting player experience
without the player. an exploratory study. In: Proceedings of the Annual Symposium
on Computer-Human Interaction in Play. pp. 305–315 (2017)
26. Harman, M., Jones, B.F.: Search-based software engineering. Information and soft-
ware Technology 43(14), 833–839 (2001)
27. Herzig, A., Lorini, E., Perrussel, L., Xiao, Z.: BDI logics for BDI architectures: old
problems, new perspectives. KI-Künstliche Intelligenz 31(1) (2017)
28. Holmgård, C., Green, M.C., Liapis, A., Togelius, J.: Automated playtesting with
procedural personas through mcts with evolved heuristics. IEEE Transactions on
Games 11(4), 352–362 (2018)
29. Holzmann, G.J.: The model checker SPIN. IEEE Transactions on Software Engineer-
ing 23(5), 279–295 (1997)
30. Iftikhar, S., Iqbal, M.Z., Khan, M.U., Mahmood, W.: An automated model based
testing approach for platform games. In: 2015 ACM/IEEE 18th International Con-
ference on Model Driven Engineering Languages and Systems (MODELS). pp.
426–435. IEEE (2015)
31. Jennett, C., Cox, A.L., Cairns, P., Dhoparee, S., Epps, A., Tijs, T., Walton, A.:
Measuring and defining the experience of immersion in games. International journal
of human-computer studies 66(9), 641–661 (2008)
32. Jia, Y., Harman, M.: An analysis and survey of the development of mutation
testing. IEEE transactions on software engineering 37(5), 649–678 (2010)
33. Lazarus, R.S., Folkman, S.: Stress, appraisal, and coping. Springer publishing com-
pany (1984)
34. Lee, I., Kim, H., Lee, B.: Automated playtesting with a cognitive model of senso-
rimotor coordination. In: Proceedings of the 29th ACM International Conference
on Multimedia. pp. 4920–4929 (2021)
35. Lewis, C., Whitehead, J., Wardrip-Fruin, N.: What went wrong: a taxonomy of
video game bugs. In: Proceedings of the fifth international conference on the foun-
dations of digital games. pp. 108–115 (2010)
36. McMinn, P.: Search-based software test data generation: a survey. Software testing,
Verification and reliability 14(2), 105–156 (2004)
37. Mirza-Babaei, P., Nacke, L.E., Gregory, J., Collins, N., Fitzpatrick, G.: How does
it play better? exploring user testing and biometric storyboards in games user
research. In: Proceedings of the SIGCHI conference on human factors in computing
systems. pp. 1499–1508 (2013)
38. Myers, G.J., Sandler, C., Badgett, T.: The art of software testing. John Wiley &
Sons (2011)
Model-based Player Experience Testing with Emotion Pattern Verification 171
39. Nacke, L., Lindley, C.A.: Flow and immersion in first-person shooters: measuring
the player’s gameplay experience. In: Proceedings of the 2008 conference on future
play: Research, play, share. pp. 81–88 (2008)
40. Nacke, L.E.: Games user research and physiological game evaluation. In: Game
user experience evaluation, pp. 63–86. Springer (2015)
41. Nikolik, B.: Test diversity. Information and Software Technology 48(11), 1083–1094
(2006)
42. Ochs, M., Pelachaud, C., Sadek, D.: An empathic virtual dialog agent to improve
human-machine interaction. In: Proceedings of the 7th international joint confer-
ence on Autonomous agents and multiagent systems-Volume 1. pp. 89–96 (2008)
43. Ortony, A., Clore, G., Collins, A.: The Cognitive Structure of Emotions. Cambridge
University Press, Cambridge, England (1988)
44. Panichella, A., Kifetew, F.M., Tonella, P.: Automated test case generation as a
many-objective optimisation problem with dynamic selection of the targets. IEEE
Transactions on Software Engineering 44(2), 122–158 (2017)
45. Prasetya, I., Dastani, M., Prada, R., Vos, T.E., Dignum, F., Kifetew, F.: Aplib:
Tactical agents for testing computer games. In: International Workshop on Engi-
neering Multi-Agent Systems. pp. 21–41. Springer (2020)
46. Procci, K., Singer, A.R., Levy, K.R., Bowers, C.: Measuring the flow experience of
gamers: An evaluation of the dfs-2. Computers in Human Behavior 28(6), 2306–
2312 (2012)
47. Reilly, W.S.: Believable social and emotional agents. Tech. rep., Carnegie-Mellon
Univ Pittsburgh pa Dept of Computer Science (1996)
48. Roseman, I.J., Smith, C.A.: Appraisal theory. Appraisal processes in emotion: The-
ory, methods, research pp. 3–19 (2001)
49. Roseman, I.J., Spindel, M.S., Jose, P.E.: Appraisals of emotion-eliciting events:
Testing a theory of discrete emotions. Journal of personality and social psychology
59(5), 899 (1990)
50. Smith, C.A., Ellsworth, P.C.: Patterns of cognitive appraisal in emotion. Journal
of personality and social psychology 48(4), 813 (1985)
51. Stahlke, S.N., Mirza-Babaei, P.: Usertesting without the user: Opportunities and
challenges of an ai-driven approach in games user research. Computers in Enter-
tainment (CIE) 16(2), 1–18 (2018)
52. Utting, M., Legeard, B.: Practical model-based testing: a tools approach. Elsevier
(2010)
53. Vermeeren, A.P., Law, E.L.C., Roto, V., Obrist, M., Hoonhout, J., Väänänen-
Vainio-Mattila, K.: User experience evaluation methods: current state and devel-
opment needs. In: Proceedings of the 6th Nordic conference on human-computer
interaction: Extending boundaries. pp. 521–530 (2010)
54. Zarembo, I.: Analysis of artificial intelligence applications for automated testing of
video games. In: ENVIRONMENT. TECHNOLOGIES. RESOURCES. Proceed-
ings of the International Scientific and Practical Conference. vol. 2, pp. 170–174
(2019)
55. Zhao, Y., Borovikov, I., de Mesentier Silva, F., Beirami, A., Rupert, J., Somers,
C., Harder, J., Kolen, J., Pinto, J., Pourabolghasem, R., et al.: Winning is not ev-
erything: Enhancing game development with intelligent agents. IEEE Transactions
on Games 12(2), 199–212 (2020)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Opportunistic Monitoring of Multithreaded Programs
Chukri Soueidi, Antoine El-Hokayem, and Yliès Falcone

Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France
{chukri.soueidi,antoine.el-hokayem,ylies.falcone}@univ-grenoble-alpes.fr
1 Introduction
Guaranteeing the correctness of concurrent programs often relies on dynamic analysis
and verification approaches. Some approaches target generic concurrency errors such
as data races [29, 37], deadlocks [11], and atomicity violations [28, 47, 57]. Others
target behavioral properties such as null-pointer dereferences [27] and typestate viola-
tions [36, 38, 55], and more generally order violations, with runtime verification [42]. In
this paper, we focus on the runtime monitoring of general behavioral properties, target-
ing violations that cannot be traced back to classical concurrency errors.
Runtime verification (RV) [9, 24, 25, 34, 42], also known as runtime monitoring, is
a lightweight formal method that allows checking whether a run of a system respects
a specification. The specification formalizes a behavioral property and is written in a
suitable formalism based for instance on temporal logic such as LTL or finite-state
machines [1, 45]. Monitors are synthesized from the specifications, and the program is
instrumented with additional code to extract events from the execution. These extracted
events generate the trace, which is fed to the monitors. From the monitor's perspective,
the program is a black box and the trace is the sole system information provided.
To model the execution of a concurrent program, verification techniques choose
their trace collection approaches differently based on the class of targeted properties.
When properties require reasoning about concurrency in the program, causality must
be established during trace collection to determine the happens-before relation [40] be-
tween events. Data race detection techniques [29, 37], for instance, require the causal
ordering to check for concurrent accesses to shared variables, as do predictive ap-
proaches targeting behavioral properties such as [19, 38, 55] in order to explore other
feasible executions. Causality is best expressed as a partial order over events. Partial
orders are compatible with various formalisms for the behavior of concurrent programs
such as weak memory consistency models [2, 4, 46], Mazurkiewicz traces [32, 48], par-
allel series [43], Message Sequence Chart graphs [49], and Petri nets [50]. However,
while the program behaves non-sequentially, its observation and trace collection are se-
quential. Collecting partial-order traces often relies on vector clock algorithms to time-
stamp events [3,16,47,53] and requires blocking the execution to collect synchronization
actions such as locks, unlocks, reads, and writes. Hence, existing techniques that can
reason about concurrent events are expensive to use in an online monitoring setup. In-
deed, many of them are intended for the design phase of the program rather than for
production environments (see Section 5).
Other monitoring techniques rely on total-order formalisms such as LTL and fi-
nite-state machines and require linear traces to be fed to the monitors. As such, they
immediately capture linear traces from a concurrent execution without reestablishing
causality. Most of the top existing tools for the online monitoring of Java programs
(based on the first three editions of the Competition on Runtime Verification [7, 8,
26, 52]), including Java-MOP [18, 30] and Tracematches [5], provide multithreaded
monitoring support using one or both of the following two modes. The per-thread
mode specifies that monitors are only associated with a given thread, and receive all
events of that thread. This boils down to classical RV of single-threaded programs,
treating each thread as an independent program. In this case, monitors are unable to
check properties that involve events across threads. The global monitoring mode
spawns a global monitor and ensures that the events from different threads are fed to
a central monitor atomically, by utilizing locks, to avoid data races. As such, the
monitored program execution is linearized so that it can be processed by the monitors.
In addition to introducing additional synchronization between threads, thereby inhibit-
ing parallelism, this monitoring mode forces events of interest to be totally ordered
across the entire execution, which oversimplifies the execution and ignores concurrency.
Figure 1 illustrates a high-level view of a concurrent execution fragment of 1-Writer
2-Readers, where a writer thread writes to a shared variable, and two other reader
threads read from it. The reader threads share the same lock and can read concurrently
once one of them acquires it, but no thread can write or read while a write is oc-
curring. We only depict the read/write events and omit lock acquires and releases for
brevity. In this execution, the writer acquires the lock first and writes (event 1); then,
after one of the reader threads acquires the lock, they both read concurrently. The first
reader performs 3 reads (events 2, 4, and 5), while the second reader performs 2 reads
(events 3 and 6); after that, the writer acquires the lock and writes again (event 7). A user
Fig. 1: Execution fragment of 1-Writer 2-Readers. Double circle: write, normal: read.
Numbers distinguish events. Events 2 and 6 (shaded) are example concurrent events.
A central observation we made is that when the program is free from generic con-
currency errors such as data races and atomicity violations, a monitoring approach can
be opportunistic and utilize the synchronization available in the program to reason about
high-level behavioral properties. In the previous example, we know that reads and writes
are guarded by a lock and do not execute concurrently (assuming we checked for data
races). We also know that the relative ordering of the reads among themselves is not
important to the property, as we are only interested in checking that they all read the
latest write. As such, instead of blocking the execution at each of the 7 events to safely
invoke a global monitor and check the property, we can make thread-local observa-
tions and only invoke the global monitor once either one of the readers acquires the
lock or the writer acquires it (only 3 events). In this paper, we therefore propose
an approach to opportunistic runtime verification. We aim to (i) provide an approach
that enables users to arbitrarily reason about concurrency fragments in the program,
(ii) be able to monitor properties online without the need to record the execution, (iii)
utilize the existing tools and formalisms prevalent in the RV community, and (iv) do so
efficiently without imposing additional synchronization.
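The following minimal Java sketch illustrates this idea on the 1-Writer 2-Readers example; the classes and the simplified property check are our illustration of the scheme, not the paper's implementation (which builds on instrumented events and synthesized monitors), and it only invokes the global check at the writer's lock acquisition.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of opportunistic monitoring for 1-Writer 2-Readers: readers record
 *  observations thread-locally; the global monitor runs only at a synchronization
 *  point the program already has, not at each of the 7 events. */
final class OpportunisticReadersWriters {
    private final ReadWriteLock rw = new ReentrantReadWriteLock();
    private int shared = 0;
    private int writes = 0;
    private final int[] localReads = new int[2]; // per-reader local observations

    void read(int readerId) {
        rw.readLock().lock();        // readers may hold the read lock concurrently
        try {
            localReads[readerId]++;  // local observation only: no monitor call here
        } finally { rw.readLock().unlock(); }
    }

    void write(int v) {
        rw.writeLock().lock();       // synchronization point: safe to check globally
        try {
            globalMonitor();         // one monitor invocation per region
            shared = v;
            writes++;
            localReads[0] = localReads[1] = 0;
        } finally { rw.writeLock().unlock(); }
    }

    /** "All readers read the latest write": checked only at region boundaries. */
    private void globalMonitor() {
        if (writes > 0 && (localReads[0] == 0 || localReads[1] == 0))
            System.err.println("Violation: a reader missed write #" + writes);
    }
}
```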
Definition 1 (Action). An action is a tuple ⟨lbl, id, ctx⟩, where lbl is a label, id is a
unique identifier, and ctx is the context of the action.
The label captures an instruction name, function name, or specific task information,
depending on the granularity of actions. Since the action is a runtime object, we use id
to distinguish two executions of the same syntactic element. Finally, the context (ctx)
is a set containing dynamic contexts such as a thread identifier (threadid), a process
identifier (pid), a resource identifier (resid), or a memory address. We use the notation
$id.lbl^{threadid}_{resid}$ to denote an action, omitting resid when absent, and id when there is no
ambiguity. Furthermore, we use the notation a.threadid for a given action a to retrieve
the thread identifier in the context, and a.ctx(key) to retrieve any element in the context
associated with key.
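A possible Java encoding of Definition 1, given here only as an illustration (the record and accessor layout are our assumptions):

```java
import java.util.Map;

/** A sketch of Definition 1. The field names mirror the paper's notation
 *  (lbl, id, ctx); the Java encoding itself is an assumption. */
record Action(String lbl, long id, Map<String, Object> ctx) {

    /** a.threadid: the thread identifier stored in the context. */
    long threadid() { return (Long) ctx.get("threadid"); }

    /** a.ctx(key): any other context element, e.g., "resid" or "pid". */
    Object ctx(String key) { return ctx.get(key); }
}
```

For instance, new Action("l", 1, Map.of("threadid", 1L, "resid", "s")) would encode the lock acquire written $1.l^1_s$ in the paper's notation.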
Definition 2 (Concurrent Execution). A concurrent execution is a partially ordered
set of actions, that is, a pair ⟨A, →⟩, where A is a set of actions and → ⊆ A × A is a
partial order over A.
Two actions a1 and a2 are related (i.e., ⟨a1, a2⟩ ∈ →) if a1 happens before a2.
on a shared variable whose resource identifier is omitted for brevity. Second, the readers
acquire the lock s and perform a read on the same variable. Third, the writer performs
a second write on the variable.
In RV, we often do not capture the entire concurrent execution but are interested
in gathering a trace of its relevant parts. In our approach, a trace is also a con-
current execution, defined over a subset of actions. Since the trace is the input to any
RV technique, we are interested in relating a trace to the concurrent execution, while
focusing on a subset of actions. For this purpose, we introduce the notions of soundness
and faithfulness. We first define trace soundness. Informally, a concurrent
execution is a sound trace if it does not provide false information about the execution.
Definition 3 (Trace Soundness). A concurrent trace tr = ⟨A_tr, →_tr⟩ is said to be a
sound trace of a concurrent execution e = ⟨A, →⟩ (written snd(e, tr)) iff (i) A_tr ⊆ A
and (ii) →_tr ⊆ →.
Intuitively, to be sound, a trace (i) should not capture an action not found in the
execution, and (ii) should not relate actions that are unrelated in the execution. While
a sound trace provides no incorrect information on the order, it can still be missing in-
formation about the order. In this case, we also want to express the ability of a trace to
capture all relevant order information. Informally, a faithful trace contains all informa-
tion on the order of events that occurred in the program execution.
Definition 4 (Trace Faithfulness). A concurrent trace tr = ⟨A_tr, →_tr⟩ is said to be
faithful to a concurrent execution e = ⟨A, →⟩ (written faith(e, tr)) iff →_tr ⊇ (→
∩ A_tr × A_tr).
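Both definitions reduce to set inclusions and can be checked directly once actions and order pairs are materialized as finite sets; a minimal sketch, assuming such a finite-set encoding (our simplification):

```java
import java.util.Map;
import java.util.Set;

/** Sketch of Definitions 3 and 4 over a finite-set encoding of executions:
 *  a set of action identifiers plus a set of happens-before pairs. */
record Poset(Set<Long> actions, Set<Map.Entry<Long, Long>> order) {

    /** snd(e, tr): the trace adds no actions and no order pairs (Def. 3). */
    static boolean sound(Poset e, Poset tr) {
        return e.actions().containsAll(tr.actions())
            && e.order().containsAll(tr.order());
    }

    /** faith(e, tr): every order pair of e between trace actions is kept (Def. 4). */
    static boolean faithful(Poset e, Poset tr) {
        return e.order().stream()
                .filter(p -> tr.actions().contains(p.getKey())
                          && tr.actions().contains(p.getValue()))
                .allMatch(tr.order()::contains);
    }
}
```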
3 Opportunistic Monitoring
We start by distinguishing threads and events from the execution. We then define
scopes, which allow us to reason about properties over concurrent regions. We then devise
a generic approach to evaluate scope properties and perform monitoring.
Threads are typically created at runtime and have a unique identifier. We denote the set
of all thread ids by TID. They are subject to change from one execution to another, and
it is not known in advance how many threads will be spawned during the execution. As
such, it is important to design specifications that can handle threads dynamically.
One can assign the spawn action of a reader (resp. writer) to be the method invocation of
Reader.run (resp. Writer.run). The function spawn : S → T assigns a thread type to a spawn
action. The threads that match a given type are determined based on the spawn action(s)
present during the execution. We note that a thread can have multiple types. To reference
all threads assigned a given type, we use the function pool : T → 2^TID. That is, given a
type t and a thread with thread id tid, we have tid ∈ pool(t) iff ∃a ∈ S : spawn(a) =
t ∧ a.threadid = tid. This allows a thread to have multiple types so that properties
can operate on different events in the same thread.
Events As properties are defined over events, actions are typically abstracted into
events. As such, we define for each thread type t ∈ T the alphabet of events E_t. The set E_t
contains all the events that can be generated from actions for the particular thread type
t ∈ T. The empty event ε is a special event indicating that no event is matched.
Then, we assume a total function ev_t : A → {ε} ∪ E_t. The implementation of ev re-
lies on the specification formalism used; it is capable of generating events based on the
context of the action itself. For example, the conversion can utilize the runtime context
of actions to generate parametric events when needed. We illustrate a function ev that
matches using the label of an action in Ex. 2.
Example 2 (Events). We identify for readers-writers (Ex. 1) two thread types:
$T_{rw} \stackrel{def}{=} \{reader, writer\}$. We are interested in the events $E_{reader} \stackrel{def}{=} \{read\}$ and
$E_{writer} \stackrel{def}{=} \{write\}$. For a specification at the level of a given thread, we have either a
reader or a writer, and the event associated with the reader (resp. writer) is read (resp. write).

$$ev_{reader}(a) \stackrel{def}{=} \begin{cases} read & \text{if } a.lbl = \text{``r''} \\ \varepsilon & \text{otherwise} \end{cases} \qquad ev_{writer}(a) \stackrel{def}{=} \begin{cases} write & \text{if } a.lbl = \text{``w''} \\ \varepsilon & \text{otherwise} \end{cases}$$
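Reusing the hypothetical Action encoding from the sketch after Definition 1, the two ev functions of Example 2 amount to a label match; here the string EPSILON stands in for the empty event ε:

```java
/** Sketch of the ev functions of Example 2; Action is the hypothetical
 *  record from the earlier sketch. */
final class Events {
    static final String EPSILON = "ε";

    /** ev_reader: maps a read action (label "r") to the event read. */
    static String evReader(Action a) { return "r".equals(a.lbl()) ? "read" : EPSILON; }

    /** ev_writer: maps a write action (label "w") to the event write. */
    static String evWriter(Action a) { return "w".equals(a.lbl()) ? "write" : EPSILON; }
}
```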
We now define the notion of a scope. A scope defines a projection of the concurrent
execution to delimit concurrent regions and allow verification to be performed at the
level of regions instead of the entire execution.
Scope Region A scope region selects the actions of the concurrent execution delimited by
two successive SAs. We define two special synchronizing actions begin, end ∈ A,
common to all scopes, that are needed to evaluate the first and last regions. These actions
refer to the beginning and end of the concurrent execution, respectively.
Definition 5 (Scope Regions). Given a scope s and an associated index function idx_s :
SA_s → ℕ∖{0}, the scope regions are given by the function R_s : codom(idx_s) ∪ {0, |idx_s| +
1} → 2^A, which maps i to the set of actions lying between the i-th and (i+1)-th
synchronizing actions,
where issync(a, i) holds iff sync_s(a) = ⊤ and idx_s(a) = i.
R_s(i) is the i-th scope region: the set of all actions that happened between the two syn-
chronizing actions a and a′ with idx_s(a) = i and idx_s(a′) = i + 1, taking into account
the start and end of the program execution (i.e., the actions begin and end, respectively).
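Operationally, over a linearization of the observed actions, region construction amounts to splitting at the scope's SAs; a minimal sketch under that simplification (the paper's definition is over a partial order, and we exclude the SAs themselves from the regions, as in Example 3):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch of R_s (Def. 5) over a linearized action list: each synchronizing
 *  action of scope s closes the current region and opens the next one. */
final class ScopeRegions {

    static List<List<Action>> regions(List<Action> actions, Predicate<Action> syncS) {
        List<List<Action>> result = new ArrayList<>();
        List<Action> current = new ArrayList<>();   // region 0: after the virtual 'begin'
        for (Action a : actions) {
            if (syncS.test(a)) {                    // an SA of scope s: region boundary
                result.add(current);
                current = new ArrayList<>();
            } else {
                current.add(a);
            }
        }
        result.add(current);                        // final region, up to the virtual 'end'
        return result;
    }
}
```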
Example 3 (Scope regions). For readers-writers (Ex. 1), we consider the resource service
lock (s) to be the one of interest, as it delimits the concurrent regions that allow
either a writer to write or readers to read. We label the scope by res for the remainder of
the paper. The synchronizing predicate sync_res selects all actions with label l (lock ac-
quire) and with the lock id s present in the context of the action. The obtained sequence
of SAs is $0.l^0_s \cdot 1.l^1_s \cdot 2.l^0_s$. The value of idx_res for each of the obtained SAs is respec-
tively 1, 2, and 3. Every lock acquire delimits the regions of the concurrent execution.
The region k + 1 includes all actions between the two lock acquires $0.l^0_s$ and $1.l^1_s$; that
is, $R_{res}(k+1) = \{0.w^0, 0.u^0_s, 0.u^0_t, 1.l^1_t, 0.l^1_c, 0.i^1\}$. The region k + 2 contains two
concurrent reads: r1 and r2.
Definition 6 (Scope fragment). The scope fragment associated with a scope region
Rs(i) is defined as Fs(i) ≝ ⟨Rs(i), → ∩ (Rs(i) × Rs(i))⟩.
Proposition 1 states that for a given scope, any fragment (obtained using Fs) is a sound
and faithful trace of the concurrent execution. This is ensured by construction using
Definitions 5 and 6, which follow the same principles as the definitions of soundness
(Definition 3) and faithfulness (Definition 4).
Remark 1. In this paper, scope regions are defined by the user by selecting the synchronizing
predicate as part of the specification. Given a property, regions should delimit
events whose order is important for the property. For instance, for a property specifying
that “between each write, at least one read should occur”, the scope regions should
delimit read versus write events. Delimiting the read events themselves, performed by
different threads, is not significant. How to analyze the program to find and suggest
scopes for the user that are suitable for monitoring a given property is an interesting
challenge that we leave for future work. Moreover, we assume the program is properly
synchronized and free from data races.

Fig. 3: Projected actions using the scope and local properties of 1-Writer 2-Readers. The
action labels l, w, r indicate respectively lock, write, and read. Filled actions indicate
actions for which function ev for the thread type returns an event. Actions with a
patterned background indicate the SAs for the scope.
Local Properties In a given scope region, we determine properties that will be checked
locally on each thread. A thread-local monitor checks a local property independently for
each given thread. These properties can be seen as the analogue of per-thread monitoring
applied between two SAs. For a specific thread, we have a guaranteed total order on
its local actions. This ensures that local properties are compatible with existing RV
techniques and formalisms and can be checked with them. We refer to those properties
as local properties.
Definition 7 (Local property). A local property is a tuple ⟨type, EVS, RT, eval⟩ with:
type ∈ T the thread type the property applies to, EVS the set of events the property is
defined over, RT the type of verdicts returned, and eval the evaluation function producing
verdicts in RT.
We use the dot notation: for a given property prop = ⟨type, EVS, RT, eval⟩, we use
prop.type, prop.EVS, prop.RT, and prop.eval, respectively.
Example 4 (At least one read). The property “at least one read”, defined for the thread
type reader, states that a reader must perform at least one read event. It can be
expressed using classical LTL3 [10] (a variant of linear temporal logic with finite-trace
semantics commonly used in RV) as ϕ1r ≝ F(read) over the set of atomic propositions
{read}. Let LTL3^AP_ϕ denote the evaluation function of LTL3 using the set of atomic
propositions AP and a formula ϕ, and let B3 = {⊤, ⊥, ?} be the truth domain, where ?
denotes an inconclusive verdict. To check it on readers, we specify the local property
⟨reader, {read}, B3, LTL3^{read}_ϕ1r⟩. Similarly, we can define the local specification
for at least one write.
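For illustration, such a thread-local monitor for ϕ1r can be sketched as a two-verdict automaton over B3; this reuses the Event enum from the sketch after Ex. 2 and is ours, not the paper's implementation:

    // Sketch of a thread-local LTL3-style monitor for F(read): the verdict
    // stays inconclusive (?) until a read event is observed, then becomes ⊤.
    final class AtLeastOneReadMonitor {
        enum Verdict { TOP, BOTTOM, INCONCLUSIVE } // models B3

        private Verdict verdict = Verdict.INCONCLUSIVE;

        void step(Event e) {
            if (e == Event.READ) verdict = Verdict.TOP;
        }

        Verdict verdict() { return verdict; } // read at the end of a scope region
    }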
Scope Trace To evaluate a local property, we restrict the trace to the actions of a given
thread contained within a scope region. A scope trace is analogous to acquiring the
trace for per-thread monitoring [5, 30] in a given scope region (see Definition 5). The
scope trace is defined as a projection of the concurrent execution, on a specific thread,
selecting actions that fall between two synchronizing actions.
Definition 8 (Scope trace). Given a local property p = ⟨type, EVS, RT, eval⟩ in a
scope region Rs with index i, a scope trace is obtained using the projection function
proj, which outputs the sequence of actions of length n for a given thread with tid ∈
TID that are associated with events for the property. We have:

proj(tid, i, p, Rs) ≝ filter(a0) · . . . · filter(an) if i ∈ dom(Rs) ∧ tid ∈ pool(type), and E otherwise,

with, for all ℓ ∈ [0, n]: filter(aℓ) ≝ evtype(aℓ) if evtype(aℓ) ∈ EVS, and E otherwise.
Scope State A scope state aggregates the result of evaluating all local properties for a
given scope region. To define a scope state, we consider a scope s with a list of local
properties ⟨prop0, . . . , propn⟩ of return types ⟨RT0, . . . , RTn⟩, respectively. Since a
local specification can apply to an arbitrary number of threads during the execution, for
each specification the result is a dictionary binding a thread id to the return
type (represented as a total function). We use the special value na to indicate
that the property does not apply to a thread (as the thread type does not match
the property). We can now define the return type of evaluating all local properties as
RI ≝ ⟨TID → {na} ∪ RT0, . . . , TID → {na} ∪ RTn⟩. Function states : RI → Is
processes the result of evaluating local properties to create a scope state in Is.
Example 6 (Scope state). We illustrate the scope state by evaluating the properties “at
least one read” (pr) and “at least one write” (pw) (Ex. 4) on scope region k + 2 in
Fig. 3. We have TID = {0, 1, 2}; the trace of each reader is (read), and the trace of the
writer is empty (i.e., no write was observed). As such, for property pr (resp. pw), the
result of the evaluation is [0 ↦ na, 1 ↦ ⊤, 2 ↦ ⊤] (resp. [0 ↦ ?, 1 ↦ na, 2 ↦ na]).
We notice that for property pr, the thread of type writer evaluates to na, as it is not
concerned with the property.
We now consider the state creation function states. We consider the following atomic
propositions: activereader, activewriter, allreaders, and onewriter. They indicate,
respectively: at least one thread of type reader performed a read; at least one thread
of type writer performed a write; all threads of type reader (|pool(reader)| many) performed
at least one read; and at most one thread of type writer performed a write. The scope
state in this case is a list of four Boolean values, one per atomic proposition, in this
order. As such, by counting the number of threads associated with ⊤, we can compute
the Boolean value of each atomic proposition. For region k + 2, we have the following
state: ⟨⊤, ⊥, ⊤, ⊥⟩. The scope states form a totally ordered sequence. For k + 1, k + 2,
and k + 3, we have the sequence ⟨⊥, ⊤, ⊥, ⊤⟩ · ⟨⊤, ⊥, ⊤, ⊥⟩ · ⟨⊥, ⊤, ⊥, ⊤⟩.
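To make this counting concrete, the four propositions can be derived from the per-thread verdicts; a Java sketch with our names, where na is modelled as absence from the map and onewriter is computed as exactly one active writer, which reproduces the state sequence above:

    import java.util.Map;

    // Sketch: derive the four atomic propositions of Ex. 6 from the per-thread
    // verdicts of pr ("at least one read") and pw ("at least one write").
    // Threads to which a property does not apply (na) are absent from the map.
    final class ScopeStateBuilder {
        static boolean[] build(Map<Long, Boolean> readerGotRead,  // tid -> reached ⊤?
                               Map<Long, Boolean> writerGotWrite,
                               int readerPoolSize) {
            long readers = readerGotRead.values().stream().filter(Boolean::booleanValue).count();
            long writers = writerGotWrite.values().stream().filter(Boolean::booleanValue).count();
            boolean activereader = readers >= 1;
            boolean activewriter = writers >= 1;
            boolean allreaders   = readers == readerPoolSize;
            boolean onewriter    = writers == 1; // matches the states of Ex. 6
            return new boolean[] { activereader, activewriter, allreaders, onewriter };
        }
    }

For region k + 2, build([1 ↦ ⊤, 2 ↦ ⊤], [0 ↦ ⊥], 2) yields ⟨⊤, ⊥, ⊤, ⊥⟩, as in the example.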
We can now formally define a scope by associating an identifier with a synchronizing
predicate, a list of local properties, a scope state creation function, and a scope property
evaluation function. We denote by SID the set of scope identifiers.
Definition 9 (Scope). A scope is a tuple ⟨sid, syncsid, ⟨prop0, . . . , propn⟩, statesid,
sevalsid⟩, where:
– sid ∈ SID is the scope identifier;
– syncsid : A → B2 is the synchronizing predicate that determines SAs;
– ⟨prop0, . . . , propn⟩ is a list of local properties (Definition 7);
– statesid : ⟨TID → {na} ∪ prop0.RT, . . . , TID → {na} ∪ propn.RT⟩ → Is is the
scope state creation function;
– sevalsid : N × Is → O is the evaluation function of the scope property over a
sequence of scope states.
For each scope region, we evaluate the local properties and then use statesid to generate
the scope state for the region. After producing the sequence of scope states, the function
sevalsid evaluates the property at the level of the scope.
Definition 10 (Evaluating a scope property). Using the synchronizing predicate syncsid,
we obtain the regions Rsid(i) for i ∈ [0, m] with m = |idxsid| + 1. The evaluation of
a scope property (denoted res) for the scope ⟨sid, syncsid, ⟨prop0, . . . , propn⟩, statesid,
sevalsid⟩ is computed as: ∀tid ∈ TID, ∀j ∈ [0, n]
Example 7 (Evaluating scope properties). We use LTL to formalize three scope properties
based on the scope states from Ex. 6, operating on the alphabet {activereader,
activewriter, allreaders, onewriter}:
- Mutual exclusion between readers and writers: ϕ0 ≝ activewriter XOR activereader.
- Mutual exclusion between writers: ϕ1 ≝ activewriter =⇒ onewriter.
- All readers must read a written value: ϕ2 ≝ activereader =⇒ allreaders.
Therefore the specification is G(ϕ0 ∧ ϕ1 ∧ ϕ2). We recall that a scope state is a list
of Boolean values for the atomic propositions in the following order: activereader,
activewriter, allreaders, and onewriter. The sequence of scope states from Ex. 6,
⟨⊥, ⊤, ⊥, ⊤⟩ · ⟨⊤, ⊥, ⊤, ⊥⟩ · ⟨⊥, ⊤, ⊥, ⊤⟩, complies with the specification.
Correctness of Scope Evaluation We assume that the SAs selected by the user in the
specification are totally ordered. This ensures that the order of the scope states is a total
order, which is then by assumption sound and faithful to the order of the SAs. However, it
is important to ensure that the actions needed to construct the state are captured faith-
fully and in a sound manner. We capture the partial order as follows: (1) actions of
different threads are captured in a sound and faithful manner between two successive
SAs (Proposition 1), and (2) actions of the same thread are captured in a sound and
faithful manner for that thread (Proposition 2). Furthermore, we are guaranteed by Definition 10
that each local property evaluation function is passed all actions relevant to
the given thread (and no others). As such, for the granularity level of the SAs, we obtain
all relevant order information.
Fig. 4: Scope channel for three successive scope regions: for each region and each thread
id (0, 1, 2), the reader and writer rows record the verdict written by the corresponding
thread-local monitor; - marks the absence of a monitor and na an unwritten slot.
Scope channel. The scope channel stores information about the scope states during the
execution. We associate each scope with a scope channel that has its own timestamp.
The channel provides each thread-local monitor with an exclusive memory slot to write
its result when evaluating local properties. Each thread can only write to its associated
slot in the channel. The timestamp of the channel is readable by all threads participating
in the scope but is only incremented by the scope monitor, as we will see.
Example 8 (Scope channel). Figure 4 displays the channel associated with the scope
monitoring discussed in Ex. 6. For each scope region, the channel provides each monitor
an exclusive memory slot to write its result (if the thread is not sleeping). The slots
marked with a dash (-) indicate the absence of monitors. Furthermore, na indicates that
the thread was given a slot but did not write anything in it (see Definition 10).
For a timestamp t, local monitors no longer write any information for scope states
with a timestamp smaller than t; this keeps such states consistent, so they can safely
be read by any monitor associated with the scope. While this is outside the scope of this
paper, it allows monitors to access past data of other monitors consistently.
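A minimal sketch of such a channel in Java, assuming integer timestamps and per-thread slots (names are ours, not the paper's implementation):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of a scope channel: each thread owns one slot per timestamp;
    // only the scope monitor advances the timestamp.
    final class ScopeChannel<V> {
        private final AtomicInteger timestamp = new AtomicInteger(0);
        // timestamp -> (tid -> locally computed verdict)
        private final Map<Long, Map<Long, V>> slots = new ConcurrentHashMap<>();

        int timestamp() { return timestamp.get(); }  // readable by all threads

        // Exclusive write: each thread writes only to its own slot.
        void write(long tid, V verdict) {
            slots.computeIfAbsent((long) timestamp.get(), t -> new ConcurrentHashMap<>())
                 .put(tid, verdict);
        }

        Map<Long, V> read(long t) {                  // slots of a (past) region
            return slots.getOrDefault(t, Map.of());
        }

        void advance() { timestamp.incrementAndGet(); } // scope monitor only
    }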
Thread-local monitors. A thread-local monitor checks a local property for the threads of
a matching type. Multiple such monitors can exist on a given thread, depending on the
properties to check. These monitors are spawned on the creation of the thread. Each
monitor receives an event, performs checking, and can write its result in its associated
scope channel at the current timestamp.
Scope monitors. Scope monitors are responsible for checking the property at the level of
the scope. When any of the threads associated with the scope reaches a synchronizing
action, that thread invokes the scope monitor. The scope monitor relies
on the scope channel (shared among all threads) to access all observations.
Additional memory can be allocated for its own state, but it has to be shared among
all threads associated with the scope. The scope monitor is invoked atomically after
reaching the scope synchronizing action. First, it constructs the scope state based on the
results of the thread-local monitors stored in the scope channel. Second, it invokes the
verification procedure on the generated state. Finally, before completing, it increments
the timestamp associated with the scope channel.
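These three steps can be sketched as follows in Java, reusing the ScopeChannel above; the state construction and verification functions are left abstract and all names are ours:

    import java.util.Map;
    import java.util.function.BiConsumer;
    import java.util.function.Function;

    // Sketch of the scope-monitor step, run atomically when a thread of the
    // scope reaches a synchronizing action.
    final class ScopeMonitor<V, S> {
        private final ScopeChannel<V> channel;
        private final Function<Map<Long, V>, S> state; // plays the role of statesid
        private final BiConsumer<Integer, S> seval;    // plays the role of sevalsid

        ScopeMonitor(ScopeChannel<V> channel,
                     Function<Map<Long, V>, S> state, BiConsumer<Integer, S> seval) {
            this.channel = channel;
            this.state = state;
            this.seval = seval;
        }

        void onSynchronizingAction() {
            int t = channel.timestamp();
            S scopeState = state.apply(channel.read(t)); // 1. build the scope state
            seval.accept(t, scopeState);                 // 2. evaluate the scope property
            channel.advance();                           // 3. increment the timestamp
        }
    }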
4.1 Readers-Writers
Experiment setup. For this experiment, we utilize the standard LTL3 semantics defined
over the B3 verdict domain. As such, all the local and scope property types are B3 . We
instrument readers-writers to insert our monitors and compare our approach to global
monitoring using a custom aspect written in AspectJ. In total, we have three scenarios:
non-monitored, global, and opportunistic. In the first scenario (non-monitored), we do
not perform monitoring. In the second and third scenarios, we perform global and op-
portunistic monitoring. We recall that global monitoring introduces additional locks at
the level of the monitor for all events that occur concurrently. We make sure that the
program is well synchronized and data race free with RVPredict [37].
Measures. To evaluate the overhead of our approach, we are interested in defining pa-
rameters to characterize concurrency regions found in readers-writers. We identify two
parameters: the number of readers (nreaders), and the width of the concurrency region
(cwidth). On the one hand, nreaders determines the maximum number of threads concurrently
verifying local properties in a given concurrency region. On the other hand, cwidth
determines the number of reads each reader performs concurrently when acquiring the
lock. Parameter cwidth is measured in the number of read events generated. By increasing
the size of the concurrency regions, we increase lock contention when multiple concurrent
events cause a global monitor to lock. We use a number of writers equal to nreaders,
with nreaders ∈ {1, 3, 7, 15, 23, 31, 63, 127} and cwidth ∈ {1, 5, 10, 15, 30, 60, 100, 150}.
² The artifact for this paper is available [56].
[Scatter plots with linear regression curves: execution time (ms, log scale) vs. number of
readers (1, 3, 7, 15, 23, 31, 63, 127) for the Non-Monitored, Global, and Opportunistic
approaches.]

Fig. 5: Execution time for readers-writers for non-monitored, global, and opportunistic
monitoring when varying the number of readers.
We perform a total of 100,000 writes and 400,000 reads, where reads are distributed
evenly across readers. We measure the execution time (in ms) of 50 runs of the program
for each of the parameters and scenarios.
Preliminary results. We report the results using the averages while providing the scat-
ter plots with linear regression curves in Figures 5 and 6. Figure 5 shows the overhead
when varying the number of readers (nreaders). We notice that for the base program
(non-monitored), the execution time increases as lock contention overhead becomes
more prominent and the JVM is managing more threads. In the case of global monitoring,
as expected, we notice an increasing overhead with the increase in the number
of threads. As more readers are executing, the program is blocked on each read,
which is supposed to be concurrent. For opportunistic monitoring, we notice a stable runtime in
comparison to the original program, as no additional locks are used; only the
delay of evaluating the local and scope properties remains. Figure 6 shows the overhead when
varying the width of the concurrency region (cwidth). We observe that for the base
program, the execution time decreases as more reads can be performed concurrently
without contention on the shared resource lock. In the case of global monitoring, we
also notice a slight decrease, while for opportunistic monitoring, we see a much greater
decrease. By increasing the number of concurrent events in a concurrency region, we
[Scatter plots with linear regression curves: execution time (ms, log scale) vs. concurrency
region size (1–150 events) for the Non-Monitored, Global, and Opportunistic approaches.]

Fig. 6: Execution time varying the number of events in the concurrency region.
highlight the overhead introduced by locking the global monitor. We recall that a global
monitor must lock to linearize the trace and, as such, interferes with concurrency. This
can be seen by comparing the curves for global and opportunistic monitoring: opportunistic
closely follows the speedup of the non-monitored program, while global monitoring
is much slower. For opportunistic monitoring, we thus see the expected positive
performance payoff when events in concurrency regions are dense.
[Bar chart: execution time (ms, log scale) for the 2-bakery, mergesort, n-bakery,
ping-pong, and prods-cons benchmarks.]
We observe less overhead with opportunistic monitoring on 2-bakery and more overhead
with opportunistic on n-bakery. This is because of the small concurrency region (cwidth),
which is equal to 1. As such, the overhead of evaluating local and scope monitors by
several threads, with a cwidth of 1, exceeds the gain in performance achieved by our
approach, making such programs unfitting for opportunistic monitoring.
We also monitor a textbook example, the Ping-Pong algorithm [33], used for
instance in databases and routing protocols. The algorithm synchronizes two threads,
using reads and writes on shared variables and busy waiting, producing events
pi for the pinging thread and po for the ponging thread. We monitor the alternation
property specified as ϕ ≝ (ping =⇒ X pong) ∧ (pong =⇒ X ping). We also
include a classic producer-consumer program from [35], which uses a concurrent FIFO
queue based on locks and conditions. We monitor the precedence property, which specifies
the requirement that a consume (event c) is preceded by a produce (event p), expressed
in LTL as ¬c W p. For both benchmarks above, we observe less overhead when monitoring
with opportunistic, since no additional locks are enforced on the execution.
We also monitor a parallel mergesort algorithm, a divide-and-conquer algorithm
to sort an array. The algorithm uses the fork-join framework [41], which recursively
splits the array into sorting tasks that are handled by different threads. We
are interested in monitoring whether a forked task returns a correctly sorted array
before a merge is performed. The monitoring step is expensive and linear in the size of the
array, as it involves scanning it. For opportunistic monitoring, we use the joining of two
subtasks as our synchronizing action and deploy scope monitors at all levels of the recursive
hierarchy. We observe less overhead when monitoring with opportunistic than with global
monitoring, as concurrent threads do not have to wait at each monitoring step.
5 Related Work
GPredict [36] extends the specification formalism past data races to target generic concurrency
properties. GPredict presents a generic approach to reason about behavioral properties and
hence constitutes a monitoring solution when concurrency is present. Notably, GPredict
requires specifying thread identifiers explicitly in the specification. This makes specifications
with multiple threads extremely verbose and unable to handle a dynamic
number of threads. For example, in the case of readers-writers, adding extra readers or
writers requires rewriting the specification and combining events to specify each new
thread. The approach behind GPredict can also be extended to become more expressive,
e.g., to support counting events to account for fairness in a concurrent setting. Furthermore,
GPredict relies on recording a trace of a program before performing an offline
analysis to determine concurrency errors [36]. In addition to being incomplete due to
the possibility of not getting results from the constraint solver, the analysis from GPredict
might also miss some order relations between events, resulting in false positives. In
general, the presented predictive tools are often designed for offline use and, unfortunately,
many of them are no longer maintained.
In [14,15], the authors present monitoring for hyperproperties written in alternation-
free fragments of HyperLTL [20]. Hyperproperties are specified over sets of execution
traces instead of a single trace. In our setup, each thread produces its own trace, and
thus the scope properties we monitor can, for instance, be expressed in HyperLTL. The
occurrence times of events are delimited by concurrency regions, and thus traces consist
of propositions that summarize each concurrency region. We have yet to explore
the applicability of specifying and monitoring hyperproperties within our opportunistic
approach.
References
1. Patterns in property specifications for finite-state verification home page.
https://matthewbdwyer.github.io/psp/patterns.html
2. Adve, S.V., Gharachorloo, K.: Shared memory consistency models: a tutorial. Computer
29(12), 66–76 (Dec 1996)
3. Agarwal, A., Garg, V.K.: Efficient dependency tracking for relevant events in shared-memory
systems. In: Proceedings of the Twenty-Fourth Annual ACM Symposium on Principles of
Distributed Computing. p. 19–28. PODC ’05, Association for Computing Machinery, New
York, NY, USA (2005), https://doi.org/10.1145/1073814.1073818
4. Ahamad, M., Neiger, G., Burns, J.E., Kohli, P., Hutto, P.W.: Causal memory: definitions,
implementation, and programming. Distributed Computing 9(1), 37–49 (Mar 1995)
5. Allan, C., Avgustinov, P., Christensen, A.S., Hendren, L., Kuzins, S., Lhoták, O., de Moor,
O., Sereni, D., Sittampalam, G., Tibble, J.: Adding Trace Matching with Free Variables to
AspectJ. In: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented
Programming, Systems, Languages, and Applications. pp. 345–364. OOPSLA ’05, ACM
(2005)
6. Barringer, H., Falcone, Y., Havelund, K., Reger, G., Rydeheard, D.E.: Quantified Event Au-
tomata: Towards Expressive and Efficient Runtime Monitors. In: Giannakopoulou, D., Méry,
D. (eds.) FM 2012: Formal Methods - 18th International Symposium, Paris, France, Au-
gust 27-31, 2012. Proceedings. Lecture Notes in Computer Science, vol. 7436, pp. 68–84.
Springer (2012), https://doi.org/10.1007/978-3-642-32759-9_9
7. Bartocci, E., Bonakdarpour, B., Falcone, Y.: First international competition on software for
runtime verification. In: Bonakdarpour, B., Smolka, S.A. (eds.) Runtime Verification - 5th
International Conference, RV 2014, Toronto, ON, Canada, September 22-25, 2014. Proceed-
ings. Lecture Notes in Computer Science, vol. 8734, pp. 1–9. Springer (2014)
8. Bartocci, E., Falcone, Y., Bonakdarpour, B., Colombo, C., Decker, N., Havelund, K., Joshi,
Y., Klaedtke, F., Milewicz, R., Reger, G., Rosu, G., Signoles, J., Thoma, D., Zalinescu, E.,
Zhang, Y.: First international competition on runtime verification: rules, benchmarks, tools,
and final results of CRV 2014. International Journal on Software Tools for Technology Trans-
fer (Apr 2017)
9. Bartocci, E., Falcone, Y., Francalanza, A., Reger, G.: Introduction to runtime verification.
In: Bartocci, E., Falcone, Y. (eds.) Lectures on Runtime Verification - Introductory and Ad-
vanced Topics, Lecture Notes in Computer Science, vol. 10457, pp. 1–33. Springer (2018)
10. Bauer, A., Leucker, M., Schallhart, C.: Runtime verification for LTL and TLTL. ACM Trans.
Softw. Eng. Methodol. 20(4), 14:1–14:64 (Sep 2011)
11. Bensalem, S., Havelund, K.: Dynamic deadlock analysis of multi-threaded programs. In:
Proceedings of the First Haifa International Conference on Hardware and Software Ver-
ification and Testing. p. 208–223. HVC’05, Springer-Verlag, Berlin, Heidelberg (2005),
https://doi.org/10.1007/11678779_15
12. Bianchi, F.A., Margara, A., Pezzè, M.: A survey of recent trends in testing concurrent soft-
ware systems. IEEE Transactions on Software Engineering 44(8), 747–783 (2018)
13. Bodden, E., Hendren, L., Lam, P., Lhoták, O., Naeem, N.A.: Collaborative Runtime Verifi-
cation with Tracematches. Journal of Logic and Computation 20(3), 707–723 (Jun 2010)
14. Bonakdarpour, B., Sanchez, C., Schneider, G.: Monitoring hyperproperties by combining
static analysis and runtime verification. In: Leveraging Applications of Formal Methods,
Verification and Validation. Verification: 8th International Symposium, ISoLA 2018, Limas-
sol, Cyprus, November 5-9, 2018, Proceedings, Part II. p. 8–27. Springer-Verlag, Berlin,
Heidelberg (2018), https://doi.org/10.1007/978-3-030-03421-4_2
15. Brett, N., Siddique, U., Bonakdarpour, B.: Rewriting-based runtime verification for
alternation-free hyperltl. In: Proceedings, Part II, of the 23rd International Confer-
ence on Tools and Algorithms for the Construction and Analysis of Systems - Vol-
ume 10206. p. 77–93. Springer-Verlag, Berlin, Heidelberg (2017),
https://doi.org/10.1007/978-3-662-54580-5_5
16. Cain, H.W., Lipasti, M.H.: Verifying sequential consistency using vector clocks. In: Proceed-
ings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures.
p. 153–154. SPAA ’02, Association for Computing Machinery, New York, NY, USA (2002),
https://doi.org/10.1145/564870.564897
17. Chen, F., Roşu, G.: Parametric and sliced causality. In: Proceedings of the 19th International
Conference on Computer Aided Verification. p. 240–253. CAV’07, Springer-Verlag, Berlin,
Heidelberg (2007)
18. Chen, F., Roşu, G.: Java-MOP: A Monitoring Oriented Programming Environment for Java.
In: Tools and Algorithms for the Construction and Analysis of Systems. pp. 546–550. Lecture
Notes in Computer Science, Springer (Apr 2005)
19. Chen, F., Serbanuta, T.F., Rosu, G.: Jpredictor: A predictive runtime analysis tool for java.
In: Proceedings of the 30th International Conference on Software Engineering. p. 221–230.
ICSE ’08, Association for Computing Machinery, New York, NY, USA (2008), https://doi.
org/10.1145/1368088.1368119
20. Clarkson, M.R., Schneider, F.B.: Hyperproperties. J. Comput. Secur. 18(6), 1157–1210 (sep
2010)
21. Colombo, C., Pace, G.J., Schneider, G.: LARVA — Safer Monitoring of Real-Time Java
Programs (Tool Paper). In: Hung, D.V., Krishnan, P. (eds.) Seventh IEEE International Con-
ference on Software Engineering and Formal Methods, SEFM 2009, Hanoi, Vietnam, 23-27
November 2009. pp. 33–37. IEEE Computer Society (2009), https://doi.org/10.1109/SEFM.
2009.13
22. Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. Commun.
ACM 51(1), 107–113 (jan 2008), https://doi.org/10.1145/1327452.1327492
23. El-Hokayem, A., Falcone, Y.: Can we monitor all multithreaded programs? In: Colombo, C.,
Leucker, M. (eds.) Runtime Verification - 18th International Conference, RV 2018, Limas-
sol, Cyprus, November 10-13, 2018, Proceedings. Lecture Notes in Computer Science, vol.
11237, pp. 64–89. Springer (2018), https://doi.org/10.1007/978-3-030-03769-7_6
24. Falcone, Y., Havelund, K., Reger, G.: A tutorial on runtime verification. In: Broy, M., Peled,
D.A., Kalus, G. (eds.) Engineering Dependable Software Systems, NATO Science for Peace
and Security Series, D: Information and Communication Security, vol. 34, pp. 141–175. IOS
Press (2013)
25. Falcone, Y., Krstic, S., Reger, G., Traytel, D.: A taxonomy for classifying runtime verification
tools. In: Colombo, C., Leucker, M. (eds.) Runtime Verification - 18th International Confer-
ence, RV 2018, Limassol, Cyprus, November 10-13, 2018, Proceedings. Lecture Notes in
Computer Science, vol. 23, pp. 241–262. Springer (2018)
26. Falcone, Y., Nickovic, D., Reger, G., Thoma, D.: Second international competition on run-
time verification CRV 2015. In: Bartocci, E., Majumdar, R. (eds.) Runtime Verification - 6th
International Conference, RV 2015 Vienna, Austria, September 22-25, 2015. Proceedings.
Lecture Notes in Computer Science, vol. 9333, pp. 405–422. Springer (2015)
27. Farzan, A., Parthasarathy, M., Razavi, N., Sorrentino, F.: Predicting null-pointer dereferences
in concurrent programs. In: Proceedings of the ACM SIGSOFT 20th International Sympo-
sium on the Foundations of Software Engineering. FSE ’12, Association for Computing Ma-
chinery, New York, NY, USA (11 2012), https://doi.org/10.1145/2393596.2393651
28. Flanagan, C., Freund, S.N.: Atomizer: A dynamic atomicity checker for multithreaded pro-
grams. SIGPLAN Not. 39(1), 256–267 (jan 2004), https://doi.org/10.1145/982962.964023
29. Flanagan, C., Freund, S.N.: Fasttrack: Efficient and precise dynamic race detection. In: Pro-
ceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and
Implementation. p. 121–133. PLDI ’09, Association for Computing Machinery, New York,
NY, USA (2009), https://doi.org/10.1145/1542476.1542490
30. Formal Systems Laboratory: JavaMOP4 Syntax (2018),
http://fsl.cs.illinois.edu/index.php/JavaMOP4_Syntax
31. Gao, Q., Zhang, W., Chen, Z., Zheng, M., Qin, F.: 2ndstrike: Toward manifesting hidden
concurrency typestate bugs. In: Proceedings of the Sixteenth International Conference on
Architectural Support for Programming Languages and Operating Systems. ASPLOS XVI,
vol. 39, p. 239–250. Association for Computing Machinery, New York, NY, USA (mar 2011),
https://doi.org/10.1145/1950365.1950394
32. Gastin, P., Kuske, D.: Uniform satisfiability problem for local temporal logics over
Mazurkiewicz traces. Inf. Comput. 208(7), 797–816 (2010)
33. Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA, 1st edn. (1992)
34. Havelund, K., Goldberg, A.: Verify your runs. In: Meyer, B., Woodcock, J. (eds.) Verified
Software: Theories, Tools, Experiments, First IFIP TC 2/WG 2.3 Conference, VSTTE 2005,
Zurich, Switzerland, October 10-13, 2005, Revised Selected Papers and Discussions. Lecture
Notes in Computer Science, vol. 4171, pp. 374–383. Springer (2005)
35. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming, Revised Reprint. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edn. (2012)
36. Huang, J., Luo, Q., Rosu, G.: Gpredict: Generic predictive concurrency analysis. In: 37th
IEEE/ACM International Conference on Software Engineering, ICSE 2015, Volume 1. pp.
847–857 (2015)
37. Huang, J., Meredith, P.O., Rosu, G.: Maximal sound predictive race detection with con-
trol flow abstraction. SIGPLAN Not. 49(6), 337–348 (Jun 2014), https://doi.org/10.1145/
2594291.2594315
38. Joshi, P., Sen, K.: Predictive typestate checking of multithreaded java programs. In: Pro-
ceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software
Engineering. p. 288–296. ASE ’08, IEEE Computer Society, USA (2008), https://doi.org/10.
1109/ASE.2008.39
39. Lamport, L.: A new solution of dijkstra’s concurrent programming problem. Commun. ACM
17(8), 453–455 (aug 1974), https://doi.org/10.1145/361082.361093
40. Lamport, L.: Time, Clocks, and the Ordering of Events in a Distributed System. Commun.
ACM 21(7), 558–565 (Jul 1978), https://doi.org/10.1145/359545.359563
41. Lea, D.: A java fork/join framework. In: Proceedings of the ACM 2000 Java Grande Confer-
ence, San Francisco, CA, USA, June 3-5, 2000. pp. 36–43 (2000), https://doi.org/10.1145/
337449.337465
42. Leucker, M., Schallhart, C.: A brief account of runtime verification. The Journal of Logic
and Algebraic Programming 78(5), 293–303 (May 2009)
43. Lodaya, K., Weil, P.: Rationality in algebras with a series operation. Inf. Comput. 171(2),
269–293 (2001)
44. Luo, Q., Rosu, G.: Enforcemop: A runtime property enforcement system for multithreaded
programs. In: Proceedings of International Symposium in Software Testing and Analysis
(ISSTA’13). pp. 156–166. ACM (July 2013)
45. Manna, Z., Pnueli, A.: A hierarchy of temporal properties (invited paper, 1989). In: Pro-
ceedings of the Ninth Annual ACM Symposium on Principles of Distributed Computing. p.
377–410. PODC ’90, Association for Computing Machinery, New York, NY, USA (1990),
https://doi.org/10.1145/93385.93442
46. Manson, J., Pugh, W., Adve, S.V.: The Java Memory Model. In: Proceedings of the 32nd
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. pp. 378–
391. POPL ’05, ACM (2005)
47. Mathur, U., Viswanathan, M.: Atomicity Checking in Linear Time Using Vector Clocks, p.
183–199. Association for Computing Machinery, New York, NY, USA (2020), https://doi.
org/10.1145/3373376.3378475
48. Mazurkiewicz, A.W.: Trace theory. In: Brauer, W., Reisig, W., Rozenberg, G. (eds.) Petri
Nets: Central Models and Their Properties, Advances in Petri Nets 1986, Part II, Proceedings
of an Advanced Course, Bad Honnef, Germany, 8-19 September 1986. Lecture Notes in
Computer Science, vol. 255, pp. 279–324. Springer (1986)
49. Meenakshi, B., Ramanujam, R.: Reasoning about layered message passing systems. Com-
puter Languages, Systems & Structures 30(3-4), 171–206 (2004)
50. Nielsen, M., Plotkin, G.D., Winskel, G.: Petri nets, event structures and domains, part I.
Theor. Comput. Sci. 13, 85–108 (1981)
51. Reger, G., Cruz, H.C., Rydeheard, D.E.: MarQ: Monitoring at Runtime with QEA. In: Baier,
C., Tinelli, C. (eds.) Tools and Algorithms for the Construction and Analysis of Systems
- 21st International Conference, TACAS 2015, Held as Part of the European Joint Confer-
ences on Theory and Practice of Software, ETAPS 2015, London, UK, April 11-18, 2015.
Proceedings. Lecture Notes in Computer Science, vol. 9035, pp. 596–610. Springer (2015)
52. Reger, G., Hallé, S., Falcone, Y.: Third international competition on runtime verification
- CRV 2016. In: Falcone, Y., Sánchez, C. (eds.) Runtime Verification - 16th International
Conference, RV 2016, Madrid, Spain, September 23-30, 2016, Proceedings. Lecture Notes
in Computer Science, vol. 10012, pp. 21–37. Springer (2016)
53. Rosu, G., Sen, K.: An instrumentation technique for online analysis of multithreaded pro-
grams. In: 18th International Parallel and Distributed Processing Symposium, 2004. Pro-
ceedings. pp. 268– (2004)
54. Sen, K., Rosu, G., Agha, G.: Runtime safety analysis of multithreaded programs. SIGSOFT
Softw. Eng. Notes 28(5), 337–346 (Sep 2003), https://doi.org/10.1145/949952.940116
55. Serbanuta, T., Chen, F., Rosu, G.: Maximal causal models for sequentially consistent sys-
tems. In: Runtime Verification, Third International Conference, RV 2012, Istanbul, Turkey,
September 25-28, 2012, Revised Selected Papers. pp. 136–150 (2012),
https://doi.org/10.1007/978-3-642-35632-2_16
56. Soueidi, C., Falcone, Y.: Artifact Repository - Opportunistic Monitoring of Multithreaded
Programs (1 2023), https://doi.org/10.6084/m9.figshare.21828570
57. Wang, L., Stoller, S.: Runtime analysis of atomicity for multithreaded programs. IEEE Trans-
actions on Software Engineering 32(2), 93–110 (2006)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropri-
ate credit to the original author(s) and the source, provide a link to the Creative Commons license
and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder.
Parallel Program Analysis via Range Splitting
Jan Haltermann¹, Marie-Christine Jakobs², Cedric Richter¹, and Heike Wehrheim¹

¹ University of Oldenburg, Department of Computing Science, Oldenburg, Germany
{jan.haltermann,cedric.richter,heike.wehrheim}@uol.de
² Technical University of Darmstadt, Computer Science, Darmstadt, Germany
[email protected]
1 Introduction
sequential combinations, where one tool starts with the full task, stores its partial
analysis result within some verification artefact, and the next tool then works
on the remaining task. In contrast, parallel execution of different tools is in the
majority of cases only done by portfolio approaches, simply running the different
tools on the same task in parallel. One reason for using portfolios when employ-
ing parallel execution is the fact that it is unclear how to best split a program
into parts on which different tools could work separately while still being able
to join their partial results into one for the entire program.
With ranged symbolic execution, Siddiqui and Khurshid [86] proposed one
such technique for splitting programs into parts. The idea of ranged symbolic
execution is to scale symbolic execution by splitting path exploration onto several
workers, thereby, in particular allowing the workers to operate in parallel. To this
end, they defined so-called path ranges. A path range describes a set of program
paths defined by two inputs to the program, where the path π1 triggered by the
first input is the lower bound and the path π2 for the second input is the upper
bound of the range. All paths in between, i.e., paths π such that π1 ≤ π ≤ π2
(based on some ordering ≤ on paths), make up a range. A worker operating
on a range performs symbolic execution on paths of the range only. In their
experiments, Siddiqui and Khurshid investigated one form of splitting via path
ranges, namely by randomly generating inputs, which then make up a number
of ranges.
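While the ordering ≤ on paths is left abstract here, one natural instance orders paths lexicographically by their branch directions, with the true-branch smaller so that the left-most path is minimal; a minimal Java sketch of such an ordering (not necessarily the one used in [86]):

    import java.util.List;

    // Sketch: compare two program paths by their sequences of branch
    // directions, with the true-branch ordered before the false-branch.
    final class PathOrder {
        enum Branch { TRUE, FALSE }

        // negative if p1 < p2, zero if one is a prefix of the other of equal
        // length, positive if p1 > p2
        static int compare(List<Branch> p1, List<Branch> p2) {
            int n = Math.min(p1.size(), p2.size());
            for (int i = 0; i < n; i++) {
                if (p1.get(i) != p2.get(i)) {
                    return p1.get(i) == Branch.TRUE ? -1 : 1; // TRUE < FALSE
                }
            }
            return Integer.compare(p1.size(), p2.size());
        }
    }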
In this paper, we generalize ranged symbolic execution to arbitrary analyses.
In particular, we introduce the concept of a ranged analysis to execute an ar-
bitrary analysis on a given range and compose different ranged analyses, which
can then operate on different ranges in parallel. Also, we propose a novel split-
ting strategy, which generates ranges along loop bounds. We implemented ranged
analysis in the software verification tool CPAchecker [21], which already pro-
vides a number of analyses, all defined as configurable program analyses (CPAs).
To integrate ranged analysis in CPAchecker, we defined a new range reduction
CPA, and then employed the built-in feature of analysis composition to combine
it with different analyses. The thus obtained ranged analyses are then run on
different ranges in parallel, using CoVeriTeam [20] as the tool for orchestration. We
furthermore implemented two strategies for generating path ranges: our novel
strategy employing loop bounds for defining ranges plus the original random
splitting technique. A loop bound n splits program paths into ranges entering
the loop at most n times and ranges entering it more than n times³.
Our evaluation on SV-COMP benchmarks [36] first of all confirms the results
of Siddiqui and Khurshid [86] in that symbolic execution benefits from a ranged
execution. Second, our results show that a loop-bound based splitting strategy
brings an improvement over random splitting. Finally, we see that a composition
of ranged analyses can solve analysis tasks that none of the (different) constituent
analyses of a combination can solve alone.
³ Such splits can also be performed on intervals of loop bounds, thereby generating
more than two path ranges.
    int mid(int x, int y, int z)
    {
        if (x < y)
        {
            if (y < z) return y;
            else if (x < z) return z;
            else return x;
        }
        else if (x < z) return x;
        else if (y < z) return z;
        else return y;
    }

[CFA of mid: locations ℓ0–ℓ11, with one assume edge per evaluation of each condition
(x<y, y<z, x<z and their negations) and return edges leading to ℓ11.]

Fig. 1: Example program mid (taken from [86]) and its CFA
2 Background
For the sake of presentation, we consider simple, imperative programs with a deterministic
control-flow and with one sort of variables (from some set V) only⁴. Formally,
we model a program by a control-flow automaton (CFA) P = (L, ℓ0, G),
where L ⊆ Loc is a subset of the program locations Loc (the program counter
values), ℓ0 ∈ L represents the beginning of the program, and control-flow edges
G ⊆ L × Ops × L describe when which statements may be executed. Therein, the
set of statements Ops contains all possible statements, e.g., assume statements
(boolean expressions over variables V, denoted by BExpr), assignments, etc. We
expect that CFAs originate from program code and, thus, control-flow may only
branch at assume operations, i.e., CFAs P = (L, ℓ0, G) are deterministic in the
following sense. For all (ℓ, op′, ℓ′), (ℓ, op″, ℓ″) ∈ G, either op′ = op″ ∧ ℓ′ = ℓ″ or
op′, op″ are assume operations and op′ ≡ ¬(op″). We assume that there exists an
indicator function BP : G → {T, F, N} that reports the branch direction, either
N(one), T(rue), or F(alse). This indicator function assigns N to all edges without
assume operations, and for any two assume operations (ℓ, op′, ℓ′), (ℓ, op″, ℓ″) ∈ G
with op′ ≠ op″ it guarantees BP((ℓ, op′, ℓ′)) ∪ BP((ℓ, op″, ℓ″)) = {T, F}. Since
CFAs are typically derived from programs and assume operations correspond
to the two evaluations of conditions of, e.g., if or while statements, the assume
operation representing the true evaluation of the condition is typically assigned
T. We will later need this indicator function for defining path orderings.
Figure 1 shows our example program mid, which returns the middle value
of the three input values, and its CFA. For each condition of an if statement it
contains one assume edge for each evaluation of the condition, namely solid edges
labelled by the condition for entering the if branch after the condition evaluates
⁴ Our implementation supports C programs.
to true and dashed edges labelled by the negated condition for entering the else
branch after the condition evaluates to false, i.e., the negated condition evaluates
to true. All other statements are represented by a single edge.
We continue with the operational semantics of programs. A program state is
a pair (ℓ, c) of a program location ℓ ∈ L and a data state c from the set C of data
states that assign to each variable v ∈ V a value of the variable's domain.
Program execution paths π = (ℓ0, c0) −g1→ (ℓ1, c1) −g2→ . . . −gn→ (ℓn, cn) are sequences
of states and edges such that (1) they start at the beginning of the program
and (2) only perform valid execution steps that (a) adhere to the control-flow,
i.e., ∀1 ≤ i ≤ n : gi = (ℓi−1, ·, ℓi), and (b) properly describe the effect of the
operations, i.e., ∀1 ≤ i ≤ n : ci = sp_opi(ci−1), where the strongest postcondition
sp_opi : C ⇀ C is a partial function modeling the effect of operation opi ∈ Ops on
data states. Execution paths are also called feasible paths, and paths that fulfil
properties (1) and (2a) but violate property (2b) are called infeasible paths. The
set of all execution paths of a program P is denoted by paths(P).
Two example execution paths of mid are
πτ1 = (ℓ0, c1) −x<y→ (ℓ1, c1) −!(y<z)→ (ℓ4, c1) −x<z→ (ℓ7, c1) −return z→ (ℓ11, c1),
where c1 = [x ↦ 0, y ↦ 2, z ↦ 1], and
πτ2 = (ℓ0, c2) −!(x<y)→ (ℓ2, c2) −x<z→ (ℓ5, c2) −return x→ (ℓ11, c2),
where c2 = [x ↦ 1, y ↦ 0, z ↦ 2].
To run the configured analysis, one executes a meta reachability analysis, the
so-called CPA algorithm, configured with the CPA and provided with an initial value
einit ∈ E from which the analysis starts. For details on the CPA algorithm,
we refer the reader to [17].
As part of our ranged analysis, we use the abstract domain and transfer
relation of a CPA V for value analysis [9] (also known as constant propagation
or explicit analysis). An abstract state v of the value analysis ignores program
locations and maps each variable to either a concrete value of its domain
or ⊤, which represents any value. The partial order ⊑V and the join
operator ⊔V are defined variable-wise, ensuring that v ⊑V v′ ⇔ ∀x ∈
V : v(x) = v′(x) ∨ v′(x) = ⊤⁷ and (v ⊔V v′)(x) = v(x) if v(x) = v′(x) and
otherwise (v ⊔V v′)(x) = ⊤. The concretization of an abstract state v contains
all concrete data states that agree with v on every variable not mapped to ⊤.
⁷ Consequently, ∀x ∈ V : ⊤V(x) = ⊤.
[Figure: (a) R[π⊥,πτ2] for τ2 = {x : 1, y : 0, z : 2}; (b) R[πτ1,π⊤] for
τ1 = {x : 0, y : 2, z : 1}; (c) composition of range reductions R[π⊥,πτ2] × A1.]

Fig. 3: Application of range reduction on the running example of Fig. 1
Note that ⊤ represents the value-analysis state where no information on variables
is stored, and ⊥ represents an unreachable state in the value analysis, which stops
the exploration of the path. Hence, the second case ensures that R[πτ1,π⊤] also
visits the false-branch of a condition when the path induced by τ1 follows the
true-branch. Note that in case V computes ⊥ as a successor state for an
assumption g with BP(g) = T, the exploration of the path is stopped, as πτ1
follows the false-branch (contained in the third case).
Upper Bound CPA. For the CPA range reduction R[π⊥,πτ2], we again borrow
all components of the value analysis except for the transfer relation. The
transfer relation is defined as follows: (v, g, v′) ∈ τ2 iff
(1) v = ⊤ ∧ v′ = ⊤, or
(2) v ≠ ⊤ ∧ v′ = ⊤ ∧ BP(g) = T ∧ (v, g, ⊥) ∈ V, or
(3) v ≠ ⊤ ∧ (v′ ≠ ⊥ ∨ BP(g) ≠ T) ∧ (v, g, v′) ∈ V.
The second condition now ensures that R[π⊥,πτ2] also visits the true-branch of a
condition when πτ2 follows the false-branch.
⁸ Assuming that randomness is controlled through an input and hence the program is
deterministic.
4 Splitting
A crucial part of the ranged analysis is the generation of ranges, i.e., the splitting
of programs into parts that can be analysed in parallel. The splitting has to either
compute two paths or two test cases, both defining one range. Ranged symbolic
execution [86] employs a random strategy for range generation (together with
an online work-stealing concept to balance work among different workers). For
the work here, we have also implemented this random strategy, selecting random
paths in the execution tree to make up ranges. In addition, we propose a novel
strategy based on the number of loop unrollings. Both strategies are designed
to work “on-the-fly”, meaning that neither requires building the full execution tree
upfront; they rather only compute the paths or test cases that are used to fix a
range. Next, we explain both strategies in more detail, especially how they are
used to generate more than two ranges.
Bounding the Number of Loop Unrollings (Lb). Given a loop bound i ∈ N,
the splitting computes the left-most path in the program that contains exactly i
unrollings of the loop. If the program contains nested loops, each nested loop is
unrolled for i times in each iteration of the outer loop. For the computed path,
we (1) build its path formula using the strongest post-condition operator [46],
(2) query an SMT-solver for satisfiability and (3) in case of an answer SAT, use
the evaluation of the input variables in the path formula as one test case. In case
that the path formula is unsatisfiable, we iteratively remove the last statement
from the path, until a satisfying path formula is found. A test case τ determined
in this way defines two ranges, namely [π⊥, πτ] and [πτ, π⊤]. In case the
program is loop-free, the generation of a test case fails and we generate a single
range [π⊥, π⊤]. In the experiments, we used the loop bounds 3 (called Lb3) and
10 (called Lb10) with two ranges each. To compute more than two ranges, we
use intervals of loop bounds.
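A condensed Java sketch of this generate-and-shrink loop; SmtSolver and the formula representation are hypothetical stand-ins, not the CPAchecker API:

    import java.util.*;

    // Sketch of Lb test-case generation: take the left-most path with i loop
    // unrollings, then shrink it from the back until its path formula is SAT.
    interface SmtSolver {
        Optional<Map<String, Integer>> model(List<String> pathFormula); // empty iff UNSAT
    }

    final class LoopBoundSplitter {
        // 'path' holds the conjuncts of the strongest-postcondition formula of
        // the left-most path with exactly i unrollings, one conjunct per statement.
        static Optional<Map<String, Integer>> testCase(List<String> path,
                                                       SmtSolver solver) {
            List<String> candidate = new ArrayList<>(path);
            while (!candidate.isEmpty()) {
                Optional<Map<String, Integer>> model = solver.model(candidate);
                if (model.isPresent()) return model;    // input values of τ
                candidate.remove(candidate.size() - 1); // drop last statement, retry
            }
            return Optional.empty();                    // e.g. loop-free program
        }
    }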
Generating Ranges Randomly (Rdm). The second splitting strategy selects
the desired number of paths randomly. At each assume edge in the program
(either a loop head or an if statement), it follows either the true- or the false-branch
with a probability of 50%, until it reaches a node in the CFA without
successor. Again, we compute the path formula for that path and build a test
case. This purely random approach is called Rdm.

[Diagram: a range [τ1, τ2] and a task are passed to the range reduction, which produces
an updated task with a reduced program; an off-the-shelf program analysis tool then
computes the result res.]

Fig. 4: Construction of a ranged analysis from an off-the-shelf program analysis
Selecting the true- or the false-branch with the same probability may lead to
fairly short paths with few loop iterations, as the execution tree of a program
is often not balanced; it rather grows to the left (true-branches). Thus, we used
a second strategy based on random walks, which takes the true-branch with a
probability of 90%. We call this strategy Rdm9.
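A Java sketch of the random-walk strategies, with the CFA interface left hypothetical (CfaNode is our stand-in):

    import java.util.List;
    import java.util.Random;

    // Sketch of Rdm/Rdm9: a random walk over the CFA that takes the
    // true-branch with probability p (0.5 for Rdm, 0.9 for Rdm9).
    interface CfaNode {
        boolean isBranch();
        CfaNode trueSuccessor();
        CfaNode falseSuccessor();
        CfaNode onlySuccessor(); // for non-branching nodes; null at program exit
    }

    final class RandomSplitter {
        static void walk(CfaNode entry, double pTrue, Random rnd,
                         List<Boolean> directionsOut) {
            CfaNode node = entry;
            while (node != null) {
                if (node.isBranch()) {
                    boolean takeTrue = rnd.nextDouble() < pTrue;
                    directionsOut.add(takeTrue);
                    node = takeTrue ? node.trueSuccessor() : node.falseSuccessor();
                } else {
                    node = node.onlySuccessor();
                }
            }
            // the recorded directions identify the path whose formula is then
            // solved to obtain a test case, as for Lb
        }
    }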
5 Implementation
standardized XML-based TEST-Comp test case format⁹. In case the splitter
fails (e.g., Lb3 cannot compute a test case if the program does not contain a
loop), our implementation executes the analysis A1 on the interval [π⊥, π⊤]. For
the evaluation, we used combinations of three existing program analyses within
the ranged analysis, briefly introduced next.
Symbolic Execution. Symbolic execution [73] analyses program paths based
on symbolic inputs. Here, states are pairs of a symbolic store, which describes
variable values by formulae on the symbolic inputs, and a path condition, which
tracks the executability of the path. Operations update the symbolic store and
at branching points the path condition is extended by the symbolic evaluation
of the branching condition. Furthermore, the exploration of a path is stopped
when it reaches the program end or its path condition becomes unsatisfiable.
Predicate Analysis. We use CPAchecker’s standard predicate analysis, which
is configured to perform model checking and predicate abstraction with ad-
justable block encoding [22] such that it abstracts at loop heads only. The
required set of predicates is determined by counterexample-guided abstraction
refinement [35], lazy refinement [64], and interpolation [63].
Bounded Model Checking. We use iterative bounded model checking (BMC).
Each iteration inspects the behaviour of the CFA unrolled up to loop bound k
and increases the loop bound in case no property violation was detected. To
inspect the behaviour, BMC first encodes the unrolled CFA and the property in
a formula using the unified SMT-based approach for software verification [15].
Thereafter, it checks the satisfiability of the formula encoding to detect property
violations.
For the evaluation, we built four different basic configurations and employed
our different range splitters: Ra-2Se and Ra-3Se, which employ two resp. three
instances of symbolic execution in parallel, Ra-2bmc, employing two instances
of BMC, and Ra-Se-Pred, which uses symbolic execution for the range [π⊥, πτ]
and predicate analysis on [πτ, π⊤] for some computed test input τ.
6 Evaluation
Our setup differs from [86] in that we limit the available CPU time, meaning that both
analyses, the default analysis and the composition of ranged analyses, have the
same resources, and that we employ different analyses. Finally, we were interested
in evaluating our novel splitting strategy, in particular in comparison to
the existing random strategy. To this end, we studied the following research
questions:
All experiments were run on machines with an Intel Xeon E3-1230 v5 @ 3.40
GHz (8 cores), 33 GB of memory, and Ubuntu 20.04 LTS with Linux kernel 5.4.0.
We use BenchExec [23] for the execution of our experiments to increase the
reproducibility of the results. In a verification run, a tool configuration is given
a task (a program plus a specification) and either computes a proof (if the program
fulfils the specification) or raises an alarm (if the specification is violated by the
program). We limit each verification run to 15 GB of memory, 4 CPU cores, and
15 min of CPU time, yielding a setup that is comparable to the one used in
SV-Comp. The evaluation is conducted on a subset of the SV-Benchmarks used
in the SV-Comp and all experiments were conducted once. It contains in total
5 400 C-tasks from all sub-categories of the SV-Comp category reach-safety [36].
The specification for this category, and hence for these tasks, states that all calls
to the function reach_error are unreachable. Each task comes with a ground truth
indicating whether the task fulfils the specification (3 194
tasks) or not (2 206 tasks). All data collected is available in our supplementary
artefact [60].
[Scatter plot: CPU time of Ra-2Se-Lb3 vs. CPU time of SymbolicExec (s), and bar
charts of the median factor of time increase of Ra-2Se-Lb3 over SymbolicExec, (a) in
CPU time and (b) in wall time, grouped by the time SymbolicExec needs to solve the
tasks.]

Fig. 5: Scatter plot comparing SymbExec and Ra-2Se-Lb3

Fig. 6: Median factor of time increase for different configurations of Ra-2Se
CPU and wall time taken. Most importantly, the impact gets smaller the longer
the analyses need to compute the result (the factor is constantly decreasing). For
tasks that are solved by SymbExec in more than 50 CPU seconds, Ra-2Se-Lb3
is as fast as SymbExec; for tasks solved in more than 100 CPU seconds, it is
20% faster. As stated above, the CPU time consumed to compute a proof is
not affected by parallelization. Thus, when only looking at the time taken to
compute a proof, Ra-2Se-Lb3 takes as long as SymbExec after 50 CPU seconds.
In contrast, Ra-2Se-Lb3 is faster at finding alarms in that interval. A
more detailed analysis can be found in the artefact [60].
When comparing the wall time in Fig. 6b, the positive effect of the parallelization
employed in all configurations of a composition of ranged analyses becomes
visible. Ra-2Se-Lb3 is faster than SymbExec when SymbExec takes more
than 20 seconds in real time to solve the task. To emphasize the effect of the
parallelization, we used pre-computed ranges for Ra-2Se-Lb3. Now, Ra-2Se-
Lb3 takes only the 1.1-fold wall time in the median compared to SymbExec,
and is equally fast or faster for all tasks solved in more than ten seconds.
[Bar charts of the median factor of wall-time increase, grouped by the wall time the
base analysis needs to solve the tasks: (a) for Ra-2bmc-Rdm9 relative to Bmc, (b) for
Ra-Se-Pred-Lb3.]

Fig. 7: Median factor of time increase for different compositions of ranged analyses
The first composition uses two instances of BMC (Ra-2bmc); the second uses
symbolic execution on the range [π⊥, πτ] and predicate analysis on the range
[πτ, π⊤] (Ra-Se-Pred). We are again interested in effectiveness and efficiency.
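To make the general scheme concrete, the following Python sketch (an illustration of ours, not the implementation evaluated in this paper; the analysis callables, verdict strings, and range encoding are assumptions) runs one analysis per range in parallel and combines the verdicts:

# Illustrative sketch: compose two ranged analyses that split the program
# paths at the bound path pi_tau into [pi_bot, pi_tau] and [pi_tau, pi_top].
from concurrent.futures import ThreadPoolExecutor

def compose_ranged(analysis1, analysis2, task, pi_tau):
    # analysis*: callables (task, lower, upper) -> 'proof' | 'alarm' | 'unknown';
    # lower=None encodes pi_bot and upper=None encodes pi_top.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut1 = pool.submit(analysis1, task, None, pi_tau)   # range [pi_bot, pi_tau]
        fut2 = pool.submit(analysis2, task, pi_tau, None)   # range [pi_tau, pi_top]
        v1, v2 = fut1.result(), fut2.result()
    if "alarm" in (v1, v2):      # a violation in either range refutes the task
        return "alarm"
    if v1 == v2 == "proof":      # a proof requires both ranges to be safe
        return "proof"
    return "unknown"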
Results for BMC. The upper part of Tab. 2 contains the results for a composition
of ranged analyses using two instances of BMC. In contrast to Ra-2Se,
Ra-2bmc does not increase the number of overall correct verdicts compared to
Bmc. Ra-2bmc-Rdm9 computes 48 correct verdicts that are not computed by
Bmc, but it also fails to compute the correct verdict in 77 cases solved by Bmc. Both
observations can mainly be explained by the fact that one analysis computes a
result for a task where the other runs into a timeout. Again, we observe that the
composition of ranged analyses computes additional alarms (here 36), as both
ranged analyses search in different parts of the program.
When comparing the efficiency, we notice that the CPU time consumed to
compute a result with Ra-2bmc-Rdm9 (and all other instances) is higher than
with Bmc. On average, the increase is 2.6-fold, the median 2.5-fold, whereas the
median increase for tasks solved by Bmc in more than 100 CPU seconds is 1.1. For wall
time, for which we depict the increases in Fig. 7a, the median overall increase is
1.9. This high overall increase is caused by the fact that Bmc can solve nearly
65% of all tasks within ten seconds of wall time; thus, the overhead of computing the
splitting has a big impact on the factor. For more complex or larger instances,
where Bmc uses more time, the wall time of Ra-2bmc-Rdm9 is comparable; for
instances taking more than 100 seconds, both take approximately the same time.
7 Related Work
Numerous approaches combine different verification techniques. Selective com-
binations [6,40,45,51,72,83,92] consider certain features of a task to choose the
best approach for that task. Nesting approaches [3,4,25,26,30,32,49,82,84] use
one or more approaches as components in a main approach. Interleaved ap-
proaches [1,2,5,10,42,50,55,58,62,68,75,78,90,97] alternate between different ap-
proaches that may or may not exchange information. Testification approaches
[28,29,39,43,52,74,81] often sequentially combine a verification and a validation
approach and prioritize or only report confirmed proofs and alarms. Sequen-
tial portfolio approaches [44,61] run distinct, independent analyses in sequence
while parallel portfolio approaches [91,12,57,65,66,96] execute various, indepen-
dent analyses in parallel. Parallel white-box combinations [7,9,37,38,54,56,59,79]
run different approaches in parallel, which exchange information for the purpose
of collaboration. Next, we discuss cooperation approaches that split the search
space as we do.
A common strategy for dividing the search space in sequential or interleaved
combinations is to restrict the subsequent verifiers to the yet uncovered search
space, e.g., not yet covered test goals [12], open proof obligations [67], or yet
unexplored program paths [8,10,19,31,33,41,42,47,53,71]. Some parallel combi-
nations like CoDiDroid [80], distributed assertion checking [93], or the compo-
sitional tester sketched in conditional testing [12] decompose the verification
statically into separate subtasks. Furthermore, some techniques split the search
space to run different instances of the same analysis in parallel on different parts
of the program. For example, conditional static analysis [85] characterizes paths
based on their executed program branches and uses sets of program branches to
describe the split. Concurrent bounded model checking techniques [69,77] split
paths based on their thread interleavings. Yin et al. [95] dynamically split the
input space if the abstract interpreter returns an inconclusive result and analyze
the input partitions separately with the abstract interpreter. To realize parallel
test-case generation, Korat [76] considers different input ranges in distinct par-
allel instances. Parallel symbolic execution approaches [82,86,87,88,89,94] and
ranged model checking [48] split execution paths, thereby often partitioning the
execution tree. The sets of paths are characterized by input constraints [89], path
prefixes [87,88], or ranges [82,86,94,48], and are either created statically from an
initial shallow symbolic execution [87,88,89] or from tests [82,86,94], or dynamically
based on the already explored symbolic execution tree [27,34,82,86,98]. While
we reuse the idea of splitting the program paths into ranges [82,86,94,48], we
generalize the idea of ranged symbolic execution [82,86,94] to arbitrary analyses
and in particular allow combining different analyses. Furthermore, we introduce
a new static splitting strategy along loop bounds.
8 Conclusion
Data Availability Statement. All experimental data and our open source
implementation are archived and available in our supplementary artefact [60].
References
1. Albarghouthi, A., Gurfinkel, A., Chechik, M.: From under-approximations to over-
approximations and back. In: Proc. TACAS. pp. 157–172. LNCS 7214, Springer
(2012). https://doi.org/10.1007/978-3-642-28756-5_12
2. Avgerinos, T., Rebert, A., Cha, S.K., Brumley, D.: Enhancing symbolic execution
with veritesting. In: Proc. ICSE. pp. 1083–1094. ACM (2014), https://doi.org/10.
1145/2568225.2568293
3. Baars, A.I., Harman, M., Hassoun, Y., Lakhotia, K., McMinn, P., Tonella, P., Vos,
T.E.J.: Symbolic search-based testing. In: Proc. ASE. pp. 53–62. IEEE (2011).
https://doi.org/10.1109/ASE.2011.6100119
4. Baluda, M.: EvoSE: Evolutionary symbolic execution. In: Proc. A-TEST. pp. 16–
19. ACM (2015), https://doi.org/10.1145/2804322.2804325
5. Beckman, N., Nori, A.V., Rajamani, S.K., Simmons, R.J.: Proofs from tests. In:
Proc. ISSTA. pp. 3–14. ACM (2008). https://doi.org/10.1145/1390630.1390634
6. Beyer, D., Dangl, M.: Strategy selection for software verification based on
boolean features: A simple but effective approach. In: Proc. ISoLA. pp. 144–159.
LNCS 11245, Springer (2018). https://doi.org/10.1007/978-3-030-03421-4_11
7. Beyer, D., Dangl, M., Wendler, P.: Boosting k-induction with continuously-refined
invariants. In: Proc. CAV. pp. 622–640. LNCS 9206, Springer (2015). https://doi.
org/10.1007/978-3-319-21690-4_42
8. Beyer, D., Henzinger, T.A., Keremoglu, M.E., Wendler, P.: Conditional model
checking: A technique to pass information between verifiers. In: Proc. FSE. ACM
(2012). https://doi.org/10.1145/2393596.2393664
9. Beyer, D., Henzinger, T.A., Théoduloz, G.: Program analysis with dynamic preci-
sion adjustment. In: Proc. ASE. pp. 29–38. IEEE (2008). https://doi.org/10.1109/
ASE.2008.13
10. Beyer, D., Jakobs, M.: CoVeriTest: Cooperative verifier-based testing. In:
Proc. FASE. pp. 389–408. LNCS 11424, Springer (2019). https://doi.org/10.1007/
978-3-030-16722-6_23
11. Beyer, D., Jakobs, M., Lemberger, T., Wehrheim, H.: Reducer-based construction
of conditional verifiers. In: Proc. ICSE. pp. 1182–1193. ACM (2018), https://doi.
org/10.1145/3180155.3180259
12. Beyer, D., Lemberger, T.: Conditional testing: Off-the-shelf combination of test-
case generators. In: Proc. ATVA. pp. 189–208. LNCS 11781, Springer (2019). https:
//doi.org/10.1007/978-3-030-31784-3_11
13. Beyer, D.: Progress on software verification: SV-COMP 2022. In: TACAS. Lecture
Notes in Computer Science, vol. 13244, pp. 375–402. Springer (2022). https://doi.
org/10.1007/978-3-030-99527-0_20
14. Beyer, D., Dangl, M., Dietsch, D., Heizmann, M.: Correctness witnesses: exchang-
ing verification results between verifiers. In: Proc. FSE. pp. 326–337. ACM (2016),
https://doi.org/10.1145/2950290.2950351
15. Beyer, D., Dangl, M., Wendler, P.: A unifying view on SMT-based software ver-
ification. J. Autom. Reasoning 60(3), 299–335 (2018), https://doi.org/10.1007/
s10817-017-9432-6
16. Beyer, D., Haltermann, J., Lemberger, T., Wehrheim, H.: Decomposing software
verification into off-the-shelf components: An application to CEGAR. In: Proc.
ICSE. ACM (2022). https://doi.org/10.1145/3510003.351006
17. Beyer, D., Henzinger, T.A., Théoduloz, G.: Configurable software verification:
Concretizing the convergence of model checking and program analysis. In:
34. Ciortea, L., Zamfir, C., Bucur, S., Chipounov, V., Candea, G.: Cloud9: A software
testing service. OSR 43(4), 5–10 (2009), https://doi.org/10.1145/1713254.1713257
35. Clarke, E.M., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided
abstraction refinement. In: Proc. CAV. pp. 154–169. LNCS 1855, Springer (2000),
https://doi.org/10.1007/10722167_15
36. SV-Benchmarks Community: SV-Benchmarks (2022), https://gitlab.com/
sosy-lab/benchmarking/sv-benchmarks/-/tree/svcomp22
37. Cousot, P., Cousot, R.: Systematic design of program-analysis frameworks. In:
Proc. POPL. pp. 269–282. ACM (1979). https://doi.org/10.1145/567752.567778
38. Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Ri-
val, X.: Combination of abstractions in the astrée static analyzer. In: Proc.
ASIAN’06. pp. 272–300. LNCS 4435, Springer (2008). https://doi.org/10.1007/
978-3-540-77505-8_23
39. Csallner, C., Smaragdakis, Y.: Check ’n’ crash: Combining static checking and test-
ing. In: Proc. ICSE. pp. 422–431. ACM (2005). https://doi.org/10.1145/1062455.
1062533
40. Czech, M., Hüllermeier, E., Jakobs, M., Wehrheim, H.: Predicting rankings of
software verification tools. In: Proc. SWAN. pp. 23–26. ACM (2017). https://doi.
org/10.1145/3121257.3121262
41. Czech, M., Jakobs, M., Wehrheim, H.: Just test what you cannot verify! In: Proc.
FASE. LNCS, vol. 9033, pp. 100–114. Springer (2015). https://doi.org/10.1007/
978-3-662-46675-9_7
42. Daca, P., Gupta, A., Henzinger, T.A.: Abstraction-driven concolic testing. In:
Proc. VMCAI. pp. 328–347. LNCS 9583, Springer (2016). https://doi.org/10.1007/
978-3-662-49122-5_16
43. Dams, D., Namjoshi, K.S.: Orion: High-precision methods for static error analysis
of C and C++ programs. In: Proc. FMCO. pp. 138–160. LNCS 4111, Springer
(2005). https://doi.org/10.1007/11804192_7
44. Dangl, M., Löwe, S., Wendler, P.: CPAchecker with support for recursive pro-
grams and floating-point arithmetic - (competition contribution). In: Proc.
TACAS. pp. 423–425. LNCS 9035, Springer (2015), https://doi.org/10.1007/
978-3-662-46681-0_34
45. Demyanova, Y., Pani, T., Veith, H., Zuleger, F.: Empirical software metrics for
benchmarking of verification tools. In: Proc. CAV. pp. 561–579. LNCS 9206,
Springer (2015). https://doi.org/10.1007/978-3-319-21690-4_39
46. Dijkstra, E.W., Scholten, C.S.: Predicate Calculus and Program Semantics. Texts
and Monographs in Computer Science, Springer (1990). https://doi.org/10.1007/
978-1-4612-3228-5
47. Ferles, K., Wüstholz, V., Christakis, M., Dillig, I.: Failure-directed program trim-
ming. In: Proc. ESEC/FSE. pp. 174–185. ACM (2017), http://doi.acm.org/10.
1145/3106237.3106249
48. Funes, D., Siddiqui, J.H., Khurshid, S.: Ranged model checking. ACM SIGSOFT
Softw. Eng. Notes 37(6), 1–5 (2012), https://doi.org/10.1145/2382756.2382799
49. Galeotti, J.P., Fraser, G., Arcuri, A.: Improving search-based test suite generation
with dynamic symbolic execution. In: Proc. ISSRE. pp. 360–369. IEEE (2013),
https://doi.org/10.1109/ISSRE.2013.6698889
50. Gao, M., He, L., Majumdar, R., Wang, Z.: LLSPLAT: improving concolic testing
by bounded model checking. In: Proc. SCAM. pp. 127–136. IEEE (2016), https:
//doi.org/10.1109/SCAM.2016.26
51. Gargantini, A., Vavassori, P.: Using decision trees to aid algorithm selection in
combinatorial interaction tests generation. In: Proc. ICST. pp. 1–10. IEEE (2015),
https://doi.org/10.1109/ICSTW.2015.7107442
52. Ge, X., Taneja, K., Xie, T., Tillmann, N.: Dyta: Dynamic symbolic execution
guided with static verification results. In: Proc. ICSE. pp. 992–994. ACM (2011).
https://doi.org/10.1145/1985793.1985971
53. Gerrard, M.J., Dwyer, M.B.: ALPACA: a large portfolio-based alternating condi-
tional analysis. In: Proc. ICSE. pp. 35–38. IEEE / ACM (2019), https://doi.org/
10.1109/ICSE-Companion.2019.00032
54. Godefroid, P., Klarlund, N., Sen, K.: Dart: Directed automated random test-
ing. In: Proc. PLDI. pp. 213–223. ACM (2005), https://doi.org/10.1145/1065010.
1065036
55. Godefroid, P., Nori, A.V., Rajamani, S.K., Tetali, S.: Compositional may-must
program analysis: Unleashing the power of alternation. In: Proc. POPL. pp. 43–
56. ACM (2010). https://doi.org/10.1145/1706299.1706307, http://doi.acm.org/
10.1145/1706299.1706307
56. Godefroid, P., Levin, M.Y., Molnar, D.A.: Automated whitebox fuzz testing. In:
Proc. NDSS. The Internet Society (2008), http://www.isoc.org/isoc/conferences/
ndss/08/papers/10_automated_whitebox_fuzz.pdf
57. Groce, A., Zhang, C., Eide, E., Chen, Y., Regehr, J.: Swarm testing. In: Proc.
ISSTA. pp. 78–88. ACM (2012), https://doi.org/10.1145/2338965.2336763
58. Gulavani, B.S., Henzinger, T.A., Kannan, Y., Nori, A.V., Rajamani, S.K.: Syn-
ergy: A new algorithm for property checking. In: Proc. FSE. pp. 117–127. ACM
(2006). https://doi.org/10.1145/1181775.1181790
59. Haltermann, J., Wehrheim, H.: CoVEGI: Cooperative Verification via Externally
Generated Invariants. In: Proc. FASE. pp. 108–129. LNCS 12649, Springer (2021),
https://doi.org/10.1007/978-3-030-71500-7_6
60. Haltermann, J., Jakobs, M., Richter, C., Wehrheim, H.: Replication package for
article ’Parallel Program Analysis via Range Splitting’ (Jan 2023). https://doi.
org/10.5281/zenodo.7189816
61. Heizmann, M., Chen, Y., Dietsch, D., Greitschus, M., Hoenicke, J., Li, Y., Nutz,
A., Musa, B., Schilling, C., Schindler, T., Podelski, A.: Ultimate automizer
and the search for perfect interpolants - (competition contribution). In: Proc.
TACAS. pp. 447–451. LNCS 10806, Springer (2018), https://doi.org/10.1007/
978-3-319-89963-3_30
62. Helm, D., Kübler, F., Reif, M., Eichberg, M., Mezini, M.: Modular collaborative
program analysis in OPAL. In: Proc. FSE. pp. 184–196. ACM (2020), https://doi.
org/10.1145/3368089.3409765
63. Henzinger, T.A., Jhala, R., Majumdar, R., McMillan, K.L.: Abstractions from
proofs. In: Proc. POPL. pp. 232–244. ACM (2004), https://doi.org/10.1145/
964001.964021
64. Henzinger, T.A., Jhala, R., Majumdar, R., Sutre, G.: Lazy abstraction. In: Proc.
POPL. pp. 58–70. ACM (2002), https://doi.org/10.1145/503272.503279
65. Holík, L., Kotoun, M., Peringer, P., Soková, V., Trtík, M., Vojnar, T.: Predator
shape analysis tool suite. In: Proc. HVC. pp. 202–209. LNCS 10028 (2016), https:
//doi.org/10.1007/978-3-319-49052-6_13
66. Holzmann, G.J., Joshi, R., Groce, A.: Swarm verification. In: Proc. ASE. pp. 1–6.
IEEE (2008). https://doi.org/10.1109/ASE.2008.9
67. Huster, S., Ströbele, J., Ruf, J., Kropf, T., Rosenstiel, W.: Using robustness testing
to handle incomplete verification results when combining verification and testing
techniques. In: Proc. ICTSS. pp. 54–70. LNCS 10533, Springer (2017), https://doi.
org/10.1007/978-3-319-67549-7_4
68. Inkumsah, K., Xie, T.: Improving structural testing of object-oriented programs
via integrating evolutionary testing and symbolic execution. In: Proc. ASE. pp.
297–306. IEEE (2008), https://doi.org/10.1109/ASE.2008.40
69. Inverso, O., Trubiani, C.: Parallel and distributed bounded model checking of
multi-threaded programs. In: Proc. PPoPP. pp. 202–216. ACM (2020), https:
//doi.org/10.1145/3332466.3374529
70. Jakobs, M.: PartPW: From partial analysis results to a proof witness. In:
Proc. SEFM. pp. 120–135. LNCS 10469, Springer (2017), https://doi.org/10.1007/
978-3-319-66197-1_8
71. Jalote, P., Vangala, V., Singh, T., Jain, P.: Program partitioning: A frame-
work for combining static and dynamic analysis. In: Proc. WODA. pp. 11–
16. ACM (2006). https://doi.org/10.1145/1138912.1138916, http://doi.acm.org/
10.1145/1138912.1138916
72. Jia, Y., Cohen, M.B., Harman, M., Petke, J.: Learning combinatorial interaction
test generation strategies using hyperheuristic search. In: Proc. ICSE. pp. 540–550.
IEEE (2015), https://doi.org/10.1109/ICSE.2015.71
73. King, J.C.: Symbolic execution and program testing. Commun. ACM 19(7), 385–
394 (1976), https://doi.org/10.1145/360248.360252
74. Li, K., Reichenbach, C., Csallner, C., Smaragdakis, Y.: Residual investigation:
Predictive and precise bug detection. In: Proc. ISSTA. pp. 298–308. ACM (2012).
https://doi.org/10.1145/2338965.2336789
75. Majumdar, R., Sen, K.: Hybrid concolic testing. In: Proc. ICSE. pp. 416–426. IEEE
(2007), https://doi.org/10.1109/ICSE.2007.41
76. Misailovic, S., Milicevic, A., Petrovic, N., Khurshid, S., Marinov, D.: Parallel test
generation and execution with Korat. In: Proc. ESEC/FSE. pp. 135–144. ACM
(2007), https://doi.org/10.1145/1287624.1287645
77. Nguyen, T.L., Schrammel, P., Fischer, B., La Torre, S., Parlato, G.: Parallel bug-
finding in concurrent programs via reduced interleaving instances. In: Proc. ASE.
pp. 753–764. IEEE (2017). https://doi.org/10.1109/ASE.2017.8115686
78. Noller, Y., Kersten, R., Pasareanu, C.S.: Badger: Complexity analysis with fuzzing
and symbolic execution. In: Proc. ISSTA. pp. 322–332. ACM (2018), http://doi.
acm.org/10.1145/3213846.3213868
79. Noller, Y., Pasareanu, C.S., Böhme, M., Sun, Y., Nguyen, H.L., Grunske, L.: Hydiff:
Hybrid differential software analysis. In: Proc. ICSE. pp. 1273–1285. ACM (2020),
https://doi.org/10.1145/3377811.3380363
80. Pauck, F., Wehrheim, H.: Together strong: Cooperative android app analysis. In:
Proc. ESEC/FSE. pp. 374–384. ACM (2019), https://doi.org/10.1145/3338906.
3338915
81. Post, H., Sinz, C., Kaiser, A., Gorges, T.: Reducing false positives by combining
abstract interpretation and bounded model checking. In: Proc. ASE. pp. 188–197.
IEEE (2008). https://doi.org/10.1109/ASE.2008.29
82. Qiu, R., Khurshid, S., Pasareanu, C.S., Wen, J., Yang, G.: Using test ranges to
improve symbolic execution. In: Proc. NFM. pp. 416–434. LNCS 10811, Springer
(2018), https://doi.org/10.1007/978-3-319-77935-5_28
83. Richter, C., Hüllermeier, E., Jakobs, M., Wehrheim, H.: Algorithm selection for
software validation based on graph kernels. JASE 27(1), 153–186 (2020), https:
//doi.org/10.1007/s10515-020-00270-x
84. Sakti, A., Guéhéneuc, Y., Pesant, G.: Boosting search based testing by using con-
straint based testing. In: Proc. SSBSE. pp. 213–227. LNCS 7515, Springer (2012).
https://doi.org/10.1007/978-3-642-33119-0_16
85. Sherman, E., Dwyer, M.B.: Structurally defined conditional data-flow static anal-
ysis. In: Proc. TACAS. pp. 249–265. LNCS 10806, Springer (2018), https://doi.
org/10.1007/978-3-319-89963-3_15
86. Siddiqui, J.H., Khurshid, S.: Scaling symbolic execution using ranged analysis.
In: Proc. SPLASH. pp. 523–536. ACM (2012), https://doi.org/10.1145/2384616.
2384654
87. Singh, S., Khurshid, S.: Parallel chopped symbolic execution. In: Proc.
ICFEM. pp. 107–125. LNCS 12531, Springer (2020), https://doi.org/10.1007/
978-3-030-63406-3_7
88. Singh, S., Khurshid, S.: Distributed symbolic execution using test-depth partition-
ing. CoRR abs/2106.02179 (2021), https://arxiv.org/abs/2106.02179
89. Staats, M., Pasareanu, C.S.: Parallel symbolic execution for structural test genera-
tion. In: Proc. ISSTA. pp. 183–194. ACM (2010), https://doi.org/10.1145/1831708.
1831732
90. Stephens, N., Grosen, J., Salls, C., Dutcher, A., Wang, R., Corbetta,
J., Shoshitaishvili, Y., Kruegel, C., Vigna, G.: Driller: Augmenting fuzzing
through selective symbolic execution. In: Proc. NDSS. The Internet So-
ciety (2016), http://wp.internetsociety.org/ndss/wp-content/uploads/sites/25/
2017/09/driller-augmenting-fuzzing-through-selective-symbolic-execution.pdf
91. Tschannen, J., Furia, C.A., Nordio, M., Meyer, B.: Usable verification of object-
oriented programs by combining static and dynamic techniques. In: Proc.
SEFM. pp. 382–398. LNCS 7041, Springer (2011), https://doi.org/10.1007/
978-3-642-24690-6_26
92. Tulsian, V., Kanade, A., Kumar, R., Lal, A., Nori, A.V.: MUX: Algorithm selection
for software model checkers. In: Proc. MSR. p. 132–141. ACM (2014), https://doi.
org/10.1145/2597073.2597080
93. Yang, G., Do, Q.C.D., Wen, J.: Distributed assertion checking using symbolic exe-
cution. ACM SIGSOFT Softw. Eng. Notes 40(6), 1–5 (2015), https://doi.org/10.
1145/2830719.2830729
94. Yang, G., Qiu, R., Khurshid, S., Pasareanu, C.S., Wen, J.: A synergistic approach
to improving symbolic execution using test ranges. Innov. Syst. Softw. Eng. 15(3-
4), 325–342 (2019). https://doi.org/10.1007/s11334-019-00331-9
95. Yin, B., Chen, L., Liu, J., Wang, J., Cousot, P.: Verifying numerical programs
via iterative abstract testing. In: Proc. SAS. pp. 247–267. LNCS 11822, Springer
(2019), https://doi.org/10.1007/978-3-030-32304-2_13
96. Yin, L., Dong, W., Liu, W., Wang, J.: Parallel refinement for multi-threaded pro-
gram verification. In: Proc. ICSE. pp. 643–653. IEEE (2019), https://doi.org/10.
1109/ICSE.2019.00074
97. Yorsh, G., Ball, T., Sagiv, M.: Testing, abstraction, theorem proving: Better to-
gether! In: Proc. ISSTA. pp. 145–156. ACM (2006). https://doi.org/10.1145/
1146238.1146255
98. Zhou, L., Gan, S., Qin, X., Han, W.: Secloud: Binary analyzing using symbolic
execution in the cloud. In: Proc. CBD. pp. 58–63. IEEE (2013), https://doi.org/
10.1109/CBD.2013.31
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Runtime Enforcement Using Knowledge Bases
Eduard Kamburjan¹ and Crystal Chang Din²
¹ University of Oslo, Oslo, Norway
[email protected]
² University of Bergen, Bergen, Norway
[email protected]
1 Introduction
Knowledge bases (KBs) are logic-based representations of both data and do-
main knowledge, for which there exists a rich toolset to query data and reason
about data semantically, i.e., in terms of the domain knowledge. This enables
domain users to interact with modern IT systems [39] without being exposed to
implementation details, as well as to make their domain knowledge available for
software applications. KBs are the foundation of many modern innovation drivers
and key technologies: applications range from Digital Twin engineering [31],
through industry standards in robotics [23], to expert systems, e.g., in medicine [38].
The success story of KBs, however, is so far based on the use of domain
knowledge about static data. The connection to transition systems and pro-
grams beyond Prolog-style logic programming has just begun to be explored.
This is mainly triggered by tool support for developing applications that use
KBs [7,13,28] in a type-safe way [29,32].
In this work, we investigate how one can use domain knowledge about dy-
namic processes and formalize knowledge about the order of computations to
be performed. More concretely, we describe a runtime enforcement technique to
use domain knowledge to guide the selection of rules in a transition system, for
2 Preliminaries
We give some technical preliminaries for knowledge bases as well as transition
systems, as far as they are needed for our runtime enforcement technique.
Definition 1 (Domain Knowledge of Dynamic Processes). Domain knowl-
edge of dynamic processes (DKDP) is the knowledge about events and changes.
Example 1 (DKDP in Geology). DKDP describes knowledge about some temporal
properties in a domain. In geology, for example, this may be the knowledge
that a deposition of some geological layers in the Cretaceous should happen after a
deposition in the Jurassic, because the Cretaceous is after the Jurassic. This can be
deduced from, e.g., fossils found in the layers.
A description logic (DL) is a decidable fragment of first-order logic with
suitable expressive power for knowledge representation [3]. We do not commit to
any specific DL here, but require that for the chosen DL it is decidable to check
consistency of a KB, which we define next. A knowledge base is a collection of DL
axioms over individuals (corresponding to first-order logic constants), concepts,
also called classes (corresponding to first-order logic unary predicates), and roles,
also called properties (corresponding to first-order logic binary predicates).
Definition 2 (Knowledge Base). A knowledge base (KB) K = (R, T , A) is a
triple of three sets of DL axioms, where the ABox A contains assertions over in-
dividuals, the TBox T contains axioms over concepts, and the RBox R contains
axioms over roles. A KB is consistent if no contradiction follows from it.
KBs can be seen as first-order logic theories, so we refrain from introducing
them fully formally and introduce them by examples throughout the article. The
Manchester syntax [25] is used for DL formulas in examples to emphasize that
they model knowledge, but we treat them as first-order logic formulas otherwise.
Rewrite rules map one term to another by unifying a subterm with the head
term. The matched subterm is then rewritten by applying the substitution to
the body term. Normally, one would have additional conditions on the transition
rules, but these are not necessary to present semantical guiding.
Here, r is the name of the rule, and t_head, t_body ∈ T are the head and body terms.
A rule matches on a term t with Σ(t) = ∅ if there is a subterm t_s of t
such that t_head σ = t_s for a suitable substitution σ. A rule produces a term t′
by matching on subterm t_s with substitution σ and generating t′ by replacing
t_s in t by t_body σ′, where σ′ is equal to σ for all v ∈ Σ(t_body) ∩ Σ(t_head) and maps
v ∈ Σ(t_body) \ Σ(t_head) to fresh term symbols. For production, we write

t −(r,σ′)→ t′
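To make matching and production concrete, here is a small Python sketch under a term encoding of our own (a term is a (symbol, children) tuple; the variable set and helper names are assumptions, not this paper's implementation):

# Toy matching and instantiation for ground-term rewriting.
from itertools import count

VARS = {"x", "y", "v"}        # assumed rule variables
fresh = count()

def match(pattern, term, subst):
    # Extend subst so that pattern instantiated by subst equals term; None on failure.
    sym, kids = pattern
    if sym in VARS:
        if sym in subst:
            return subst if subst[sym] == term else None
        return {**subst, sym: term}
    tsym, tkids = term
    if sym != tsym or len(kids) != len(tkids):
        return None
    for p, t in zip(kids, tkids):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst

def instantiate(body, subst):
    # Apply sigma', inventing fresh constants for variables unbound by sigma.
    sym, kids = body
    if sym in VARS:
        return subst.setdefault(sym, ("c%d" % next(fresh), ()))
    return (sym, tuple(instantiate(k, subst) for k in kids))

# Production then replaces the matched subterm t_s in t by instantiate(t_body, sigma).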
Example 3. Let us assume four rules: a rule deposit that deposits a layer without
fossils, a rule depositStego that deposits a layer with Stegosaurus fossils, an
analogous rule depositTRex that deposits a layer with Tyrannosaurus fossils,
and a rule erode that removes the top layer of the deposition. One example
reduction sequence, for some terms ti and with substitutions omitted, is as follows:

t0 −(depositStego)→ t1 −(erode)→ t2 −(depositTRex)→ t3
This would violate the domain knowledge, as we can derive a situation where a
layer with Tyrannosaurus fossils is below a layer with Stegosaurus fossils, implying
that the Cretaceous is before the Jurassic. This contradiction is captured by
the knowledge base in Fig. 1. The domain knowledge DKDP should prevent this
rule application at t3 from happening. To achieve this, i.e., to enforce domain knowledge
at runtime, we must connect the trace with the KB. Specifically, we represent
the trace as a KB itself: instead of operating on a KB, we record the events
and generate a KB from the trace using a mapping.
For example, consider the left KB in Fig. 1. The upper part is (a part of)
our DKDP about geological ages, while the lower part is the KB mapped from
the trace. Together they form a KB. In the knowledge base of this example, we
add one layer that contains Stegosaurus fossils for each depositStego event and
analogously for depositTRex events. We also add the below relation between
two layers if their events are ordered. So, if we were to execute depositStego
after depositTRex, there would be two layers in the KB as shown in Fig. 1, with
corresponding fossils, connected using the below relation. On the right, the KB
is shown with the additional knowledge following from its axioms. In particular,
we can deduce that layer2 must be below layer1 using the axioms from Sec. 2.
This, in turn, makes the overall KB inconsistent, as below must be asymmetric.
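To illustrate, the following self-contained Python sketch (a toy encoding of ours, not the paper's OWL-based setup; the names FOSSIL_AGE, AGE_BEFORE, and the helper functions are assumptions) maps deposition events to layers with below assertions, derives further below facts from fossil ages, and detects the loss of asymmetry:

from itertools import count

FOSSIL_AGE = {"Stegosaurus": "Jurassic", "Tyrannosaurus": "Cretaceous"}
AGE_BEFORE = {("Jurassic", "Cretaceous")}       # Jurassic is before Cretaceous

def kb_from_trace(trace):
    # Map deposition events to layer individuals with 'below' assertions.
    ids, layers, below = count(), [], set()
    for rule in trace:
        if rule.startswith("deposit"):
            layer = "layer%d" % next(ids)
            fossil = {"depositStego": "Stegosaurus",
                      "depositTRex": "Tyrannosaurus"}.get(rule)
            for earlier, _ in layers:           # earlier depositions lie below
                below.add((earlier, layer))
            layers.append((layer, fossil))
    return dict(layers), below

def consistent(layers, below):
    # Domain axiom: fossil ages order layers; 'below' must stay asymmetric.
    derived = set(below)
    for a, fa in layers.items():
        for b, fb in layers.items():
            if fa and fb and (FOSSIL_AGE[fa], FOSSIL_AGE[fb]) in AGE_BEFORE:
                derived.add((a, b))             # older age implies 'below'
    return all((b, a) not in derived for (a, b) in derived)

layers, below = kb_from_trace(["depositStego", "erode", "depositTRex", "depositStego"])
print(consistent(layers, below))                # False: the step must be blocked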
We stress that consistency of the execution with the DKDP is a trace property:
it reasons about the events that happened, regardless of the current state.
In our example, consider the situation where, after t3, rule erode triggers
again and then rule depositStego is considered, i.e., the following continuation
of the trace
Its application to a trace ⟨ev(r1), ev(r1), ev(r2)⟩ is the set {P(A, B), P(B, A)}.
First-order matching mappings can also be applied to our running example.
Example 6. We continue with the trace from Ex. 4, extended with another event
ev(depositStego, v : layer2). We check whether adding an event to the trace
would result in a consistent KB by actually extending the trace for the analysis;
we call this a hypothetical execution step.
The following mapping, which must be provided by the user, adds the spatial
information about layers w.r.t. the fossils found within them. The first-order logic
formula in the guard of the mapping expresses that an event of depositTRex
occurs before an event of depositStego in the trace. Note that the given
set of axioms from the mapping faithfully describes the event structure of the
trace, i.e., the mapping may produce axioms that cause inconsistency
w.r.t. the domain knowledge: together with the DKDP, we can see that the
trace is mapped to an inconsistent knowledge base by adding five axioms. Note
that we do not generate one layer for each deposition event during simulation,
but only two specific ones, Layer(l1) and Layer(l2) in this case, for the relevant
information. One can extend the mapping rules for the different cases (for instance,
depositStego before depositTRex, only depositTRex events, etc.), or use a
different mapping mechanism, which we discuss further in Sec. 6.
∃l1, l2. ∃i1, i2. tr[i1] ≐ ev(depositTRex, v : l1) ∧ tr[i2] ≐ ev(depositStego, v : l2) ∧ i1 < i2
  ↦_{l1,l2}
Layer(l1), contains(l1, Tyrannosaurus), Layer(l2), contains(l2, Stegosaurus), below(l1, l2)
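A hypothetical Python rendering of this mapping rule may help (the event encoding and all names are ours): the guard scans the trace for a depositTRex event before a depositStego event and, if it fires, emits the five axioms over two fresh individuals l1 and l2:

def matching_mapping(trace):
    # trace: list of (rule, bindings) events; the guard mirrors the formula above.
    guard = any(r1 == "depositTRex" and r2 == "depositStego"
                for i1, (r1, _) in enumerate(trace)
                for i2, (r2, _) in enumerate(trace)
                if i1 < i2)
    if not guard:
        return set()
    return {("Layer", "l1"), ("contains", "l1", "Tyrannosaurus"),
            ("Layer", "l2"), ("contains", "l2", "Stegosaurus"),
            ("below", "l1", "l2")}

trace = [("depositStego", {"v": "layer0"}), ("erode", {"v": "layer0"}),
         ("depositTRex", {"v": "layer1"}), ("depositStego", {"v": "layer2"})]
print(matching_mapping(trace))   # the five axioms that, together with the
                                 # DKDP, make the mapped KB inconsistent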
We stress again that we are interested in trace properties: a layer may still
have had effects on the state despite being completely removed at some point (by
an erode event). Thus, we must consider the deposition event of a layer to check
the trace against the domain knowledge.
The guided transition system extends the mapping of a basic transition
system by additionally ensuring that the trace after executing the rule would
be mapped to a consistent knowledge base. This treats the domain knowledge
as an invariant that is enforced, i.e., a transition is only allowed if it indeed
preserves the invariant.
Definition 8 (Guided Transition System). Given a set of rules R, a mapping
µ and a knowledge base K, the guided semantics is defined as a transition
system between pairs of terms t and traces tr. For each rule r ∈ R, we have one
guided rule (for consistency, cf. Def. 2):

  t −(r,σ)→ t′    ev = ev(r, σ)    µ(tr ◦ ev) ∪ K is consistent
  ─────────────────────────────────────────────────────────────  (kb)
                 (t, tr) −(r)→ (t′, tr ◦ ev)
The transition rule in Def. 8 uses the knowledge base directly to check consistency.
While this enables integrating domain knowledge into the system
directly, it also poses challenges from a practical point of view. First, the condition
of the rule application is not specific to the change of the trace and must
check the consistency of the whole knowledge base, which can be computationally
heavy. Second, the consistency check is performed at every step, for every
potential rule application. Third, the trace must be mapped whenever it is extended,
which means that the same mapping computation that was performed
in the previous step may be executed all over again.
To overcome these challenges, we provide a system that reduces consistency
checking by using well-formedness guards, which only require evaluating an
expression over the trace, without accessing the knowledge base. These guards
are transparent to the domain users: the system behaves the same as with the
consistency checks on the knowledge base. At its core, we use well-formedness
predicates, which characterize the relation between domain knowledge and mappings.
Using this definition, we can slightly rewrite the rule of Def. 8: for every
starting term t0, the set of generated traces is the same if the rule of Def. 8 is
replaced by the following one:

  t −(r,σ)→ t′    ev = ev(r, σ)    wf(tr ◦ ev)
  ────────────────────────────────────────────  (wf)
         (t, tr) −(r)→ (t′, tr ◦ ev)
For first-order matching mappings, we can generate the well-formedness predicate
by testing all possible extensions of the knowledge base upfront and defining
the guards of those sets that cause inconsistency as non-well-formed.
Theorem 1. Let µ be a first-order matching mapping for some knowledge base
K. Let Ax = {ax1, . . . , axn} be the set of all bodies in µ. Let Incons be the set
of all subsets of Ax such that for each A ∈ Incons, ⋃_{a∈A} a ∪ K is inconsistent.
Let guard_A be the set of guards corresponding to each body in A. The following
predicate wf_µ is a well-formedness predicate for µ and K:

  wf_µ = ¬ ⋁_{A∈Incons} ⋀_{ϕ∈guard_A} ϕ
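As a minimal sketch of this construction in Python, assuming an external consistency oracle is_consistent (e.g., a wrapper around a DL reasoner) and our own interfaces for guards and bodies:

from itertools import combinations

def make_wf(mapping, kb, is_consistent):
    # mapping: list of (guard, body) pairs, where each body is a set of axioms
    # and each guard is a predicate over traces; kb is a set of axioms.
    # Enumerating all subsets is exponential, but happens once, upfront.
    incons = []
    for k in range(1, len(mapping) + 1):
        for subset in combinations(mapping, k):
            axioms = set().union(*(body for _, body in subset))
            if not is_consistent(axioms | kb):
                incons.append([guard for guard, _ in subset])
    def wf(trace):
        # wf = not OR_{A in Incons} AND_{phi in guard_A} phi(trace)
        return not any(all(g(trace) for g in guards) for guards in incons)
    return wf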
  wf(⟨⟩) ≡ true
  wf(tr ◦ ev) ≡ wf(tr) ∧ ⋀_{r∈R} ((rule(ev) ≐ r) → wf_r(tr, ev))
  t −(r,σ)→ t′    ev = ev(r, σ)    wf_r(tr, ev)
  ─────────────────────────────────────────────  (wf-r)
         (t, tr) −(r)→ (t′, tr ◦ ev)
The set of traces generated by this transition system from a starting term t0
is denoted G(R, wf, t0). Execution always starts with the empty trace.
Note that (a) we do use a specific well-formedness predicate per rule, and that
(b) we do not extend the trace tr in the premise, as the rules in Def. 8 and Def. 9 do.
For deterministic predicates, only one trace is generated: |G(R, wf, t0)| = 1.
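As an illustration of how rule (wf-r) drives an execution loop, consider this hedged Python sketch; the rule interface (a step function returning a successor term and event, or None) and the predicate interface are assumptions of ours:

def run_guarded(term, rules, wf_by_rule, max_steps=100):
    # rules: dict name -> step(term) returning (new_term, event) or None;
    # wf_by_rule: dict name -> predicate wf_r(trace, event).
    trace = []
    for _ in range(max_steps):
        for name, step in rules.items():
            result = step(term)
            if result is not None:
                new_term, event = result
                if wf_by_rule[name](trace, event):   # premise wf_r(tr, ev)
                    term, trace = new_term, trace + [event]
                    break
        else:
            break                                    # no admissible rule: stop
    return term, trace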
When the programmer designs the mapping, the focus is on mapping enough
information to achieve inconsistency, to ensure that certain transition steps are
not performed. If the same mapping is to be used to retrieve results from the
computation, e.g., to query over the final trace, this may be insufficient. Next,
we discuss mappings that preserve more, or even all, information from the trace.
generalizes the set of fresh names from first-order matching mappings in Def. 7.
Based on this function definition, the example in Sec. 3 can be performed using
the transducing mapping δ^below_{ι_geo,geo}. The connection between each pair of consecutive
events in a trace, i.e., that one layer is below another layer, is derived from the
axioms in the domain knowledge and is added as additional axioms to the KB.
Definition 13 (Knowledge Base for Traces). The knowledge base for traces
contains the concept Event modeling events, the concept Match modeling one pair
of a variable and its matching term, and the concept Term for terms. Furthermore,
the functional property appliesRule connects events to rule names (as strings),
the property match connects the individuals for events with the individuals
for matches (i.e., an event with the pairs v : t of a variable and the term assigned
to this variable), the property var connects matches and variables
(as strings), and term connects matches and terms.
We recall that KBs only support binary predicates, so we cannot avoid
formalizing the concept of a match, which connects three parts: event, variable
and term. The direct mapping lessens the workload for the programmer further:
it requires no additional input and can be applied fully automatically. It is a pre-defined
mapping for all programs and is defined by instantiating a transducing
mapping using the next role and the pre-defined functions δ_direct and ι_direct for
δ and ι. Additionally, we must generate fresh individuals for the matches. The
formal definition of the pre-defined functions for the direct mapping is as follows.
where δ(tj ) deterministically generates the axioms for the tree structure of the
term tj according to Def. 3 and η(tj ) returns the individual of the head of tj .
The properties match, var and term connect each event with its parameters.
For example, the match v : layer0 of the first event in Ex. 4 generates

  match(e1, match0_1), var(match0_1, “v”), term(match0_1, layer0)

where e1 is the representation of the event and match0_1 is the representation
of the match in the KB. The complete direct mapping is given in the following
example.
Example 10. The direct mapping of Ex. 4 is as follows. We apply the direct
function to all three events, where each event has one parameter.

{ Event(e1), Event(e2), Event(e3), Next(e1, e2), Next(e2, e3),
  appliesRule(e1, “depositStego”), appliesRule(e2, “erode”), appliesRule(e3, “depositTRex”),
  match(e1, m1), var(m1, “v”), term(m1, layer0), match(e2, m2), var(m2, “v”),
  term(m2, layer0), match(e3, m3), var(m3, “v”), term(m3, layer1) }
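A hypothetical Python sketch of the direct mapping (the trace encoding is ours; the individual names e1, m1, ... are chosen to match Ex. 10):

def direct_mapping(trace):
    # trace: list of (rule, bindings); returns a set of ABox axioms as tuples.
    axioms = set()
    for i, (rule, bindings) in enumerate(trace, start=1):
        ev = "e%d" % i
        axioms.add(("Event", ev))
        axioms.add(("appliesRule", ev, rule))
        if i > 1:
            axioms.add(("Next", "e%d" % (i - 1), ev))
        for j, (var, term) in enumerate(bindings.items(), start=1):
            m = "m%d" % i if len(bindings) == 1 else "m%d_%d" % (i, j)
            axioms |= {("match", ev, m), ("var", m, var), ("term", m, term)}
    return axioms

trace = [("depositStego", {"v": "layer0"}), ("erode", {"v": "layer0"}),
         ("depositTRex", {"v": "layer1"})]
print(sorted(direct_mapping(trace)))   # reproduces the axiom set of Ex. 10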
7 Discussion
Querying and Stability. The mapping can be used by the domain users to interact
with the system. For one, it can be used to retrieve the result of the computation
using the vocabulary of a domain. For example, the following SPARQL [44] query
retrieves all depositions generated during the Jurassic:
SELECT ?l WHERE {?l a Layer. ?l during Jurassic}
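As a usage sketch, such a query can be run with the rdflib Python library over an RDF serialization of the mapped KB; note that plain rdflib performs no OWL reasoning, so derived facts such as during would have to be materialized first, and the file name and prefix below are assumptions of ours:

import rdflib

g = rdflib.Graph()
g.parse("trace_kb.ttl", format="turtle")        # assumed dump of the mapped KB
q = """
PREFIX : <http://example.org/geo#>
SELECT ?l WHERE { ?l a :Layer . ?l :during :Jurassic }
"""
for row in g.query(q):
    print(row.l)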
Indeed, one of the main advantages of knowledge bases is that they enable
ontology-based data access [46]: uniform data access in terms of a given do-
main. Another possibility is to use justifications [5]. Justifications are minimal
sets of axioms responsible for entailments over a knowledge base, e.g., to find
out why it is inconsistent. They are able to explain, during an interaction, why
certain steps are not possible.
The programmers do not need to design a complete knowledge base: for
many domains, knowledge bases are available, for example in the form of industrial
standards [26,23]. For more specific knowledge bases, clear design principles
based on experience in ontology engineering are available [17]. Note that these
KBs are stable and rarely change. Our system requires static domain knowledge,
as changes in the DKDP can invalidate traces during execution without
executing a rule; this is thus not a limitation if one uses stable ontologies.
The direct mapping uses a fixed vocabulary, but one can formulate the con-
nection to the domain knowledge by using additional axioms. In Ex. 10, one can
declare every event to be a layer. The axiom for depositStego is as follows.
appliesRule value “depositStego” SubClassOf contains value Stegosaurus
Fig. 3. Runtime comparison.
are matched on by some erode event. We next discuss some of the considerations
when choosing the style of mapping, and the limitations of each.
There are, thus, two styles to connect trace and domain knowledge: one
can add axioms connecting the vocabulary of traces with the vocabulary of the
DKDP (direct mapping), or one can translate the trace into the vocabulary of
the DKDP (first-order matching mapping, transducing mappings).
The two styles require different skills from the programmer to interact with
the domain knowledge: the first style requires expressing a trace as part of
the domain as a set of ABox axioms, while the second one requires connecting
general traces to the domain using TBox axioms. Thus, the second style operates
on a higher level of abstraction, and we conjecture that such mappings may
require more interaction with the domain expert and a deeper knowledge of
knowledge graphs. However, the same insights needed to define the TBox axioms
are also needed to define the guards of a first-order matching mapping.
Naming Schemes. The transducing mappings and the first-order matching mapping
have different naming schemes. A transducing mapping, and thus a direct
mapping, generates a new name per event, while the first-order matching mapping
generates a fixed number of new names per rule. A transducing mapping
can extract quite extensive knowledge from a trace, with the direct mapping
giving a complete representation of it in a KB. As discussed, this requires the
user to define general axioms. A first-order matching mapping must work with
fewer names, and extracts less knowledge from a trace. Its design requires choosing
the right amount of abstraction to detect inconsistencies.
The guided system needs more than 409 s for n = 7 and shows the expected blow-up due to
the N2ExpTime-completeness of reasoning in the logic underlying OWL [30]. The
guarded system based on SMT similarly shows non-linear behavior, but scales
better than the guided system. For the evaluation, we ran each system for every
n three times and averaged the numbers, using an Ubuntu 21.04 machine with an
i7-8565U CPU and 32 GB RAM. As we can see, the guarded system allows for an
implementation that does not rely on an external, general-purpose reasoner to
evaluate the guards and increases the scalability of the system, while the guided
system does not scale even for small systems and KBs.
8 Related Work
Runtime enforcement is a vast research field; for a recent overview we refer to the
work of Falcone and Pinisetty [22]. In the following, we give the related work for
combinations of ontologies/knowledge bases and transition systems.
Concerning the combination of ontologies/knowledge bases and business process
modeling, Corea et al. [16] point out that current approaches lack the foundation
to annotate and develop ontologies together with business process rules.
Our approach focuses explicitly on automating the mapping, or supporting developers
in its development in a specific context, thus satisfying requirements 1 and
7 in their gap analysis for ontology-based business process modelling. Note that
most work in this domain uses ontologies for the process model itself, similar
to the ontology we give in Def. 13 (e.g., Rietzke et al. [36]), or for the
current state (e.g., Corea and Delfmann [15]), not for the trace. We refer to the
survey of Corea et al. for a detailed overview.
Compared with existing simulators for hydrocarbon exploration [20,47], which
formalized the domain knowledge of geological processes directly in the transition
rules, we propose a general framework to formalize the domain knowledge in a
knowledge base that is independent of the term rewriting system. This clear
separation of concerns makes it easier for domain users without programming
skills to use the knowledge base for simulation.
Tight interactions between programming languages, or transition systems,
and knowledge bases beyond logic programming have recently received increasing
research attention. The focus of the work of Leinberger [29,32] is the type
safety of loading RDF data from knowledge bases into programming languages.
Kamburjan et al. [28] semantically lift states for operations on the KB representation
of the state, but are not able to access the trace. In logic programming,
Zarrieß and Claßen [48] extend a concurrent extension of Golog [33] to verify
CTL properties with description logic assertions.
Cauli et al. [12] use knowledge bases to reason about the security properties of
deployment configurations in the cloud, a high-level representation of the overall
system. As for traces, Pattipati et al. [34] introduce a debugger for C programs
that operates on logs, i.e., special traces. Their debugger operates post-execution
and cannot guide the system. Al Haider et al. [1] use a similar technique to
investigate logged traces of a program.
9 Conclusion
We present a framework that uses domain knowledge about dynamic processes to
guide the execution of generic transition systems through runtime enforcement.
We give a transformation to use rule-specific guards instead of using the domain
knowledge directly as a consistency invariant over knowledge bases. The
transformation is transparent, and the domain user can interact with the system
without being aware of the transformation or implementation details. To reduce
the workload on the programmer, we discuss semi-automatic design of mappings
using transducing approaches and a pre-defined direct mapping. We also
discuss further alternatives, such as additional axioms on the events, and the
use of local well-formedness predicates for certain classes of mappings.
Future Work. We plan to investigate how our system can interact with knowledge
base evolution [24], a more declarative approach for changes in knowledge bases,
as well as other approaches to modeling sequences in knowledge bases [40].
References
1. N. Al Haider, B. Gaudin, and J. Murphy. Execution trace exploration and analysis
using ontologies. In RV, volume 7186 of LNCS, pages 412–426. Springer, 2011.
2. Apache Foundation. Apache Jena. https://jena.apache.org/.
3. F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider,
editors. The Description Logic Handbook: Theory, Implementation, and Applica-
tions. Cambridge University Press, 2003.
4. F. Baader, S. Ghilardi, and C. Lutz. LTL over description logic axioms. ACM
Trans. Comput. Log., 13(3):21:1–21:32, 2012.
5. F. Baader and B. Hollunder. Embedding defaults into terminological knowledge
representation formalisms. J. Autom. Reason., 14(1):149–180, 1995.
6. F. Baader and M. Lippmann. Runtime verification using the temporal description
logic ALC-LTL revisited. J. Appl. Log., 12(4):584–613, 2014.
7. S. Baset and K. Stoffel. Object-oriented modeling with ontologies around: A survey
of existing approaches. Int. J. Softw. Eng. Knowl. Eng., 28(11-12):1775–1794, 2018.
8. B. Beckert and D. Bruns. Dynamic logic with trace semantics. In CADE, volume
7898 of LNCS, pages 315–329. Springer, 2013.
9. S. Brandt, E. G. Kalayci, R. Kontchakov, V. Ryzhikov, G. Xiao, and M. Za-
kharyaschev. Ontology-based data access with a Horn fragment of metric temporal
logic. In AAAI, pages 1070–1076. AAAI Press, 2017.
10. S. Brandt, E. G. Kalayci, V. Ryzhikov, G. Xiao, and M. Zakharyaschev. Querying
log data with metric temporal logic. J. Artif. Intell. Res., 62:829–877, 2018.
11. R. Bubel, C. C. Din, R. Hähnle, and K. Nakata. A dynamic logic with traces
and coinduction. In TABLEAUX, volume 9323 of LNCS, pages 307–322. Springer,
2015.
12. C. Cauli, M. Li, N. Piterman, and O. Tkachuk. Pre-deployment security assessment
for cloud services through semantic reasoning. In CAV (1), volume 12759 of LNCS,
pages 767–780. Springer, 2021.
13. K. L. Clark and F. G. McCabe. Ontology oriented programming in Go! Appl.
Intell., 24(3):189–204, 2006.
14. M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and C. L.
Talcott. Reflection, metalevel computation, and strategies. In All About Maude,
volume 4350 of LNCS, pages 419–458. Springer, 2007.
15. C. Corea and P. Delfmann. Detecting compliance with business rules in ontology-
based process modeling. In J. M. Leimeister and W. Brenner, editors, To-
wards Thought Leadership in Digital Transformation: 13. Internationale Tagung
Wirtschaftsinformatik, WI 2017, St.Gallen, Switzerland, February 12-15, 2017,
2017.
16. C. Corea, M. Fellmann, and P. Delfmann. Ontology-based process modelling - will
we live to see it? In A. K. Ghose, J. Horkoff, V. E. S. Souza, J. Parsons, and
J. Evermann, editors, Conceptual Modeling - 40th International Conference, ER
2021, Virtual Event, October 18-21, 2021, Proceedings, volume 13011 of Lecture
Notes in Computer Science, pages 36–46. Springer, 2021.
17. J. Davies, R. Studer, and P. Warren. Semantic Web technologies: trends and re-
search in ontology-based systems. John Wiley & Sons, 2006.
18. L. M. de Moura and N. S. Bjørner. Z3: an efficient SMT solver. In TACAS, volume
4963 of Lecture Notes in Computer Science, pages 337–340. Springer, 2008.
19. C. C. Din, R. Hähnle, E. B. Johnsen, K. I. Pun, and S. L. T. Tarifa. Locally
abstract, globally concrete semantics of concurrent programming languages. In
TABLEAUX, volume 10501 of LNCS, pages 22–43. Springer, 2017.
40. E. D. Valle, S. Ceri, F. van Harmelen, and D. Fensel. It’s a streaming world!
Reasoning upon rapidly changing information. IEEE Intell. Syst., 24(6), 2009.
41. W. M. P. van der Aalst. Business process management: A comprehensive survey.
ISRN Software Engineering, 2013:507984, Feb 2013.
42. W3C, OWL Working Group. Web ontology language. https://www.w3.org/OWL.
43. W3C, RDF Working Group. Resource description framework. https://www.w3.
org/RDF.
44. W3C, SPARQL Working Group. SPARQL 1.1 query language. https://www.w3.
org/TR/sparql11-query/.
45. P. A. Walega, B. C. Grau, M. Kaminski, and E. V. Kostylev. DatalogMTL: Com-
putational complexity and expressive power. In IJCAI, pages 1886–1892. ijcai.org,
2019.
46. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Za-
kharyaschev. Ontology-based data access: A survey. In J. Lang, editor, IJCAI
2018, pages 5511–5519. ijcai.org, 2018.
47. I. C. Yu, I. Pene, C. C. Din, L. H. Karlsen, C. M. Nguyen, O. Stahl, and A. Latif.
Subsurface Evaluation Through Multi-scenario Reasoning, pages 325–355. Springer
International Publishing, Cham, 2021.
48. B. Zarrieß and J. Claßen. Verification of knowledge-based programs over descrip-
tion logic actions. In IJCAI, pages 3278–3284. AAAI Press, 2015.
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Specification and Validation of Normative Rules
for Autonomous Agents
1 Introduction
AI and autonomous robots are being adopted in applications ranging from health
and social care to transportation and infrastructure maintenance. In these applications,
the autonomous agents are often required to perform normative tasks that raise
social, legal, ethical, empathetic, and cultural (SLEEC) concerns [2].
widespread agreement that these concerns must be considered throughout the de-
velopment of the agents [3,4], and numerous guidelines propose high-level princi-
ples that reflect them [5,6,7,8]. However, to follow these guidelines, the engineers
developing the control software of autonomous agents need methods and tools
that support formalisation, validation and verification of SLEEC requirements.
The SLEECVAL tool introduced in our paper addresses this need by enabling
the specification and validation of SLEEC rules, i.e., nonfunctional requirements
focusing on SLEEC principles. To the best of our knowledge, our tool is novel in its
support for the formalisation and validation of normative rules for autonomous
agents, and represents a key step towards an automated framework for specifying,
validating and verifying autonomous agent compliance with such rules.
SLEECVAL is implemented as an Eclipse extension, and supports the definition
of SLEEC rules in a domain-specific language (DSL). Given a set of such
rules, the tool extracts their semantics in tock-CSP [9], a discrete-time variant
of the CSP process algebra [10], and uses the CSP refinement checker FDR4 [11]
to detect conflicting and redundant rules, providing counterexamples when such
problems are identified. Our SLEECVAL tool and case studies, together with
a description of its DSL syntax (BNF grammar) and tock-CSP semantics, are
publicly available on our project webpage [12] and GitHub repository [13].
//Conflicting Rules
RuleA when OpenCurtainsRequested then CurtainsOpened within 3 seconds
RuleB when OpenCurtainsRequested and userUndressed then not CurtainsOpened
//Redundant Rules
RuleC when DressingStarted then DressingFinished
RuleD when DressingStarted then DressingFinished within 2 minutes
// CONFLICT CHECKING
SLEECRuleARuleB = timed_priority(intersectionRuleARuleB)
assert SLEECRuleARuleB:[deadlock-free]
// REDUNDANCY CHECKING
SLEECRuleCRuleD = timed_priority(intersectionRuleCRuleD)
assert not MSN::C3(SLEECRuleCRuleD) [T= MSN::C3(SLEECRuleD)
Fig. 3: SLEEC framework (SLEEC experts, SLEEC document, tock-CSP script).
3 Evaluation
Case studies. We used SLEECVAL to specify and validate SLEEC rule sets
for agents in two case studies, presented next and summarised in Table 1.
Case study 1. The autonomous agent from the first case study is an assistive
dressing robot from the social care domain [16]. The robot needs to dress a user
with physical impairments in a garment by performing an interactive process
that involves finding the garment, picking it up, and placing it over the user’s arms
and torso. The SLEEC specification for this agent comprises nine rules, a subset
of which is shown in Fig. 1. SLEECVAL identified four pairs of conflicting rules
and two pairs of redundant rules in the initial version of this SLEEC specification,
including the conflicting rules RuleA and RuleB, and the redundant rules RuleC
and RuleD from Fig. 2a.
Case study 2. The autonomous agent from the second case study is a firefighter
drone whose detailed description is available at [17]. Its model identifies 21
robotic-platform services (i.e., capabilities) corresponding to sensors, actuators,
and an embedded software library of the platform. We consider scenarios in
which the firefighter drone interacts with several stakeholders: human firefight-
ers, humans affected by a fire, and teleoperators.
In these scenarios, the drone surveys a building where a fire was reported
to identify the fire location, and it either tries to extinguish a clearly identified
fire using its small on-board water reservoir, or sends footage of the surveyed
building to teleoperators. If, however, there are humans in the video stream,
there are privacy (ethical and/or legal) concerns. Additionally, the drone sounds
an alarm when its battery is running out. There are social requirements about
sounding a loud alarm too close to a human. The SLEEC specification for this
agent consists of seven rules, within which SLEECVAL identified one conflict
(between the rules shown in Fig. 4) and seven redundancies. The conflict is due
to the fact that Rule3 requires that the alarm is triggered (event SoundAlarm)
when the battery level is critical (signalled by the event BatteryCritical) and either
the temperature is greater than 35 °C or a person is detected, while the defeater
from Rule7 prohibits the triggering of the alarm when a person is detected.
Overheads. The overheads of the SLEECVAL validation depend on the com-
plexity and size of the SLEEC specifications, which preliminary discussions with
stakeholders suggested might include between several tens and a few hundred
rules. In our evaluation, the checks of the 27 assertions from the assistive robot
case study and of the 63 assertions from the firefighter drone case study were
performed in under 30s and 70s, respectively, on a standard MacBook laptop.
As the number of checks is quadratic in the size of the SLEEC rule set, the
time required to validate a fully fledged rule set of, say, 100–200 rules should not
exceed tens of minutes on a similar machine.
Usability. We have conducted a preliminary study in which we have asked eight
tool users (including lawyers, philosophers, computer scientists, roboticists and
human factors experts) to assess the SLEECVAL usability and expressiveness,
and to provide feedback to us. In this trial, the users were asked to define SLEEC
requirements for autonomous agents used in their projects, e.g. autonomous cars
and healthcare systems. The feedback received from these users can be summa-
rized as follows: (1) SLEECVAL is easy to use and the language is intuitive;
(2) The highlighting of keywords, error messages and warnings is particularly
helpful in supporting the definition of a comprehensive and valid SLEEC speci-
fication; (3) Using the FDR4 output (e.g., counterexamples) directly is useful as
a preliminary solution, but more meaningful messages are required to make rule
conflicts and redundancies easier to comprehend and fix.
4 Conclusion
We have introduced SLEECVAL, a tool for the definition and validation of
normative rules for autonomous agents. SLEECVAL uses a DSL for the encoding
of timed SLEEC requirements, gives them a tock-CSP semantics that it computes
automatically, and performs checks for conflicts and redundancy between rules.
We also presented results from applying SLEECVAL to an assistive dressing
robot and a firefighter drone.
In the future, we will consider uncertainty in the agents and their environ-
ments by extending the SLEEC DSL with probability constructs. Additionally,
we will develop a mechanism to annotate rules with labels that can be used to
provide more insightful feedback to SLEEC experts. Finally, a systematic and
comprehensive user study is also planned as future work. Our vision is to auto-
mate the whole process in Fig. 3 with a suggestive feedback loop allowing users
to address validation issues within their rule sets.
Acknowledgements
This work was funded by the Assuring Autonomy International Programme, and
the UKRI project EP/V026747/1 ‘Trustworthy Autonomous Systems Node in
Resilience’.
References
1. Nordmann, A., Hochgeschwender, N., Wigand, D., Wrede, S.: A survey on domain-
specific modeling and languages in robotics. Journal of Software Engineering for
Robotics 7(1), 75–99 (2016)
2. Townsend, B., Paterson, C., Arvind, T., Nemirovsky, G., Calinescu, R., Cavalcanti, A., Habli, I., Thomas, A.: From pluralistic normative principles to autonomous-agent rules. Minds and Machines 32, 683–715 (2022), https://cutt.ly/SLEEC-rule-elicitation
3. Dennis, L.A., Fisher, M., Winfield, A.: Towards verifiably ethical robot behaviour.
In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence
(2015)
4. Floridi, L., Cowls, J., Beltrametti, M., Chatila, R., Chazerand, P., Dignum, V.,
Luetge, C., Madelin, R., Pagallo, U., Rossi, F., et al.: An ethical framework for a
good AI society: Opportunities, risks, principles, and recommendations. In: Ethics,
Governance, and Policies in Artificial Intelligence, pp. 19–39. Springer (2021)
5. Future of Life Institute: ASILOMAR AI Principles. https://futureoflife.org/
2017/08/11/ai-principles/ (2017), accessed 31 March 2022
6. IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems: Ethically
Aligned Design – Version II (2017), https://standards.ieee.org/news/2017/
ead_v2/
7. British Standards Institution: BS 8611 Robots and robotic devices. Guide to the ethical design and application of robots and robotic systems (2016)
8. UNESCO: Recommendation on the Ethics of Artificial Intelligence.
https://unesdoc.unesco.org/ark:/48223/pf0000380455 (2021), Accessed: 2022-
03-18, Document code: SHS/BIO/REC-AIETHICS/2021
9. Baxter, J., Ribeiro, P., Cavalcanti, A.: Sound reasoning in tock-CSP. Acta Informatica 59(1), 125–162 (2022). https://doi.org/10.1007/s00236-020-00394-3
10. Roscoe, A.W.: The Theory and Practice of Concurrency. Prentice Hall (1997)
11. Gibson-Robinson, T., Armstrong, P., Boulgakov, A., Roscoe, A.: FDR3 — A Mod-
ern Refinement Checker for CSP. In: Ábrahám, E., Havelund, K. (eds.) Tools and
Algorithms for the Construction and Analysis of Systems. Lecture Notes in Com-
puter Science, vol. 8413, pp. 187–201 (2014)
12. SLEECVAL project webpage (2022), sleec.github.io
13. SLEECVAL GitHub repository (2022), anonymous.4open.science/r/SLEEC-tool
14. Brunero, J.: Reasons and Defeasible Reasoning. The Philosophical Quarterly 72(1), 41–64 (2021). https://doi.org/10.1093/pq/pqab013
15. Domain-specific language development. https://www.eclipse.org/Xtext (2022),
[Online accessed: 13 October 2022]
16. Camilleri, A., Dogramadzi, S., Caleb-Solly, P.: A study on the effects of cognitive
overloading and distractions on human movement during robot-assisted dressing.
Frontiers in Robotics and AI 9 (May 2022), https://eprints.whiterose.ac.uk/
187214/
17. MBZIRC: The Challenge 2020. https://www.mbzirc.com/grand-challenge (2020), [Online; accessed 13 October 2022]
Towards Log Slicing
1 Introduction
When debugging failures in software systems of various scales, the logs generated
by executions of those systems are invaluable [5]. For example, given an error
message recorded in a log, an engineer can diagnose the system by reviewing
log messages recorded before the error occurred. However, the sheer volume of
the logs (e.g., 50 GB/h [9]) makes it infeasible to review all of the log messages.
Considering that not all log messages are necessarily related to each other, in
this paper we lay the foundations for answering the following question: can we
automatically identify log messages related to a specific message (e.g., an error
message)?
A similar question for programs is already addressed by program slicing [2,14].
Using this approach, given a program composed of multiple program statements
and variables, we can identify a set of program statements (i.e., a program slice)
that affect the computation of specific program variables (at specific positions
in the source code).
Inspired by program slicing, in this paper we take initial steps towards developing
a novel approach, called log slicing. We also highlight a key issue to be addressed
by further research. Once this issue has been addressed, we expect log slicing to
be able to identify the log messages related to a given problematic log message
by using static analysis of the code that generated the log. Further, since we will
be using static analysis of source code, we highlight that our approach is likely
to be restricted to identifying problems that can be localised at the source code
level.

Program Pex:
(1) logger.info("check memory status: %s" % mem.status)
(2) db = DB.init(mode="default")
(3) logger.info("DB connected with mode: %s" % db.mode)
(4) item = getItem(db)
(5) logger.info("current item: %s" % item)
(6) if check(item) == "error":
(7)     logger.error("error in item: %s" % item)
The rest of the paper is structured as follows: Section 2 illustrates a motivat-
ing example. Section 3 sketches an initial approach for log slicing, while Section 4
shows its application to the example, and discusses limitations and open issues.
Section 5 discusses related work. Section 6 concludes the paper.
2 Motivating Example
The last log message “error in item: pencil” in Lex indicates an error.
Calling this log message merr , let us suppose that a developer is tasked with
addressing the error by reviewing the log messages leading up to merr . Though
we have only four messages in Lex , it is infeasible in practice to review a huge
amount of log messages generated by complex software systems. Furthermore,
it is not necessary to review all log messages generated before merr since only a
subset of them is related to merr ; for example, if we look at Lex and Pex together,
we can see that the first log message “check memory status: okay” does not
contain information that is relevant to the error message, merr . In particular,
we can see this by realising that the variable mem logged in the first log message
does not affect the computation of the variable item logged in the error message.
Ultimately, if we can automatically filter out such unrelated messages, with
the goal of providing a log to the developer that only contains useful log messages,
then the developer will be able to investigate and address issues in less time. We thus
arrive at the central problem of this short paper: How does one determine which
log messages are related to a certain message of interest?
An initial, naive solution would be to use keywords to identify related mes-
sages. In our example log Lex , one could use the keyword “pencil” appearing
in the error message to identify the messages related to the error, resulting in
only the third log message. However, if we look at the source code in Pex , we
can notice that the second log message “DB connected with mode: default”
could be relevant to the error because this message was constructed using the
db variable, which is used to compute the value of variable item. This example
highlights that keyword-based search cannot identify all relevant log messages,
meaning that a more sophisticated approach to identifying relevant log messages
is needed.
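For concreteness, the naive approach amounts to a one-line filter (a minimal sketch; log_before_err is a hypothetical list of the message strings preceding merr):

# Keeps only "current item: pencil" and misses the computationally
# relevant "DB connected with mode: default".
related = [m for m in log_before_err if "pencil" in m]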
3 Log Slicing
A key assumption in this work is that it is possible to associate each log message
with a unique logging statement in source code. We highlight that, while we do
not describe a solution here, this is a reasonable assumption because there is
already work on identifying the mapping between logging statements and log
messages [4,11]. Therefore, we simply assume that the mapping is known.
Under this assumption, we observe that the relationship among messages in
the log can be identified based on the relationship among their corresponding
logging statements in the source code. Hence, we consider two distinct layers: the
program layer, where program statements and variables exist, and the log layer,
where log messages generated by the logging statements of the program exist.
To present our log slicing approach, as done in Section 2, let us denote a
program P as a sequence of program statements and a log L as a sequence of log
messages. Also, we say a program (slice) P′ is a subsequence of P, denoted by
P′ ⊑ P, if all statements of P′ are in P in the same order. Further, we extend
containment to sequences and write s ∈ P when, with P = ⟨s1, . . . , su⟩, there is
some k such that sk = s. The situation is similar for a log message m contained
The result of this procedure would be a log slice that contains log messages that
are relevant to mj .
We highlight that defining the relation relevanceP for a program P (intu-
itively, deciding whether the information written to a log by a logging statement
is relevant to the computation being performed by some non-logging statement)
is a central problem in this work, and will be discussed in more depth in the
next section.
Fig. 3. Program slice Sr of the program Pex when s7 and its variable item are used as
the slicing criterion
If s is a logging statement that writes a message m to the log, then, assuming
that the only way in which a logging statement can use a variable is to add
information to the message that it writes to the log, the set vars(s) corresponds
to the set of variables used to construct the message m. If s is a non-logging
statement, then vars(s) represents the set of variables used by s.
Now, let us consider a logging statement sl, which writes a message ml to the
log, and a non-logging statement sr. We define relevanceP over the statements
in a program P by ⟨sl, sr⟩ ∈ relevanceP if and only if vars(sl) ∩ vars(sr) ≠ ∅. In
other words, a logging statement is relevant to a non-logging statement whenever
the two statements share at least one variable.
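Read as code, this definition could look as follows (a minimal sketch; the statement representation with id, kind, and vars fields is our assumption, not the paper's data model):

def relevance(program):
    """Pairs (logging id, non-logging id) that share at least one variable."""
    logging = [s for s in program if s["kind"] == "log"]
    non_logging = [s for s in program if s["kind"] != "log"]
    return {
        (sl["id"], sr["id"])
        for sl in logging
        for sr in non_logging
        if sl["vars"] & sr["vars"]  # vars(sl) and vars(sr) overlap
    }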
Step 1. Under our assumption that log messages can be mapped to their gen-
erating logging statements, we can immediately map m4 to s7 ∈ Pex . Once we
have identified the logging statement s7 that generated m4 , we slice Pex back-
wards, using s7 and its variable item as the slicing criterion. This would yield
the program slice Sr = ⟨s2, s4, s6, s7⟩ as shown in Figure 3.
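A minimal sketch of such backward slicing over straight-line code is shown below, using only data dependences; a full slicer in the style of Weiser [14] would also track control dependences, which is what pulls the branch s6 into Sr. The defs/uses representation is an assumption made for illustration.

def backward_slice(stmts, criterion, wanted):
    """Backward slice using data dependences only.
    stmts: dicts with integer 'id' and variable-name sets 'defs'/'uses'."""
    relevant = set(wanted)
    in_slice = set()
    for s in sorted(stmts, key=lambda s: s["id"], reverse=True):
        if s["id"] > criterion:
            continue  # statements after the criterion cannot affect it
        if s["id"] == criterion or s["defs"] & relevant:
            in_slice.add(s["id"])
            relevant |= s["uses"]
    return sorted(in_slice)

pex = [
    {"id": 1, "defs": set(),    "uses": {"mem"}},
    {"id": 2, "defs": {"db"},   "uses": set()},
    {"id": 3, "defs": set(),    "uses": {"db"}},
    {"id": 4, "defs": {"item"}, "uses": {"db"}},
    {"id": 5, "defs": set(),    "uses": {"item"}},
    {"id": 6, "defs": set(),    "uses": {"item"}},
    {"id": 7, "defs": set(),    "uses": {"item"}},
]
# Data dependences alone yield [2, 4, 7]; adding the control dependence
# of s7 on the branch at s6 gives the slice from Figure 3.
print(backward_slice(pex, criterion=7, wanted={"item"}))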
Fig. 5. Log slicing result from Lex when m4 is the message of interest
Step 3. Using Sl = hs3 , s5 , s7 i, we now remove log messages from Lex that were
generated by logging statements not included in Sl . The result is the sliced log
in Figure 5.
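Step 3 is then a simple filter, assuming the message-to-statement mapping from Section 3 is available (the names below are illustrative):

def slice_log(log, generating_stmt, slice_ids):
    """Keep only messages whose generating logging statement is in Sl."""
    return [m for m in log if generating_stmt[m] in slice_ids]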
Further Limitations. While this heuristic takes a step towards inspecting the
semantic content of log messages, rather than relying on shared variables, initial
implementation efforts have demonstrated the following limitations:
More Issues. In Section 3, we assumed that the mapping between log messages
and the corresponding logging statements that generated the log messages is
known. However, determining the log message that a given logging statement
might generate can be challenging, especially when the logging statement has a
non-trivial structure. For example, while some logging statements might consist
of a simple concatenation of a string and a variable value, others might involve
nested calls of functions from a logging framework. This calls for more studies
on finding the correspondence between logging statements and log messages.
5 Related Work
Log Analysis. The relationship between log messages has also been studied in
various log analysis approaches (e.g., performance monitoring, anomaly detec-
tion, and failure diagnosis), especially for building a “reference model” [12] that
represents the normal behavior (in terms of logged event flows) of the system
under analysis. However, these approaches focus on the problem of identifying
whether log messages co-occur (that is, one is always seen in the neighbourhood
of the other) without accessing the source code [6,10,13,17,18]. On the other
hand, we consider the computational relationship between log messages to filter
out the log messages that do not affect the computation of the variable values
recorded in a given log message of interest.
6 Conclusion
In this short paper, we have taken the first steps in developing log slicing, an
approach to helping software engineers in their log-based debugging activities.
Log slicing starts from a log message that has been selected as indicative of a
failure, and uses static analysis of source code (whose execution generated the
log in question) to throw away log entries that are not relevant to the failure.
In giving an initial definition of the log slicing problem, we highlighted the
central problem of this work: defining a good relevance relation. The provisional
definition of relevance that we gave in Section 4.1 proved to be limited in that it
required logging statements to use variables when constructing their log message.
To remedy the situation, we introduced a frequency- and proximity-based heuris-
tic in Section 4.3. While this approach could improve on the initial definition of
relevance, it possessed various limitations that we summarised.
Ultimately, as part of future work, we intend to investigate better definitions
of relevance between logging statements and non-logging statements. If we were
to carry on with the same idea for the heuristic (using frequency and proximity),
future work would involve 1) finding a suitable way to define tokens; 2) reducing
identification of coincidental associations between tokens and variables (i.e., re-
ducing false positives); and 3) attempting to identify associations between tokens
and variables with a lower frequency.
References
1. van der Aalst, W.M.P.: Distributed process discovery and conformance checking.
In: de Lara, J., Zisman, A. (eds.) Fundamental Approaches to Software Engineer-
ing. pp. 1–25. Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
2. Agrawal, H., Horgan, J.R.: Dynamic program slicing. SIGPLAN Not. 25(6), 246–256 (1990). https://doi.org/10.1145/93548.93576
3. Basin, D., Caronni, G., Ereth, S., Harvan, M., Klaedtke, F., Mantel, H.: Scalable
offline monitoring. In: Bonakdarpour, B., Smolka, S.A. (eds.) Runtime Verification.
pp. 31–47. Springer International Publishing, Cham (2014)
4. Bushong, V., Sanders, R., Curtis, J., Du, M., Cerny, T., Frajtak, K., Bures, M.,
Tisnovsky, P., Shin, D.: On matching log analysis to source code: A systematic
mapping study. In: Proceedings of the International Conference on Research in
Adaptive and Convergent Systems. pp. 181–187. RACS ’20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3400286.3418262
5. He, S., He, P., Chen, Z., Yang, T., Su, Y., Lyu, M.R.: A survey on automated log
analysis for reliability engineering. ACM Comput. Surv. 54(6) (Jul 2021). https:
//doi.org/10.1145/3460345
6. Jia, T., Yang, L., Chen, P., Li, Y., Meng, F., Xu, J.: Logsed: Anomaly diagnosis
through mining time-weighted control flow graph in logs. In: 2017 IEEE 10th In-
ternational Conference on Cloud Computing (CLOUD). pp. 447–455. IEEE, IEEE,
Honolulu, CA, USA (2017). https://doi.org/10.1109/CLOUD.2017.64
7. Liu, Z., Xia, X., Lo, D., Xing, Z., Hassan, A.E., Li, S.: Which variables should I
log? IEEE Transactions on Software Engineering 47(9), 2012–2031 (2021). https:
//doi.org/10.1109/TSE.2019.2941943
8. Messaoudi, S., Shin, D., Panichella, A., Bianculli, D., Briand, L.C.: Log-based
slicing for system-level test cases. In: Proceedings of the 30th ACM SIGSOFT In-
ternational Symposium on Software Testing and Analysis. p. 517–528. ISSTA 2021,
Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3460319.3464824
9. Mi, H., Wang, H., Zhou, Y., Lyu, M.R.T., Cai, H.: Toward fine-grained, unsu-
pervised, scalable performance diagnosis for production cloud computing systems.
IEEE Transactions on Parallel and Distributed Systems 24(6), 1245–1255 (2013).
https://doi.org/10.1109/TPDS.2013.21
10. Nandi, A., Mandal, A., Atreja, S., Dasgupta, G.B., Bhattacharya, S.: Anomaly detection using program control flow graph mining from execution logs. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 215–224. KDD ’16, Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939712
11. Schipper, D., Aniche, M., van Deursen, A.: Tracing back log data to its log
statement: From research to practice. In: 2019 IEEE/ACM 16th International
Conference on Mining Software Repositories (MSR). pp. 545–549 (2019). https:
//doi.org/10.1109/MSR.2019.00081
12. Shin, D., Bianculli, D., Briand, L.: PRINS: scalable model inference for
component-based system logs. Empirical Software Engineering 27(4), 87
(2022). https://doi.org/10.1007/s10664-021-10111-4
13. Tak, B.C., Tao, S., Yang, L., Zhu, C., Ruan, Y.: Logan: Problem diagnosis in the
cloud using log-based reference models. In: 2016 IEEE International Conference
on Cloud Engineering (IC2E). pp. 62–67 (2016). https://doi.org/10.1109/IC2E.
2016.12
14. Weiser, M.: Program slicing. IEEE Trans. Softw. Eng. 10(4), 352–357 (1984). https://doi.org/10.1109/TSE.1984.5010248
15. Yuan, D., Zheng, J., Park, S., Zhou, Y., Savage, S.: Improving software diag-
nosability via log enhancement. ACM Trans. Comput. Syst. 30(1) (Feb 2012).
https://doi.org/10.1145/2110356.2110360
16. Zhao, X., Rodrigues, K., Luo, Y., Stumm, M., Yuan, D., Zhou, Y.: Log20: Fully
automated optimal placement of log printing statements under specified overhead
threshold. In: 2017 26th Symposium on Operating Systems Principles (SOSP). p.
565–581. SOSP ’17, Association for Computing Machinery, New York, NY, USA
(2017). https://doi.org/10.1145/3132747.3132778
17. Zhao, X., Rodrigues, K., Luo, Y., Yuan, D., Stumm, M.: Non-Intrusive
performance profiling for entire software stacks based on the flow recon-
struction principle. In: 12th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 16). pp. 603–618. USENIX Associa-
tion, Savannah, GA (Nov 2016), https://www.usenix.org/conference/osdi16/
technical-sessions/presentation/zhao
18. Zhou, P., Wang, Y., Li, Z., Tyson, G., Guan, H., Xie, G.: LogChain: Cloud workflow reconstruction & troubleshooting with unstructured logs. Computer Networks 175, 107279 (2020). https://doi.org/10.1016/j.comnet.2020.107279
Vamos:
Middleware for Best-Effort Third-Party Monitoring
1 Introduction
Vamos pursues three design goals: (i) performance (keeping pace at low overhead), (ii) flexibility (compatibility with a wide range
of heterogeneous event sources that connect the monitor with the monitored
software, and with a wide range of formal specification languages that can be
compiled into Vamos), and (iii) ease-of-use (the middleware relieves the designer
of the monitor from system and code instrumentation concerns).
All of these goals are fairly standard, but Vamos’ particular design tradeoffs
center around making it as easy as possible to create a best-effort third-party
monitor of actual software without investing much time into low-level details of
instrumentation or load management. In practice, instrumentation—enriching
the monitored system with code that is gathering observations on whose basis
the monitor generates verdicts—is a key part of writing a monitoring system
and affects key performance characteristics of the monitoring setup [11]. These
considerations become even more important in third-party monitoring, where the
limited knowledge of and access to the monitored software may force the monitor
to spend more computational effort to re-derive information that it could not
observe, or combine it from smaller pieces obtained from more (and different)
sources. By contrast, current implementations of monitor specification languages
mostly offer either very targeted instrumentation support for particular systems
or some general-purpose API to receive events, or both, but little to organize
multiple heterogeneous event streams, or to help with the kinds of best-effort
performance considerations that we are concerned with. Thus, Vamos fills a gap
left open by existing tools.
Our vision for Vamos is that users writing a best-effort third-party monitor
start by selecting configurable instrumentation tools from a rich collection. This
collection includes tools that periodically query webservices, generate events for
relevant system calls, observe the interactions of web servers with clients, and
of course standard code instrumentation tools. The configuration effort for each
such event source largely consists of specifying patterns to look for and what
events to generate for them. Vamos then offers a simple specification language
for filtering and altering events coming from the event sources, and simple yet
expressive event recognition rules that produce a single, global event stream
by combining events from a (possibly dynamically changing) number of event
sources. Lastly, monitoring code as it is more generally understood—which could
be written directly or generated from existing tools for run-time verification like
LTL formulae [47], or stream verification specifications [8] such as TeSSLa [41]—
processes these events to generate verdicts about the monitored system.
Vamos thus represents middleware between event sources that emit events
and higher-level monitoring code, abstracting away many low-level details about
the interaction between the two. Users can employ both semi-synchronous and
completely asynchronous [11] interactions with any or all event sources. Between
these two extremes, to decouple the higher-level monitoring code’s performance
from the overhead incurred by the instrumentation, while putting a bound on
how far the monitoring code can lag behind the monitored system, we provide a
simple load-shedding mechanism that we call autodrop buffers, which are buffers
that drop events when the monitoring code cannot keep up with the rate of in-
coming events, while maintaining summarization data about the dropped events.
This summarization data can later be used by our event recognition system when
it is notified that events were dropped; some standard monitoring specification
systems can handle such holes in their event streams automatically [32,42,54].
The rule-based event recognition system allows grouping and ordering buffers
dynamically to prioritize or rotate within variable sets of similar event sources,
and specifying patterns over multiple events and buffers, to extract and combine
the necessary information for a single global event stream.
Data from event sources is transferred to the monitor using efficient lock-free
buffers in shared memory inspired by Cache-Friendly Asymmetric Buffers [29].
These buffers can transfer over one million events per second per event source
on a standard desktop computer. Together with autodrop buffers, this satisfies
our performance goal while keeping the specification effort low. As such, Va-
mos resembles a single-consumer version of an event broker [18,58,48,55,26,1]
specialized to run-time monitoring.
The core features we built Vamos around are not novel on their own, but
to the best of our knowledge, their combination and application to simplify
best-effort third-party monitoring setups is. Thus, we make the following contri-
butions:
2 Architectural Overview
Writing a run-time monitor can be a complex task, but many tools to express
logical reasoning over streams of run-time observations [19,34,16,49,24,27,41]
exist. However, trying to actually obtain a concrete stream of observations from
a real system introduces a very different set of concerns, which in turn have a
huge effect on the performance properties of run-time monitoring [11].
The goal of Vamos is to simplify this critical part of setting up a monitoring
system, using the model shown in Figure 1. On the left side, we assume an arbi-
trary number of distinct event sources directly connected to the monitor. This is
particularly important in third-party monitoring, as information may need to be
collected from multiple different sources instead of just a single program, but can
be also useful in other monitoring scenarios, e.g. for multithreaded programs.
present in the monitored system but cannot be observed, or, worse, the monitor
may have to consider multiple different possibilities if information cannot be reli-
ably recomputed. However, as part of our performance goal, we want the monitor
to not lag too far behind the monitored system. Therefore, our design splits the
monitoring system into the performance and correctness layers. In between the
two, events may be dropped as a simple load-shedding strategy.
The performance layer, on the other hand, sees all events and processes each
event stream in parallel. Stream processors enable filtering and altering the events
that come in, reducing pressure and computational load on the correctness layer.
This reflects that in third-party monitoring, observing coarse-grained event types
like system calls may yield many uninteresting events. For example, all calls to
read may be instrumented, but only certain arguments make them interesting.
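In Vamos, such filtering is written in its own specification language; purely as an illustration of the idea, a stream processor for this kind of source amounts to something like the following Python sketch (the event shape, the forward callback, and interesting_fds are assumptions):

def read_filter(source_events, forward, interesting_fds):
    for event in source_events:
        if event["call"] == "read" and event["fd"] not in interesting_fds:
            continue  # drop uninteresting reads in the performance layer
        forward(event)  # everything else reaches the correctness layer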
A Simple Example. Listing 1.1 shows a full Vamos specification (aside from
the definition of custom monitoring code in a C function called CheckOp). Stream
types describe the kinds of events and the memory layout of their data that can
appear in a particular buffer; in this example, streams of type Observation
contain only one possible event named Op with two fields of type int. For source
buffers—created using event source descriptions as in line 2—these need to be
based on the specification of the particular event source. Each event source is
associated with a stream processor; if none is given (as in this example), a default
one simply forwards all events to the corresponding arbiter buffer, here specified
as an autodrop buffer that can hold up to 16 events and when full keeps dropping
them until there is again space for at least four new events. Using an autodrop
buffer means that in addition to the events of the stream type, the arbiter may
see a special hole event notifying it that events were dropped. In this example,
the arbiter simply ignores those events and forwards all others to the monitor,
which runs in parallel to the arbiter with a blocking event queue of size two, and
whose behavior we implemented directly in C code between $$ escape characters.
3 Efficient Instrumentation
Our goals for the performance of the monitor are to not incur too much overhead
on the monitored system, and for the monitor to be reasonably up-to-date in
terms of the lag between when an event is generated and when it is processed. The
key features Vamos offers to ensure these properties while keeping specifications
simple are related to the performance layer, which we discuss here.
Even when instrumenting things like system calls, in order to extract informa-
tion from them in a consistent state, the monitored system will have to be briefly
interrupted while the instrumentation copies the relevant data. A common solu-
tion is to write this data to a log file that the monitor is incrementally processing.
This approach has several downsides. First, in the presence of multiple threads,
accesses to a single file require synchronization. Second, the common use of string
encodings requires extra serialization and parsing steps. Third, file-based buffers
are typically at least very large or unbounded in size, so slower monitors even-
tually exhaust system resources. Finally, writing to the log uses relatively costly
system calls. Instead, Vamos event sources transmit raw binary data via chan-
nels implemented as limited-size lock-free ring buffers in shared memory, limiting
instrumentation overhead and optimizing throughput [29]. To avoid expensive
synchronization of different threads in the instrumented program (or just to
logically separate events), Vamos also allows dynamically allocating new event
sources, such that each thread can write to its own buffer(s). The total number
of event sources may therefore vary across the run of the monitor.
For each event source, Vamos allocates a new thread in the performance
layer to process events from this source. In this layer, event processors can
filter and alter events before they are forwarded to the correctness layer, all in
a highly parallel fashion. A default event processor simply forwards all events.
The computations done here should be done at the speed at which events are
generated on that particular source, otherwise the source buffer will fill up and
eventually force the instrumentation to wait for space in the buffer.
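The index arithmetic behind such a channel can be pictured as follows; this is only a conceptual single-producer/single-consumer ring, since the real Vamos buffers live in shared memory and rely on C atomics that Python cannot express:

class SPSCRing:
    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def try_push(self, item):
        if self.tail - self.head == self.size:
            return False  # full: the instrumentation must wait
        self.slots[self.tail % self.size] = item
        self.tail += 1  # publish only after the slot is written
        return True

    def try_pop(self):
        if self.head == self.tail:
            return None  # empty
        item = self.slots[self.head % self.size]
        self.head += 1
        return item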
events is lost, which can help to reduce the impact of load shedding. At mini-
mum, the existence of the hole event alone makes a difference in monitorability
compared to not knowing whether any events have been lost [35], and is used as
such in some monitoring systems [32,42,54].
In addition to autodrop buffers, arbiter buffers can also be finite-size buffers
that block when space is not available, or infinite-size buffers. The former may
slow down the stream processor and ultimately the event source, while the latter
may accumulate data and exhaust available resources. For some event sources,
this may not be a big risk, and it eliminates the need to deal with hole events.
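The autodrop behaviour itself is easy to state abstractly. The sketch below is a single-threaded stand-in (capacity 16, freeing at least four slots, as in the example of Section 2); the real buffers are lock-free and live in shared memory:

from collections import deque

class AutodropBuffer:
    def __init__(self, capacity=16, min_free=4):
        self.buf = deque()
        self.capacity = capacity
        self.min_free = min_free
        self.dropped = 0  # summarization data about dropped events

    def push(self, event):
        if self.dropped:
            if self.capacity - len(self.buf) >= self.min_free:
                self.buf.append(("hole", self.dropped))  # notify the arbiter
                self.dropped = 0
            else:
                self.dropped += 1  # keep dropping until space frees up
                return
        if len(self.buf) < self.capacity:
            self.buf.append(("event", event))
        else:
            self.dropped = 1  # buffer full: start a new hole

    def pop(self):
        return self.buf.popleft() if self.buf else None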
This rule matches multiple events on two different buffers (In and Out), describ-
ing a series of user input and program output events that together form a single
higher-level event SawTransfer, which is forwarded to the monitor component
of the correctness layer. Rules do not necessarily consume the events they have
looked at; some events may also just serve as a kind of lookahead. The “|” charac-
ter in the events sequence pattern separates the consumed events (left) from the
lookahead (right). Code between $$ symbols can be arbitrary C code with some
special constructs, such as the $yield statement (to forward events) above.
The rule above demonstrates the basic event-recognition capabilities of ar-
biters. By ordering the rules in a certain way, we can also prioritize processing
events from some buffers over others. Rules can also be grouped into rule sets
that a monitor can explicitly switch between in the style of an automaton.
The rules shown so far only refer to arbiter buffers associated with specific,
named event sources. As we mentioned before, Vamos also supports creating
event sources dynamically during the run of the monitoring system. To be able
to refer to these in arbiter rules, we use an abstraction we call buffer groups.
As the name suggests, buffer groups are collections of arbiter buffers whose
membership can change at run time. They are the only way in which the arbiter
can access dynamically created event sources, so to allow a user to distinguish
between them and manage associated data, we extend stream types with stream
fields that can be read and updated by arbiter code. Buffer groups are declared
for a specific stream type, and their members have to have that stream type³.
Therefore, each member offers the same stream fields, which we can use to com-
pare buffers and order them for the purposes of iterating through the buffer
group. Now the arbiter rules can also be choice blocks with more rules nested
within them, as follows (Both is a buffer group and pos is a stream field):
choose F, S from Both {
  on F : Prime(n,p) | where $$ $F.pos < $S.pos $$
    $$ ... $$
  on F : hole(n) |
    $$ $F.pos = $F.pos + n; $$
}
This rule is a slightly simplified version of one in the Primes example in Section 6.
This example does not use dynamically created buffers, but only has two event
sources, and uses the ordering capabilities of buffer groups to prioritize between
the buffers based on which one is currently “behind” (expressed in the stream
field pos, which the buffer group Both is ordered by). The choose rule tries to
instantiate its variables with distinct members from the buffer group, trying out
permutations in the lexicographic extension of the order specified for the buffer
group. If no nested rule matches for a particular instantiation, the next one in
order is tried, and the choose rule itself fails if no instantiation finds a match.
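This instantiation strategy can be pictured as follows (a sketch only; nested rules are modelled as callables that return True on a match, which is our simplification of the actual rule semantics):

from itertools import permutations

def run_choose(buffer_group, nested_rules, order_key):
    """Try pairs of distinct buffers in the lexicographic extension of the
    group's order; the first nested rule that matches wins."""
    for f, s in permutations(sorted(buffer_group, key=order_key), 2):
        for rule in nested_rules:
            if rule(f, s):
                return True
    return False  # no instantiation matched: the choose rule fails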
To handle dynamically created event sources, corresponding stream processor
rules specify a buffer group to which to add new event sources, upon which the
arbiter can access them through choose rules. In most cases, we expect that
choose blocks are used to instantiate a single buffer, in which case we only need
to scan the buffer group in its specified order. Here, a round-robin order allows
for fairness, while field-based orderings allow more detailed control over buffer
prioritization, as it may be useful to focus on a few buffers at the expense of
others, as in our above example.
Another potential option for ordering schemes for buffer groups could be
based on events waiting in them, or even the values of those events’ associated
data. Vamos currently does not support this because it makes sorting much more
³ Note that stream processors may change the stream type between the source buffer and arbiter buffer, so event sources may use different types, but their arbiter buffers may be grouped together if processed accordingly.
5 Implementation
The compiler takes a Vamos specification described in the previous sections and
turns it into a C program. It does some minimal checking, for example whether
events used in parts of the program correspond to the expected stream types,
but otherwise defers type-checking to the C compiler. The generated program
expects a command-line argument for each specified event source, providing the
name of the source buffer created by whatever actual event source is used. Event
sources signal when they are finished, and the monitor stops once all event
sources are finished and all events have been processed.
The default way of using TeSSLa for online monitoring is to run an offline
monitor incrementally on a log file of serialized event data from a single global
event source.
⁴ Entries have the size of the largest event consisting of its fixed-size fields and identifiers for variable-sized data (strings) transported in separately managed memory.
A recent version of TeSSLa [33] allows generating Rust code for
the stream processing system with an interface to provide events and drive the
stream processing directly. Our compiler can generate the necessary bridging
code and replace the monitor component in Vamos with a TeSSLa Rust moni-
tor. We used TeSSLa as a representative of higher-level monitoring specification
tools; in principle, one could similarly use other standard monitor specification
languages, thus making it easier to connect them to arbitrary event sources.
6 Evaluation
Our stated design goals for Vamos were (i) performance, (ii) flexibility, and
(iii) ease-of-use. Of these, only the first is truly quantitative, and the major-
ity of this section is devoted to various aspects of it. We present a number of
benchmark programs, each of which used Vamos to retrieve events from differ-
ent event sources and organize them for a higher-level monitor in a different way,
which provides some qualitative evidence for its flexibility. Finally, we present a
case study to build a best-effort data-race monitor (Section 6.4), whose relative
simplicity provides qualitative evidence for Vamos’ ease of use.
In evaluating performance, we focus on two critical metrics:
Our core claim is that Vamos allows building useful best-effort third-party
monitors for programs that generate hundreds of thousands of events per second
without a significant slow down of the programs beyond the unavoidable cost of
generating events themselves. We provide evidence that corroborates this claim
based on three artificial benchmarks that vary various parameters and one case
study implementation of a data race monitor that we test on 391 benchmarks
taken from SV-COMP 2022 [7].
Our first experiment is meant to establish the basic capabilities of our arbiter
implementation. An event source sends 10 million events carrying a single 64-bit
number (plus 128 bits of metadata), waiting for some number of cycles between
Fig. 2. The percentage of events that reached the final stage of the monitor in a stress
test where the source sends events rapidly. Parameters are different arbiter buffer sizes
(x-axis) and the delay (Waiting) of how many empty cycles the source waits between
sending individual events. The shading around lines shows the 95 % confidence interval
around the mean of the measured value. The source buffer was 8 pages large, which
corresponds to a bit over 1 300 events.
each event. The performance layer simply forwards the events to autodrop buffers
of a certain size, the arbiter retrieves the events, including holes, and forwards
them to the monitor, which keeps track of how many events it saw and how
many were dropped. We varied the number of cycles and the arbiter buffer sizes
to see how many events get dropped because the arbiter cannot process them
fast enough—the results can be seen in Figure 2.
At about 70 cycles of waiting time, almost all events could be processed
even with very small arbiter buffer sizes (4 and up). In our test environment,
this corresponds to a delay of roughly 700 ns between events, which means that
Vamos is able to transmit approximately 1.4 million events per second.
6.2 Primes
As a stress-test where the monitor actually has some work to do, this benchmark
compares two parallel runs of a program that generates streams of primes and
prints them to the standard output, simulating a form of differential monitor-
ing [45]. The task of the monitor is to compare their output and alert the user
whenever the two programs generate different data. Each output line is of the
form #n : p, indicating that p is the nth prime. This is easy to parse using reg-
ular expressions, and our DynamoRIO-based instrumentation tool simply yields
events with two 32-bit integer data fields (n and p).
While being started at roughly the same time, the programs as event sources
run independently of each other, and scheduling differences can cause them to
run out of sync quickly. To account for this, a Vamos specification needs to al-
locate large enough buffers to either keep enough events to make up for possible
scheduling differences, or at least enough events to make it likely that there is
Fig. 3. Overheads (left) and percentage of found errors (right) in the primes benchmark
for various numbers of primes and arbiter buffer sizes relative to DynamoRIO-optimized
but not instrumented runs. DynamoRIO was able to optimize the program so much
that the native binary runs slower than the instrumented one.
some overlap between the parts of the two event streams that are not automat-
ically dropped. The arbiter uses the event field for the index variable n to line
up events from both streams, exploiting the buffer group ordering functionality
described in Section 4.2 to preferentially look at the buffer that is “behind”, but
also allowing the faster buffer to cache a limited number of events while waiting
for events to show up on the other one. Once it has both results for the same
index, the arbiter forwards a single pair event to the monitor to compare them.
Figure 3 shows results of running this benchmark in 16 versions, generating
between 10 000 and 40 000 primes with arbiter buffer sizes ranging between 128
and 2048 events. The overheads of running the monitor are small, do not differ
between different arbiter buffer sizes, and longer runs amortize the initial cost
of dynamic instrumentation. We created a setting where one of the programs
generates a faulty prime about once every 10 events and measured how many
of these discrepancies the monitor can find (which depends on how many events
are dropped). Unsurprisingly, larger buffer sizes are better at balancing out the
scheduling differences that let the programs get out of sync. As long as the
programs run at the same speed, there should be a finite arbiter buffer size that
counters the desynchronization. In these experiments, this size is 512 elements.
Fig. 4. Percentage of primes checked and errors found (of 40 000 events in total) by
the TeSSLa monitor for different arbiter specifications and arbiter buffer sizes.
1. the forward arbiter just forwards events as they come; it is equivalent to writ-
ing a script that parses output of generators and (atomically) feeds events
into a pipe from which TeSSLa reads events.
2. the alternate arbiter always forwards the event from the stream where we
have seen fewer events so far; if streams happen to be aligned (that is, contain
no or only equally-sized hole events), the events will perfectly alternate.
3. the align arbiter is the one we used in our original implementation to intel-
ligently align both streams.
Figure 4 shows the impact of these different arbiter designs on how well the
monitor is able to do its task, and that indeed more active arbiters yield better
results—without them, the streams are perfectly aligned less than 1% of the time.
While one could write similar functionality to align different, unsynchronized
streams in TeSSLa directly, the language does not easily support this. As such,
a combination of TeSSLa and Vamos allows simpler specifications in a higher-
level monitoring language, dealing with the correct ordering and preprocessing
of events on the middleware level.
6.3 Bank
Fig. 5. Results of monitoring a simple banking simulator with Vamos monitor (left)
and TeSSLa monitor (right). Boxplots show the difference in the number of reported
errors versus the number of errors the application made, in percent.
fails, this amount provides an upper bound on an account’s balance, and any
higher successive withdrawal attempt must surely fail too.
In the spirit of third-party monitoring, however, the stateful interface does
not necessarily make it easy to derive these higher level events. For example,
there is no individual confirmation that says that the withdrawal of some amount
from some account was successful or not. Instead, the user selects an account,
then the withdraw action, is then prompted which amount they would like to
withdraw from said account, and after entering said amount, the system only
displays a message that the withdrawal failed or was successful. The event source
parses each individual step and provides them on two separate streams, one for
the inputs and one for the outputs. This is where Vamos’ higher-level event
recognition capabilities (see also the example in Section 4.1) allow the arbiter
to recognize the higher-level events to forward to the monitor, which itself is
therefore again much easier to specify.
To conduct measurements, we randomly generated 10 000 (well-formed) in-
puts and fed them to the banking application as fast as possible. We also let
the application generate erroneous outputs (wrong balances, swapping success
and failure messages) at random and measured how many of those our best-effort
third-party monitor was able to detect. The size of the source buffer was one
page (128 events) and we varied the size of arbiter buffers from 4 to 2048.
The heavyweight instrumentation we used in this scenario caused the bank-
ing application to run through its script about 40% slower than without instru-
mentation for all sizes of the arbiter buffer, which is more than in our other
benchmarks, but still seems plausible for interactive programs, and could be
optimized much further. Our second metric is how many errors the monitor actu-
ally detects. Figure 5 shows this for both the monitor we described above and
a TeSSLa variant that only considers exact knowledge about account balances
(no upper or lower bounds) and thus finds fewer errors, demonstrating both an
alternate monitor design and the use of our TeSSLa connector. The results vary
quite a bit with arbiter buffer sizes and between runs, and the monitor may re-
port more errors than were inserted into the run. This is because, first, especially
with smaller buffer sizes, the autodrop buffers may drop a significant portion (up
to 60% at arbiter buffer size 4, 5% at size 256) of the events, but the moni-
tor needs to see a contiguous chunk of inputs and outputs to be able to gather
enough information to find inconsistencies. Second, some errors cause multiple
inconsistencies: when a transfer between accounts is misreported as successful
or failed when the opposite is true, the balances (or bounds) of two accounts
are wrong. Overall, both versions of the monitor were able to find errors with
even smaller sizes of arbiter buffers, and increasing numbers improved the results
steadily, matching the expected properties of a best-effort third-party monitor.
While our other benchmarks were written artificially, we also used Vamos to de-
velop a best-effort data race monitor. Most tools for dynamic data race detection
use some variation of the Eraser algorithm [51]: obtain a single global sequence
of synchronization operations and memory accesses, and use the former to estab-
lish happens-before relationships whenever two threads access the same memory
location in a potentially conflicting way. This entails keeping track of the last ac-
cessing threads for each location, as well as of the ways in which any two threads
might have synchronized since those last accesses. Implemented naïvely, every
memory access causes the monitor to pause the thread and atomically update
the global synchronization state. Over a decade of engineering efforts directed
at tools like ThreadSanitizer [52] and Helgrind [57] have reduced the resulting
overhead, but it can still be substantial.
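To make the happens-before idea concrete, here is a toy vector-clock detector in the spirit of this algorithm family; it is deliberately naive (it remembers all past accesses) and is not the paper's Goldilocks implementation:

class VC(dict):
    """Vector clock: thread id -> logical time."""
    def join(self, other):
        for t, c in other.items():
            self[t] = max(self.get(t, 0), c)

class ToyDetector:
    def __init__(self):
        self.clocks = {}    # thread id -> its current vector clock
        self.locks = {}     # lock id -> clock snapshot at last release
        self.accesses = {}  # location -> list of (tid, snapshot, is_write)

    def _clock(self, tid):
        return self.clocks.setdefault(tid, VC({tid: 1}))

    def access(self, tid, loc, is_write):
        vc = self._clock(tid)
        for t2, snap, w2 in self.accesses.get(loc, []):
            conflicting = t2 != tid and (is_write or w2)
            # The earlier access happens-before this one iff our clock has
            # caught up with t2's own entry at the time of that access.
            if conflicting and vc.get(t2, 0) < snap.get(t2, 0):
                print(f"possible race on {loc}: threads {t2} and {tid}")
        self.accesses.setdefault(loc, []).append((tid, VC(vc), is_write))

    def acquire(self, tid, lock):
        if lock in self.locks:
            self._clock(tid).join(self.locks[lock])

    def release(self, tid, lock):
        vc = self._clock(tid)
        self.locks[lock] = VC(vc)
        vc[tid] = vc.get(tid, 0) + 1  # start a new epoch after releasing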
Vamos enabled us to develop a similar monitor at significantly reduced engi-
neering effort in a key area: efficiently communicating events to a monitor run-
ning in parallel in its own process, and building the global sequence of events.
To build our monitor, we used ThreadSanitizer’s source-code-based approach⁵
to instrument relevant code locations, and for each such location, we reduce
the need for global synchronization to fetching a timestamp from an atomi-
cally increased counter. Based on our facilities for dynamically creating event
sources, each thread forms its own event source to which it sends events. In the
correctness layer, the arbiter builds the single global stream of events used by
our implementation of a version of the Goldilocks [22] algorithm (a variant of
Eraser [51]), using the timestamps to make sure events are processed in the right
order. Autodrop buffers may drop some events to avoid overloading the moni-
tor; when this happens to a thread, we only report data races that the algorithm
finds if all involved events were generated after the last time that events were
dropped. This means that our tool may not find some races, often those that
can only be detected looking at longer traces. However, it still found many races
in our experiments, and other approaches to detecting data races in best-effort
ways have similar restrictions [56].
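Conceptually, the arbiter's job of rebuilding the global order is a timestamp merge. Assuming each per-thread stream yields (timestamp, thread id, event) tuples in locally increasing timestamp order, which the atomic counter guarantees, a minimal sketch is (hole handling omitted):

import heapq

def global_stream(per_thread_streams):
    """Merge locally ordered per-thread streams into one stream ordered by
    the global timestamp each event carries."""
    return heapq.merge(*per_thread_streams)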
Our implementation (contained in our artifact [12]) consists of:
⁵ This decision was entirely to reduce our development effort; a dynamic instrumentation source could be swapped in without other changes.
Fig. 6. Comparing running times of the three tools on all 391 benchmarks (left) and the
correctness of their verdicts on the subset of 118 benchmarks for which it was possible
to determine the ground truth (right). Race vs. no race indicates whether the tool
found at least one data race, correct vs. wrong indicates whether that verdict matches
the ground truth. For benchmarks with unknown ground truth, the three tools agreed
on the existence of data races more than 99% of the time.
7 Related Work
store (parts of) traces and additional information for the longer term. Mod-
ern industrial implementations of this concept, like Apache Flink [1], are built
for massively parallel stream processing in distributed systems, supporting arbi-
trary applications but providing no special abstractions for monitoring, in con-
trast to more run-time-monitoring-focused implementations like ReMinds [58].
Complex event recognition systems also sometimes provide capabilities for load-
shedding [59], of which autodrop buffers are the simplest version. Most event
recognition systems provide more features than Vamos, but are also harder to
set up for monitoring; in contrast, Vamos offers a simple specification language
that is efficient and still flexible enough for many monitoring scenarios.
Stream Run-Time Verification LoLa [19,24], TeSSLa [41], and Striver [27]
are stream run-time verification [8] systems that allow expressing a monitor as
a series of mutually recursive data streams that compute their current values
based on each other’s values. This requires some global notion of time, as the
streams are updated with new values at time ticks and refer to values in other
streams relative to the current tick, which is not necessarily available in a het-
erogeneous setting. Stream run-time verification systems also do not commonly
support handling variable numbers of event sources. Some systems allow for dy-
namically instantiating sub-monitors for parts of the event stream [3,6,49,24] in
a technique called parametric trace slicing [15]. This is used for logically split-
ting the events on a given stream into separate streams, making them easier
to reason about, and can sometimes be exploited for parallelizing the monitor’s
work. These additional streams are internal to the monitoring logic; in contrast,
Vamos’ ability to dynamically add new event sources affects the monitoring
system’s outside connections, while, internally, the arbiter still unifies the events
coming in on all such connections into one global stream.
Autodrop buffers instead trade precision for avoiding this kind of overhead. Aside
from the survey, some systems (like TeSSLa [41]) incrementalize their default of-
fline behavior to provide a monitor that may eventually significantly lag behind
the monitored system.
Executing monitoring code or even just writing event data to a file or sending
it over the network is costly in terms of overhead, even more so if multiple threads
need to synchronize on the relevant code. Ha et al. proposed Cache-Friendly
Asymmetric Buffering [29] to run low-overhead run-time analyses on multicore
platforms. They only transfer 64-bit values, which suffices for some analyses, but
not for general-purpose event data. Our adapted implementation thus has to do
some extra work, but shares the idea of using a lock-free single-producer-single-
consumer ring buffer for low overhead and high throughput.
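As a rough illustration of this buffer discipline, the following is a minimal Python sketch of a single-producer-single-consumer ring buffer; the actual implementation is written in C with atomic index updates, so this only shows the index ownership that makes locking unnecessary.

    class SPSCRingBuffer:
        """Single-producer/single-consumer ring buffer sketch."""

        def __init__(self, capacity):
            self.buf = [None] * capacity
            self.capacity = capacity
            self.head = 0  # next slot to read; written only by the consumer
            self.tail = 0  # next slot to write; written only by the producer

        def push(self, item):  # producer side
            nxt = (self.tail + 1) % self.capacity
            if nxt == self.head:
                return False  # full: caller may drop the event instead of waiting
            self.buf[self.tail] = item
            self.tail = nxt
            return True

        def pop(self):  # consumer side
            if self.head == self.tail:
                return None  # empty
            item = self.buf[self.head]
            self.head = (self.head + 1) % self.capacity
            return item

Because each index is written by exactly one side, producer and consumer never contend on the same variable, which is what keeps the overhead low.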
While we try to minimize it, we accept some overhead for instrumentation
as given. Especially in real-time systems, some run-time monitoring solutions
adjust the activation status of parts of the instrumentation according to some
metrics of overhead, inserting hole events for phases when instrumentation is
deactivated [5,31,2]. In contrast, the focus of load-shedding through autodrop
buffers is on ensuring that the higher-level part of the monitor is working with
reasonably up-to-date events while not forcing the monitored system to wait.
For monitors that do not rely on extensive summarization of dropped events,
the two approaches could easily be combined.
8 Conclusion
We have presented Vamos, which we designed as middleware for best-effort
third-party run-time monitoring. Its goal is to significantly simplify the instru-
mentation part of monitoring, broadly construed as the gathering of high-level
observations that serve as the basis for traditional monitoring specifications, par-
ticularly for best-effort third-party run-time monitoring, which may often need
some significant preprocessing of the gathered information, potentially collected
from multiple heterogeneous sources. We have presented preliminary evidence
that the way we built Vamos can handle large numbers of events and lets us
specify a variety of monitors with relative ease. In future work, we plan to apply
Vamos to more diverse application scenarios, such as multithreaded webservers
processing many requests in parallel, or embedded software, and to integrate our
tools with other higher-level languages. If a system’s behavior conforms to the
References
14. Chen, F., Roşu, G.: Java-MOP: A monitoring oriented programming environ-
ment for java. In: TACAS 2005. pp. 546–550 (2005). https://doi.org/10.1007/
978-3-540-31980-1_36
15. Chen, F., Rosu, G.: Parametric trace slicing and monitoring. In: TACAS 2009. pp.
246–261 (2009). https://doi.org/10.1007/978-3-642-00768-2_23
16. Colombo, C., Pace, G.J., Schneider, G.: LARVA — safer monitoring of real-time
java programs (tool paper). In: SEFM 2009. pp. 33–37 (2009). https://doi.org/10.
1109/SEFM.2009.13
17. Convent, L., Hungerecker, S., Leucker, M., Scheffel, T., Schmitz, M., Thoma, D.:
TeSSLa: Temporal stream-based specification language. In: SBMF 2018. pp. 144–
162 (2018). https://doi.org/10.1007/978-3-030-03044-5_10
18. Cugola, G., Margara, A.: Processing flows of information: From data stream to
complex event processing. ACM Computing Surveys 44(3), 15:1–15:62 (2012).
https://doi.org/10.1145/2187671.2187677
19. D’Angelo, B., Sankaranarayanan, S., Sánchez, C., Robinson, W., Finkbeiner, B.,
Sipma, H.B., Mehrotra, S., Manna, Z.: LOLA: runtime monitoring of synchronous
systems. In: TIME 2005. pp. 166–174 (2005). https://doi.org/10.1109/TIME.2005.
26
20. De Bus, B., Chanet, D., De Sutter, B., Van Put, L., De Bosschere, K.: The design
and implementation of FIT: A flexible instrumentation toolkit. In: PASTE 2004.
p. 29–34 (2004). https://doi.org/10.1145/996821.996833
21. Drusinsky, D.: Monitoring temporal rules combined with time series. In: CAV 2003.
pp. 114–117 (2003). https://doi.org/10.1007/978-3-540-45069-6_11
22. Elmas, T., Qadeer, S., Tasiran, S.: Goldilocks: A race and transaction-aware
java runtime. In: PLDI 2007. p. 245–255 (2007). https://doi.org/10.1145/1250734.
1250762
23. Eustace, A., Srivastava, A.: ATOM: A flexible interface for building
high performance program analysis tools. In: USENIX 1995. pp. 303–314
(1995), https://www.usenix.org/conference/usenix-1995-technical-conference/
atom-flexible-interface-building-high-performance
24. Faymonville, P., Finkbeiner, B., Schirmer, S., Torfah, H.: A stream-based spec-
ification language for network monitoring. In: RV 2016. pp. 152–168 (2016).
https://doi.org/10.1007/978-3-319-46982-9_10
25. Francalanza, A., Seychell, A.: Synthesising correct concurrent runtime monitors.
Formal Methods in System Design 46(3), 226–261 (2015). https://doi.org/10.1007/
s10703-014-0217-9
26. Giatrakos, N., Alevizos, E., Artikis, A., Deligiannakis, A., Garofalakis, M.: Com-
plex event recognition in the big data era: A survey. The VLDB Journal 29(1),
313–352 (July 2019). https://doi.org/10.1007/s00778-019-00557-w
27. Gorostiaga, F., Sánchez, C.: Striver: Stream runtime verification for real-
time event-streams. In: RV 2018. pp. 282–298 (2018). https://doi.org/10.1007/
978-3-030-03769-7_16
28. Gregg, B.: DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD.
Prentice Hall (2011)
29. Ha, J., Arnold, M., Blackburn, S.M., McKinley, K.S.: A concurrent dynamic anal-
ysis framework for multicore hardware. In: OOPSLA 2009. pp. 155–174 (2009).
https://doi.org/10.1145/1640089.1640101
30. Havelund, K., Rosu, G.: Monitoring Java programs with Java pathexplorer. In: RV
2001. pp. 200–217 (2001). https://doi.org/10.1016/S1571-0661(04)00253-1
31. Huang, X., Seyster, J., Callanan, S., Dixit, K., Grosu, R., Smolka, S.A., Stoller,
S.D., Zadok, E.: Software monitoring with controllable overhead. International
Journal on Software Tools for Technology Transfer 14(3), 327–347 (2012). https://doi.org/10.1007/s10009-010-0184-4
32. Joshi, Y., Tchamgoue, G.M., Fischmeister, S.: Runtime verification of LTL on
lossy traces. In: SAC 2017. p. 1379–1386 (2017). https://doi.org/10.1145/3019612.
3019827
33. Kallwies, H., Leucker, M., Schmitz, M., Schulz, A., Thoma, D., Weiss, A.: TeSSLa
- an ecosystem for runtime verification. In: RV 2022. pp. 314–324 (2022). https://doi.org/10.1007/978-3-031-17196-3_20
34. Karaorman, M., Freeman, J.: jMonitor: Java runtime event specification and mon-
itoring library. In: RV 2004. pp. 181–200 (2005). https://doi.org/10.1016/j.entcs.
2004.01.027
35. Kauffman, S., Havelund, K., Fischmeister, S.: What can we monitor over unreliable
channels? International Journal on Software Tools for Technology Transfer 23(4),
579–600 (2021). https://doi.org/10.1007/s10009-021-00625-z
36. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: An
overview of AspectJ. In: ECOOP 2001. pp. 327–353 (2001). https://doi.org/10.1007/3-540-45337-7_18
37. Kim, M., Kannan, S., Lee, I., Sokolsky, O., Viswanathan, M.: Java-MaC: A run-
time assurance tool for Java programs. In: RV 2001. pp. 218–235 (2001). https://doi.org/10.1016/s1571-0661(04)00254-3
38. Kim, M., Kannan, S., Lee, I., Sokolsky, O., Viswanathan, M.: Computational anal-
ysis of run-time monitoring - fundamentals of java-mac. In: RV 2002. pp. 80–94
(2002). https://doi.org/10.1016/S1571-0661(04)80578-4
39. Kim, M., Viswanathan, M., Ben-Abdallah, H., Kannan, S., Lee, I., Sokolsky, O.:
Formally specified monitoring of temporal properties. In: ECRTS 1999. pp. 114–
122 (1999). https://doi.org/10.1109/EMRTS.1999.777457
40. Lattner, C., Adve, V.S.: LLVM: A compilation framework for lifelong program
analysis & transformation. In: CGO 2004. pp. 75–88 (2004). https://doi.org/10.
1109/CGO.2004.1281665
41. Leucker, M., Sánchez, C., Scheffel, T., Schmitz, M., Schramm, A.: TeSSLa: runtime
verification of non-synchronized real-time streams. In: SAC 2018. pp. 1925–1933
(2018). https://doi.org/10.1145/3167132.3167338
42. Leucker, M., Sánchez, C., Scheffel, T., Schmitz, M., Thoma, D.: Runtime verifica-
tion for timed event streams with partial information. In: RV 2019. pp. 273–291
(2019). https://doi.org/10.1007/978-3-030-32079-9_16
43. Luk, C., Cohn, R.S., Muth, R., Patil, H., Klauser, A., Lowney, P.G., Wallace, S.,
Reddi, V.J., Hazelwood, K.M.: Pin: building customized program analysis tools
with dynamic instrumentation. In: PLDI 2005. pp. 190–200 (2005). https://doi.
org/10.1145/1065010.1065034
44. Mansouri-Samani, M., Sloman, M.: Monitoring distributed systems. IEEE Network
7(6), 20–30 (1993). https://doi.org/10.1109/65.244791
45. Muehlboeck, F., Henzinger, T.A.: Differential monitoring. In: RV 2021. pp. 231–243
(2021). https://doi.org/10.1007/978-3-030-88494-9_12
46. Nethercote, N., Seward, J.: Valgrind: a framework for heavyweight dynamic bi-
nary instrumentation. In: PLDI 2007. pp. 89–100 (2007). https://doi.org/10.1145/
1250734.1250746
47. Pnueli, A., Zaks, A.: PSL model checking and run-time verification via testers. In:
FM 2006. pp. 573–586 (2006). https://doi.org/10.1007/11813040_38
48. Rabiser, R., Guinea, S., Vierhauser, M., Baresi, L., Grünbacher, P.: A comparison
framework for runtime monitoring approaches. Journal of Systems and Software
125, 309–321 (2017). https://doi.org/10.1016/j.jss.2016.12.034
49. Reger, G., Cruz, H.C., Rydeheard, D.: MarQ: Monitoring at runtime with QEA. In:
TACAS 2015. pp. 596–610 (2015). https://doi.org/10.1007/978-3-662-46681-0_55
50. Rosenberg, C.M., Steffen, M., Stolz, V.: Leveraging DTrace for runtime verification.
In: RV 2016. pp. 318–332 (2016). https://doi.org/10.1007/978-3-319-46982-9_20
51. Savage, S., Burrows, M., Nelson, G., Sobalvarro, P., Anderson, T.: Eraser: A dy-
namic data race detector for multithreaded programs. ACM Transactions on Com-
puter Systems 15(4), 391–411 (November 1997). https://doi.org/10.1145/265924.
265927
52. Serebryany, K., Iskhodzhanov, T.: ThreadSanitizer: Data race detection in practice.
In: WBIA 2009. p. 62–71 (2009). https://doi.org/10.1145/1791194.1791203
53. Stoller, S.D., Bartocci, E., Seyster, J., Grosu, R., Havelund, K., Smolka, S.A.,
Zadok, E.: Runtime verification with state estimation. In: RV 2011. pp. 193–207
(2012). https://doi.org/10.1007/978-3-642-29860-8_15
54. Taleb, R., Khoury, R., Hallé, S.: Runtime verification under access restric-
tions. In: FormaliSE@ICSE 2021. pp. 31–41 (2021). https://doi.org/10.1109/
FormaliSE52586.2021.00010
55. Tawsif, K., Hossen, J., Raja, J.E., Jesmeen, M.Z.H., Arif, E.M.H.: A review on
complex event processing systems for big data. In: CAMP 2018. pp. 1–6 (2018).
https://doi.org/10.1109/INFRKM.2018.8464787
56. Thokair, M.A., Zhang, M., Mathur, U., Viswanathan, M.: Dynamic race detection
with O(1) samples. PACMPL 7(POPL) (January 2023). https://doi.org/10.1145/3571238
57. Valgrind: Helgrind (2023), https://valgrind.org/docs/manual/hg-manual.html
58. Vierhauser, M., Rabiser, R., Grünbacher, P., Seyerlehner, K., Wallner, S., Zeisel,
H.: ReMinds: A flexible runtime monitoring framework for systems of systems.
Journal of Systems and Software 112, 123–136 (2016). https://doi.org/10.1016/j.
jss.2015.07.008
59. Zhao, B., Viet Hung, N.Q., Weidlich, M.: Load shedding for complex event pro-
cessing: Input-based and state-based techniques. In: ICDE 2020. pp. 1093–1104
(2020). https://doi.org/10.1109/ICDE48307.2020.00099
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Yet Another Model! A Study on Model’s
Similarities for Defect and Code Smells
1 Introduction
The presence of code smells and anti-patterns is commonly related to defective code
[24,34,49,51]. Code smells are symptoms of implementation decisions that
may degrade the code quality [22]. Anti-patterns are the misuse of solutions to
recurring problems [9]. For instance, Khomh et al. (2012) found that classes
classified as God Classes are more defect-prone than classes that are not smelly.
In this paper, we refer to code smells and anti-patterns as code smells.
One technique to mitigate the impact of defects and code smells is the appli-
cation of strategies that anticipate problematic code [47], usually with the use of
machine learning models that predict a defect or code smell
[12,13,14,26,35,45,47,52,73]. Training and evaluating machine learning models
is a hard task, since (i) it needs a large dataset, to avoid overfitting; (ii) the pro-
cess of obtaining the labels and features to serve as input is costly, and it requires
the use of different tools to support it; (iii) setting up the environment for train-
ing and evaluating models is time-consuming and computationally expensive,
even though some tools help to automate the process; and (iv) understanding
the importance of the features and how they affect the model is complex [39].
With these difficulties in mind, our goal is to identify a set of features that can
be used by developers to simplify the process of defect and code smell prediction.
To this end, we aim to reduce the number of features that need to be collected
to predict defects and code smells, or to identify candidate classes likely to present them,
through an analysis of model redundancies. To the best of our knowledge, no
other studies have investigated similarities between the defect and code smell
models. Instead, most studies focus on proposing and assessing the performance
of models that predict defects or code smells [27,35,41,44]. In this work, we
fill this gap through an analysis of which features are redundant or different
in models built for defects and for seven code smells. Moreover, we highlight
which quality attributes are relevant to their prediction. This analysis is made
possible by the SHAP technique, which determines the contribution of each
feature to the prediction. As a result, using SHAP allows the verification of the
features that contributed the most to the prediction and whether the features
had high or low values.
To achieve our goal, we use a subset of 14 open-source Java systems that
have their features and defects annotated [15,16]. We then employ the Organic tool
[48] to detect nine code smells. We merged three of these smells due to similar
definitions. After merging the data, we train and evaluate an ensemble machine
learning model composed of five algorithms for each of our targets, i.e., defects
and code smells. After evaluating the performance of our ensemble, we apply
the SHAP technique to identify which features are relevant for each model.
Finally, we analyze the results in terms of: (i) which features are relevant for
each model; (ii) which features contribute the most for two or more models to
identify redundancies in the models; (iii) which quality attributes are important
to the defect and code smell prediction.
Our main findings are: (i) from the seven code smells evaluated, we identified
that the most similar models to the Defect are the God Class, Refused Bequest,
and Spaghetti Code; (ii) Nesting Level Else-If (NLE) and Comment Density
(CD) are the most important features; (iii) most features have high values, ex-
cept on Refused Bequest; (iv) we identified sets of features that are common
in trios of problems, such as API Documentation (AD), which is important for
Defects, God Class, and Refused Bequest; (v) documentation, complexity, and
size are the quality attributes that contribute the most for the prediction of de-
fects and code smells; (vi) the intersection of features between the defects and
code smells ranges from 40% for Refused Bequest to 60% for God Class. We
also contributed to the community by providing an extension of the previous
dataset of defects [15,16] through the addition of nine smells, available in our
online appendix [64]. As a consequence of these analyses, we obtained a smaller
set of features that contributes to the prediction of defects and code smells. De-
velopers and researchers may train machine learning models with less effort using
these findings, or they may use these features to identify possible candidates for
introducing defects and code smells.
We organize the remainder of this work as follows. Section 2 describes the
background of our work. Section 3 shows how we structured the methodology.
Then, Section 4 presents the results of our evaluation comparing the defect
model with the code smells. Section 5 discusses the main threats to validity of
our investigation. Section 6 presents the related work our investigation is based
on. Finally, Section 7 concludes this paper with remarks for further explorations
about the subject.
2 Background
2.1 Defects
Brown et al. [9] proposed a catalog of anti-patterns: solutions to recurring
problems that, unlike design patterns, negatively impact the source code instead
of providing reusable structure. Later, Fowler [22] introduced code smells as
symptoms of sub-optimal decisions in the software implementation that lead to
code quality degradation. Since our defect dataset is class-level, we only
consider the problems related to classes. In our work, we considered the following
smells: Refused Bequest (RB), Brain Class (BC), Class Data Should be Private
(CP), Complex Class (CC), Data Class (DC), God Class (GC), Lazy Class (LC),
Spaghetti Code (SC), and Speculative Generality (SG). The definitions of the
problems presented in this paper are: God Class is a large class that has too
many responsibilities and centralizes the module functionality [61]. Refused Be-
quest is a class that does not use its parent behavior [22]. Spaghetti Code is a
class that has methods with large and unique multistage process flow [9]. Due to
space constraints, the definitions of all evaluated problems can be found in our
replication package [64].
3 Study Design
In this paper, we investigate the similarities and redundancies between the soft-
ware features used to predict defects and code smells. This information can be used
to simplify the prediction models or to identify possible candidates for introducing
defects or smells. We employed data preparation to find the software features
for the defect and code smell prediction models. Our main objective is thus
to examine the software features applied in both predictions. Our paper in-
vestigates the following research questions.
RQ1. Are the defect and class-level code smell models explainable?
RQ2. Which software features are present in both defect and code smell models?
RQ3. Which software quality attributes are more relevant for the prediction of
both defects and code smells?
3.2 Data
analysis of the defects and code smells. Finally, the open-source data facilitates
the collection of code smells.
Data Collection. The first step of our study is to collect the data about the
code smells to merge with the defect data [15]. We applied the Organic tool [48]
to detect the code smells. As all projects are available on GitHub, we manually
cloned the source code matching the project version included in the dataset.
Since most of the systems in the original dataset have fewer than 1000 classes
(20 systems), we collected data from the ones with more than 1000 classes (14
projects). We decided to focus on these projects because they represent 75%
of the entire defect data and are readily available on GitHub. Additionally, we
matched the names of the detected instances of code smells to the class names
present in our defect dataset. Hence, independently of whether a class had a
smell or not, we only considered it if a match was found in both datasets (i.e.,
the one with the defects and the one with the code smells). Whenever we
could not find a match, we did not consider the class for further investigation. We
used this approach to avoid bias, as it would be unfair to determine that a class
that Organic could not find in the defect dataset is non-smelly. Furthermore,
this approach decreased the number of classes for most of the projects.
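A pandas sketch of this matching step could look as follows; the file and column names are assumptions for illustration, not the actual dataset schema.

    import pandas as pd

    # Hypothetical inputs: both tables carry a fully qualified class name.
    defects = pd.read_csv("defect_dataset.csv")   # columns: class_name, defect, ...
    smells = pd.read_csv("organic_smells.csv")    # columns: class_name, smell, ...

    # Keep only classes found in BOTH datasets; unmatched classes are
    # discarded rather than labeled non-smelly, as described above.
    matched = defects.merge(smells, on="class_name", how="inner")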
(Table legend. CP: Class Data Should be Private; DC: Data Class; GC: God Class; LC: Lazy Class; RB: Refused Bequest; SC: Spaghetti Code; SG: Speculative Generality.)
Organic collects a wide range of code smells, including method and class
ones. However, as the defect dataset is class-level, we only use the code smells
found in classes. For this reason, we obtained the ground truth of nine smells,
as described in Section 2.2. After collecting the data, we merged three code
smells: Brain Class (BC), God Class (GC), and Complex Class (CC) into one
done by two software specialists that did not participate in the previous valida-
tion.
In the end, the developers agreed that all GC instances classified by the tool were correct
(i.e., 18 out of 18 responses). For RB, the developers agreed with the tool on 14 out of the
18 software classes (an agreement of approximately 77%). Finally, SC was
slightly worse: the developers classified 13 out of
the 18 classes as SC. Thus, SC classes achieved an agreement of 72% between
the developers and the tool. The results demonstrate that Organic can identify
code smells with an appropriate level of accuracy (around 84% of agreement
between them). For this reason, we conclude that the Organic data is adequate
to represent code smells.
Although the literature proposes many quality attributes to group software fea-
tures [4,8,68], we focus on the quality attributes previously discussed in the
selected dataset [15,16]. These quality attributes cluster the entire collection of
software features. Therefore, we separate the aforementioned software features
into seven quality attributes: (i) Complexity, (ii) Coupling, (iii) Size, (iv) Doc-
umentation, (v) Clone, (vi) Inheritance, and (vii) Cohesion. Table 2 presents
the quality attributes with their definition and reference. The complete list of
software features (66 in total) and the quality attributes are available under the
replication package of this study [64].
(Figure: overview of the data processing pipeline in three steps, (i) data collection, (ii) data exploration, and (iii) feature engineering, the latter comprising feature selection, a correlation threshold, and multicollinearity filtering; 18,963 classes and a final set of 56 features.)
In the end, we executed data imputation to handle missing values, but
the dataset had none.
Data Exploration. In the second step of the machine learning pipeline, we
executed the data exploration. First, we applied one-hot encoding [38] to
the type feature, which stores information about the class type. For instance,
we created two new features for class and interface types. Subsequently, we
applied data normalization using Standard Scaler [59]. Finally, we employed
Synthetic Minority Oversampling Technique (SMOTE) [70] to deal with the
imbalanced nature of the dataset. Table 1 summarizes the imbalanced nature
of the targets compared to the data collection. For instance, from 19K classes,
only 757 present Spaghetti Code (almost 4% of classes).
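A minimal sketch of these three steps with scikit-learn and imbalanced-learn, assuming a dataframe with a categorical type column and a binary target column, could look like this.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import SMOTE

    df = pd.read_csv("merged_dataset.csv")  # hypothetical merged data

    # One-hot encode the class type (creates e.g. type_class, type_interface).
    df = pd.get_dummies(df, columns=["type"])

    # Standardize the numeric features, then oversample the minority class.
    X = StandardScaler().fit_transform(df.drop(columns=["target"]))
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, df["target"])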
Feature Engineering. In the final step, we applied feature engineering to se-
lect the relevant software features. As a result, we executed feature selection,
correlation analysis, and multicollinearity thresholds. First, the feature selec-
tion technique chooses a subset of software features from the combination of
various permutation importance techniques, including Random Forest, Ada-
Boost, and linear correlation. Second, we checked the correlation between
the subset of software features (99% threshold). In doing so, we removed
five software features (LLDC, TNLPA, TNA, TNPA, and TCLOC) because
they were highly correlated with other software features (LDC, CLOC, NA,
NLPA, and NPA). Additionally, we set the multicollinearity threshold to
85%, meaning that we remove software features with a correlation higher
than the threshold. In the end, we ended up with 56 software features.
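The correlation-based filtering can be sketched in a few lines of pandas; this is a generic version of the technique, not the exact procedure used in the study.

    import numpy as np
    import pandas as pd

    def drop_correlated(features: pd.DataFrame, threshold: float) -> pd.DataFrame:
        # Drop one feature from every pair whose absolute Pearson
        # correlation exceeds the threshold.
        corr = features.corr().abs()
        # Inspect only the upper triangle so each pair is considered once.
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
        return features.drop(columns=to_drop)

    # e.g., applying the two thresholds mentioned above in sequence:
    # reduced = drop_correlated(drop_correlated(features, 0.99), 0.85)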
for each model. After experimenting with all the targets, we observed that five
models can achieve good performance independently of the target (i.e.,
defects or code smells): Random Forest [23], LightGBM [32], Extra Trees [10],
Gradient Boosting Machine [72], and KNN [80]. For this reason, these models
are carried over into the ensemble model. The data on the performance of the
evaluated models can be found in our replication package [64]. To evaluate our
models, we focus on the F1 and AUC metrics. F1 represents the harmonic mean
of precision and recall. Additionally, AUC is relevant because we are dealing
with binary classification and this metric shows the performance of a model at
all thresholds. For these reasons, both metrics are suitable for the imbalanced
nature of data [11].
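Assuming the ensemble is a soft-voting combination of the five algorithms (the combination rule here is an assumption for illustration), a scikit-learn sketch of training and scoring it with F1 (the harmonic mean 2PR/(P + R) of precision P and recall R) and AUC could look like this; X_train, y_train, X_test, and y_test are placeholders.

    from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                                  GradientBoostingClassifier, VotingClassifier)
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import f1_score, roc_auc_score
    from lightgbm import LGBMClassifier

    ensemble = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier()),
            ("lgbm", LGBMClassifier()),
            ("et", ExtraTreesClassifier()),
            ("gbm", GradientBoostingClassifier()),
            ("knn", KNeighborsClassifier()),
        ],
        voting="soft",  # average predicted probabilities across the five models
    )
    ensemble.fit(X_train, y_train)
    print("F1 :", f1_score(y_test, ensemble.predict(X_test)))
    print("AUC:", roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1]))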
Explaining the Models. The current literature offers many possibilities to ex-
plain machine learning models in multiple problems. One of the most prominent
techniques spread in the literature is the application of SHAP (SHapley Ad-
ditive exPlanations) values [39]. These values compute the importance of each
feature in the prediction model. Therefore, we can reason why a machine learn-
ing model made such decisions about the specific domain. For this reason, SHAP
is appropriate as machine learning models are hard to explain [69] and features
interact in complex patterns to create models that provide more accurate predic-
tions. Consequently, knowing the logic behind a prediction for a software class is a determining
factor that can help to tackle the reasons behind a defect or code smell in the
target class.
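A minimal SHAP sketch, assuming a fitted tree-based model from the ensemble above (the choice of explainer variant is ours, for illustration):

    import shap

    # TreeExplainer does not handle the voting ensemble directly,
    # so we explain one of the fitted tree-based base models.
    model = ensemble.named_estimators_["rf"]
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Summarize features by mean absolute SHAP value; max_display=10
    # yields a top-10 view like the ones discussed in Section 4.
    shap.summary_plot(shap_values, X_test, max_display=10)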
4 Results
Before explaining the models, we evaluate if they can effectively predict the
code smells and defects. Even though we originally built models for the entire
set of code smells, we observed that only three code smells (God Class, Refused
Bequest, and Spaghetti Code) have comparable models to the defects. For this
reason, we only present the results of these three code smells. We believe some
code smells, for instance Lazy Class and Data Class, are not similar to the defect
model because they indicate simple code with less chance of having a defect. The
results for the remaining code smells are available in the replication package [64].
This section discusses the explanation of each target model. We rely on SHAP
to support the model explanation [39]. To simplify our analysis, we consider the
top-10 most influential software features on the target in each prediction model.
We then compare each code smell model with the defective one. Our goal is to
find similarities and redundancies between the software features that help the
machine learning model to predict the target code smells and defects. We extract
these ten software features from each of the four target models (i.e., the defect
model and the three code smell models presented in this paper).
To illustrate our results, we employ a Venn diagram to check the intersection
of features between the four models (Figures 2, 3, and 4). The Venn diagram
displays two dashed circles, one for the code smell model and another for the
defect model. Inside each dashed circle, we present the top-10 software features
that contributed the most to the prediction of the target within inner circles. The
color of these inner circles represents the feature’s quality attribute. Likewise,
the size of the inner circle represents the influence of the feature on the model,
meaning that the bigger the size, the more it contributes to the target prediction.
On each side of the inner circles, we have an arrow that indicates the direction
of the feature value. For instance, a software feature with an arrow pointing up
means that the software feature contributes to the prediction when its value is
high. On the other hand, a software feature with an arrow pointing down means
that the feature contributes to the prediction when its value is low. The software
features on the intersection have two inner circles because they have a different
impact on each target (i.e., defects and the three code smells). For a better
understanding of the acronyms, we show, on the right side of each diagram, a
table with the acronyms and full names of all features that appear on
the diagram.
God Class. Figure 2 shows the top-10 features that contribute to the Defect
and God Class models, and their feature intersection. We can observe from
Figure 2 that the defect model has an intersection with God Class of 6 out of 10
features. This means that 60% of the top-10 features that contribute the most to
predictions are the same for both models. These features are: CD, CLOC, AD,
NL, NLE, and CLLC; and most of them are related to documentation (3 out of
6) and complexity (2 out of 6). The only difference concerns CD, which needs
to have low values to help in the God Class prediction. All remaining software
features require a high value to predict a defect or a God Class (see upward
arrows). Moreover, in terms of importance, for both models, the largest inner circles
are for NLE, NL, and AD. The importance of AD is smaller in the God Class
model than in the defect model, while NLE is slightly more important for
God Class than for the defect model. For the NL feature, the
importance is equivalent in both models.
Fig. 2. Top-10 Software Features for the Defect and God Class Models.
Refused Bequest. Figure 3 shows the top-10 features that contribute the most
to the Defect and Refused Bequest models. We can observe from the Venn dia-
gram in Figure 3 that the defect model has an intersection of 40% (4 out of 10
features) with the Refused Bequest model when considering their top-10 software
features. The features that intersect are CD, AD, NLE, and DIT. It is interesting
to notice that for 3 out of the 4 software features in the intersection, the values
that help to detect the Refused Bequest have to be low (see arrows pointing
down), while for the defect model, all of them require high values. Fur-
thermore, most of the Refused Bequest features have to be low (6 out of 10, or 60%). In
terms of importance, the DIT and NLE features were similar for both models.
However, for both CD and AD, their contribution to the Refused Bequest model
was smaller. Additionally, two features that highly contributed to the Refused
Bequest are not in the intersection (NOP and NOA), while one (NL) is outside
the intersection for the defect model. We also note that three features are related
to the inheritance quality attribute, but only one intersects for both models, the
DIT one. We also observe that the size is relevant for both models. However,
we do not have any size feature on the intersection of the models. The cohesion
aspect was important only for the Refused Bequest model. The documentation
attribute, which is relevant for the defect model (4 out of 10 features), has two features
with small importance (CLOC and PDA). The complexity attribute, indicated
by NLE, is also very relevant for both models. CBO is the only coupling metric
in the Refused Bequest model.
Fig. 3. Top-10 Software Features for the Defect and Refused Bequest Models.
Spaghetti Code. Figure 4 presents the 10 features that are most important
to the Defect and Spaghetti Code models. We observe in Figure 4 that the
Spaghetti Code model has 50% of intersection with the defect model. They in-
tersect with the CD, CLOC, CLLC, NL, and NLE features. For both models,
most features need high values, except one for Spaghetti Code, the CD. The
features NL, NLE, and CLOC had similar importance. On the other hand, the
CD feature contributes less to the Spaghetti Code. Meanwhile, the CLLC fea-
ture contributes less to the defect model than to the Spaghetti Code model. It is
interesting to notice that most features that highly contribute to the Spaghetti
Code prediction are outside the intersection (NOI, TNOS, and CBO). Further-
more, the complexity quality attribute intersects both models (i.e., 2 out of 5). In
addition, two of the documentation features on the defect model are important
for the Spaghetti Code model. In terms of clone duplication, it also intersects
half of the features of the Spaghetti Code model (CLLC). The size is relevant for
both models, but none of the features intersects (2 out of 10 for both models).
The features TLOC and NLG appear on the defect model, while TNOS and
TNLA appear on the Spaghetti Code model. Coupling is exclusive to the Spaghetti
Code model, while the inheritance is exclusive to the defect model.
After observing the three figures (Figures 2, 3, and 4), we notice some inter-
sections between the four models. For instance, CLOC is important for Defect,
God Class, and Spaghetti Code models, even though the importance for God
Fig. 4. Top-10 Software Features for the Defect and Spaghetti Code Models.
Class was smaller (see inner circle sizes). For this trio, NL and
CLLC are also important for all three models, although CLLC contributes little
in comparison to other features. For the Defect, God Class, and Refused
Bequest, we highlight that the AD feature has high importance for all three
models. Meanwhile, we also have some intersections between the code smell models. For
the God Class and Spaghetti Code pair, we note that both NOI and TNOS are
highly relevant to the models. Finally, CBO is important for the God Class,
Refused Bequest, and Spaghetti Code, but with moderate importance.
RQ2. There is a group of software features that intersect between the defect
model and the three code smells. Most importantly, Nesting Level Else-If
(NLE) and Comment Density (CD) appear in all four models, although the
influence of CD is considerably low for the evaluated code smells. Furthermore,
CBO is important for all the code smells, but not for the defect model.
Figure 5 presents the number of features that correspond to the evaluated
quality attributes according to the top-10 features discovered by SHAP. We
stack each quality attribute horizontally to facilitate the comparison between
them. Hence, our results indicate that practitioners do not need to concentrate
on all software features to predict defects and the investigated code smells. A
subset of features is enough to predict the targets. For instance, software features
related to the documentation are the most relevant for the Defect and God Class
models, with 4 and 3 features on the top-10, respectively. The Refused Bequest
model needs software features related to the inheritance (3 features), but size
and documentation are also relevant with two features each. Meanwhile, the
Spaghetti Code model is the most comprehensive, requiring features linked to
documentation, size, complexity, coupling, and clone duplication, with all of
them having two features.
Based on the results discussed, we conclude that the four ensemble machine
learning models require at least one software feature related to documentation
(CD) and complexity (NLE) to predict the target. Hence, future studies about
defect and code smell prediction, independently of the dataset and domain, could
focus on these two features. Furthermore, as we can observe in Figure
5, considering all the machine learning models evaluated, documentation,
complexity, and size are the most important quality attributes contributing
to the detection of defects and code smells.
RQ3. The most relevant quality attributes to predict defects and code smells
vary greatly between the models. For instance, documentation is more impor-
tant for the Defect and God Class models, while Spaghetti Code has all of its five
quality attributes with the same importance, and Refused Bequest prioritizes
inheritance. In general, documentation, complexity, and size contribute more
to the prediction of defects and the investigated code smells.
5 Threats to Validity
the defects and code smell. In this case, we limit the scope to the Java
programming language to make our analysis feasible. However, we selected
relevant systems that vary in domains, maturity, and development practices.
For this reason, we cannot guarantee that our results generalize to other
programming languages.
− Construct Validity: The use of SHAP is a possible threat to construct
validity [79]. There are other tools to explain a machine learning model in
the literature, such as LIME [60]. However, we tested only SHAP in our
experimentation. Further iterations of this study could compare SHAP to other
tools that focus on model explainability.
− Conclusion Validity: Our study could only match part of the data col-
lected with Organic against the defect dataset. Even though we pulled the same
version from GitHub, we could not find some matching classes within the
dataset. One of the main reasons for unmatched software classes is proba-
bly the renaming of classes and refactoring of dependencies. For this reason, we
cannot guarantee how different the results would be if we could match more
classes. Furthermore, our study focuses on a diverse set of domains, which
is a potential issue for generalization.
6 Related Work
Defect Prediction. Several studies [42,75] share the approach of applying code
metrics to defect prediction. They vary in terms of accuracy, complexity, target
programming language, input prediction density, and machine learning models.
Menzies et al. [42] presented defect classifiers using code attributes defined by
McCabe and Halstead metrics. They concluded that the choice of the learning
method is more important than which subset of the available data we use for
learning the software defects. In a similar approach, Turhan et al. [75] used
cross-company data for building localized defect predictors. They applied principles
of analogy-based learning to cross-company data to fine-tune these models for
localization, using static code features extracted from the source code, such
as complexity features and Halstead metrics. They concluded that cross-
company data are useful in extreme cases and when within-company data is not
available [75].
In the same direction, the study of Turhan et al. [76] evaluates the effect of
mixing data from different project stages. In this case, the authors use within
and cross-project data to improve the prediction performance. They show that
mixing project data based on the same project stage does not significantly im-
prove the model performance. Hence, they concluded that optimal data for de-
fect prediction is still an open challenge for researchers [76]. Similarly, He et al.
[27] investigate defect prediction based on data selection. The authors propose
a brute-force approach to select the most relevant data for learning the soft-
ware defects. To do so, they conduct three large-scale experiments on
34 datasets obtained from ten open-source projects. They conclude that training
data from the same project does not always help to improve the prediction per-
formance [27]. On the other hand, we base our investigation on ensemble learning
to improve the prediction performance and a wide set of software features.
Code Smell Prediction. Several automated detection strategies for code smells
and anti-patterns have been proposed in the literature [18]. These approaches use diverse
identification strategies. For instance, some methods are based on combinations
of metrics [48,57]; refactoring opportunities [19]; textual information [54]; histori-
cal data [52]; and machine learning techniques [7,12,14,20,21,35,40,41]. Khomh et
al. [35] used Bayesian Belief Networks to detect three anti-patterns. They trained
the models using two Java open-source systems. Maiga et al. [41] investigated the
performance of the Support Vector Machine trained in three systems to predict
four anti-patterns. Later, the authors introduced a feedback system to their model
[40]. Amorim et al. [7] investigated the performance of Decision Trees to detect four
code smells in one version of the Gantt project. Differently from these works, our
dataset is composed of 14 systems, and we evaluate 9 code smells at the class level.
Cruz et al. [12] evaluated seven models to detect four code smells in 20
systems. The authors found that algorithms based on trees had a better F1
score than other models. Fontana et al. [20] evaluated six models to predict four
smells. However, they have used the severity of the smells as the target. They
reported high-performance numbers of the evaluated models. Later, Di Nucci et
al. [14] replicated this study [20] to address several limitations that potentially
biased the models’ performance. The authors found that the models’
performance, when compared to the reference study, was 90% lower, indicating
the need to further explore how to improve code smell prediction. Differently
from previous work on code smell prediction, we are interested in exploring the
similarities and differences between models for predicting code smells, in contrast
with the models for defect prediction.
Defects and Code Smells. Several works tried to understand how code smells
can affect software, evaluating different aspects of quality, such as maintainability
[21,67,82], modularity [62], program comprehension [2], change-proneness [33,34],
and how developers perceive code smells [53,81]. Other studies aim to evaluate
how code smells impact the defect proneness [24,28,34,49,50,51]. Olbrich et al.
[49] evaluated the fault-proneness evolution of the God Class and Brain Class of
three open-source systems. They discovered that classes with these two smells
can be more faulty, however, this did not hold for all analyzed systems. Similarly,
Khomh et al. [34] evaluated the impact on fault-proneness of 13 different smells
in several versions of three large open-source systems. They report the existence
of a relationship between some code smells with defects, but it is not consistent
for all system versions. Openja et al. [50] evaluated how code smells can make the
class more fault-prone in quantum projects. Differently from these studies, we
aim to understand whether models built for defects and code smells are similar
or not.
Hall et al. [24] investigated if files with smells present more defects than files
that do not have them. They found that for most of these smells, there is no
statistical difference between smelly and non-smelly classes. Palomba et al. [51]
evaluated how 13 code smells affect the presence of defects using a dataset of
30 open-source java systems. They reported that classes with smells have more
bug fixes than classes that do not have any smells. Jebnoun et al. [28] evaluated
how Code Clones are related to defects in three different programming languages.
They concluded that smelly classes are more defect-prone, although this varies with
the programming language. Differently from these three studies, we aim to
understand how defect prediction models differ from the models used to detect
code smells, rather than establishing a correlation between defects and code smells.
Explainable Machine Learning for Software Features. Software defect
explainability is a relatively recent topic in the literature [30,46,58]. Mori and
Uchihira [46] analyzed the trade-off between accuracy and interpretability of
various models. Their experiments compare models in search of a balanced
output that satisfies both accuracy and interpretability criteria. Likewise,
Jiarpakdee et al. [30] empirically evaluated two model-agnostic procedures, Local
Interpretability Model-agnostic Explanations (LIME) [60] and BreakDown tech-
niques. They improved the results obtained with LIME using hyperparameter
optimization, which they called LIME-HPO. This work concludes that model-
agnostic methods are necessary to explain individual predictions of defect mod-
els. Finally, Pornprasit et al. [58] proposed a tool that predicts defects for systems
developed in Python. The input data consists of software commits, and the au-
thors compare its performance with LIME-HPO [30]. They conclude that
the results are comparable to state-of-the-art techniques for explaining models.
7 Conclusion
In this work, we investigated the relationship between defects and code smell
machine learning models. To do so, we identified and validated the code smells
collected with Organic. Then, we applied an extensive data processing step to
clean the data and select the most relevant features for the prediction model.
Subsequently, we trained and evaluated the models using an ensemble of models.
In the end, as the models performed well, we employed an explainability tech-
nique known as SHAP to understand the models’ decisions. We concluded that
among the seven code smells initially collected, only three of them were similar
to the defect model (Refused Bequest, God Class, and Spaghetti Code). In ad-
dition, we found that the features Nesting Level Else-If and Comment Density
were relevant for the four models. Furthermore, most features require high val-
ues to predict defects and code smells, except for the Refused Bequest. Finally,
we reported that the documentation, complexity, and size quality attributes are
the most relevant for these models. In future steps of this investigation, we
plan to compare the SHAP results with other techniques (e.g., LIME) and employ
white-box models to simplify explainability. Another potential application of
our study is the comparison of the reported code smells with those of other tools.
We encourage the community to further investigate and replicate our results.
For this reason, we made all data available under the replication package [64].
References
1. IEEE standard glossary of software engineering terminology. IEEE Std 610.12-1990 (1990)
2. Abbes, M., Khomh, F., Guéhéneuc, Y., Antoniol, G.: An empirical study of the
impact of two antipatterns, blob and spaghetti code, on program comprehension.
In: European Conference on Software Maintenance and Reengineering (CSMR)
(2011)
3. Abdullah AlOmar, E., Wiem Mkaouer, M., Ouni, A., Kessentini, M.: Do Design
Metrics Capture Developers Perception of Quality? An Empirical Study on Self-
Affirmed Refactoring Activities. In: International Symposium on Empirical Soft-
ware Engineering and Measurement (ESEM) (2019)
4. Aghajani, E., Nagy, C., Linares-Vásquez, M., Moreno, L., Bavota, G., Lanza, M.,
Shepherd, D.C.: Software documentation: The practitioners’ perspective. In: Pro-
ceedings of the ACM/IEEE 42nd International Conference on Software Engineering
(ICSE) (2020)
5. Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: A next-generation
hyperparameter optimization framework. In: International Conference on Knowl-
edge Discovery & Data Mining (SIGKDD) (2019)
6. Ali, M.: PyCaret: An open source, low-code machine learning library in Python,
https://www.pycaret.org
7. Amorim, L., Costa, E., Antunes, N., Fonseca, B., Ribeiro, M.: Experience report:
Evaluating the effectiveness of decision trees for detecting code smells. In: Inter-
national Symposium on Software Reliability Engineering (ISSRE) (2015)
8. Basili, V.R., Briand, L.C., Melo, W.L.: A validation of object-oriented design met-
rics as quality indicators. IEEE Transactions on Software Engineering (TSE) (1996)
9. Brown, W.H., Malveau, R.C., McCormick, H.W.S., Mowbray, T.J.: AntiPatterns:
refactoring software, architectures, and projects in crisis. John Wiley & Sons, Inc.
(1998)
10. Bui, X.N., Nguyen, H., Soukhanouvong, P.: Extra trees ensemble: A machine learn-
ing model for predicting blast-induced ground vibration based on the bagging and
sibling of random forest algorithm. In: Proceedings of Geotechnical Challenges in
Mining, Tunneling and Underground Infrastructures (ICGMTU) (2022)
11. Cawley, G.C., Talbot, N.L.: On over-fitting in model selection and subsequent
selection bias in performance evaluation. Journal of Machine Learning Research
(JMLR) (2010)
12. Cruz, D., Santana, A., Figueiredo, E.: Detecting bad smells with machine learning
algorithms: an empirical study. In: International Conference on Technical Debt
(TechDebt) (2020)
13. D’Ambros, M., Lanza, M., Robbes, R.: An extensive comparison of bug prediction
approaches. In: 7th IEEE Working Conference on Mining Software Repositories
(MSR) (2010)
14. Di Nucci, D., Palomba, F., Tamburri, D.A., Serebrenik, A., De Lucia, A.: Detecting
code smells using machine learning techniques: Are we there yet? In: 2018 IEEE
25th International Conference on Software Analysis, Evolution and Reengineering
(SANER) (2018)
15. Ferenc, R., Tóth, Z., Ladányi, G., Siket, I., Gyimóthy, T.: A public unified bug
dataset for java. In: Proceedings of the 14th International Conference on Predictive
Models and Data Analytics in Software Engineering (PROMISE) (2018)
16. Ferenc, R., Tóth, Z., Ladányi, G., Siket, I., Gyimóthy, T.: A public unified bug
dataset for java and its assessment regarding metrics and bug prediction. In: Soft-
ware Quality Journal (SQJ) (2020)
17. Ferenc, R., Tóth, Z., Ladányi, G., Siket, I., Gyimóthy, T.: Unified bug dataset,
https://doi.org/10.5281/zenodo.3693686
18. Fernandes, E., Oliveira, J., Vale, G., Paiva, T., Figueiredo, E.: A review-based com-
parative study of bad smell detection tools. In: Proceedings of the 20th Interna-
tional Conference on Evaluation and Assessment in Software Engineering (EASE)
(2016)
19. Fokaefs, M., Tsantalis, N., Stroulia, E., Chatzigeorgiou, A.: Jdeodorant: identifi-
cation and application of extract class refactorings. In: 2011 33rd International
Conference on Software Engineering (ICSE) (2011)
20. Fontana, F.A., Mäntylä, M.V., Zanoni, M., Marino, A.: Comparing and experi-
menting machine learning techniques for code smell detection. In: Empirical Soft-
ware Engineering (EMSE) (2016)
21. Fontana, F.A., Zanoni, M., Marino, A., Mäntylä, M.V.: Code smell detection: To-
wards a machine learning-based approach. In: International Conference on Software
Maintenance (ICSM) (2013)
22. Fowler, M.: Refactoring: Improving the Design of Existing Code. Addison-Wesley
(1999)
23. Fukushima, T., Kamei, Y., McIntosh, S., Yamashita, K., Ubayashi, N.: An empiri-
cal study of just-in-time defect prediction using cross-project models. In: Working
Conference on Mining Software Repositories (MSR) (2014)
24. Hall, T., Zhang, M., Bowes, D., Sun, Y.: Some code smells have a significant but
small effect on faults. In: Transactions on Software Engineering and Methodology
(TOSEM) (2014)
25. Haskins, B., Stecklein, J., Dick, B., Moroney, G., Lovell, R., Dabney, J.: Error cost
escalation through the project life cycle. In: INCOSE International Symposium
(2004)
26. Hassan, A.E.: Predicting faults using the complexity of code changes. In: Interna-
tional Conference of Software Engineering (ICSE) (2009)
27. He, Z., Shu, F., Yang, Y., Li, M., Wang, Q.: An investigation on the feasibility of
cross-project defect prediction. In: Automated Software Engineering (ASE) (2012)
28. Jebnoun, H., Rahman, M.S., Khomh, F., Muse, B.: Clones in deep learning code:
What, where, and why? In: Empirical Software Engineering (EMSE) (2022)
29. Jiang, T., Tan, L., Kim, S.: Personalized defect prediction. In: 28th IEEE/ACM
International Conference on Automated Software Engineering (ASE) (2013)
30. Jiarpakdee, J., Tantithamthavorn, C., Dam, H.K., Grundy, J.: An empirical study
of model-agnostic techniques for defect prediction models. In: Transactions on Soft-
ware Engineering (TSE) (2020)
31. Jureczko, M., Spinellis, D.: Using object-oriented design metrics to predict software
defects. In: Models and Methods of System Dependability (MMSD) (2010)
32. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y.:
Lightgbm: A highly efficient gradient boosting decision tree. In: 31st Conference
on Neural Information Processing System (NIPS) (2017)
33. Khomh, F., Di Penta, M., Gueheneuc, Y.: An exploratory study of the impact of
code smells on software change-proneness. In: Proceedings of the 16th Working
Conference on Reverse Engineering (WCRE) (2009)
34. Khomh, F., Di Penta, M., Guéhéneuc, Y., Antoniol, G.: An exploratory study
of the impact of antipatterns on class change- and fault-proneness. In: Empirical
Software Engineering (EMSE) (2012)
35. Khomh, F., Vaucher, S., Guéhéneuc, Y., Sahraoui, H.: Bdtex: A gqm-based
bayesian approach for the detection of antipatterns. In: Journal of Systems and
Software (JSS) (2011)
36. Lanza, M., Marinescu, R., Ducasse, S.: Object-Oriented Metrics in Practice.
Springer-Verlag (2005)
37. Levin, S., Yehudai, A.: Boosting automatic commit classification into maintenance
activities by utilizing source code changes. In: Proceedings of the 13th International
Conference on Predictor Models in Software Engineering (PROMISE) (2017)
38. Lin, Z., Ding, G., Hu, M., Wang, J.: Multi-label classification via feature-aware
implicit label space encoding. In: International Conference on International Con-
ference on Machine Learning (ICML) (2014)
39. Lundberg, S.M., Lee, S.: A unified approach to interpreting model predictions. In:
Conference on Neural Information Processing Systems (NIPS) (2017)
40. Maiga, A., Ali, N., Bhattacharya, N., Sabané, A., Guéhéneuc, Y., Aimeur, E.:
Smurf: A svm-based incremental anti-pattern detection approach. In: Working
Conference on Reverse Engineering (WCRE) (2012)
41. Maiga, A., Ali, N., Bhattacharya, N., Sabané, A., Guéhéneuc, Y., Antoniol, G.,
Aïmeur, E.: Support vector machines for anti-pattern detection. In: Proceedings
of the International Conference on Automated Software Engineering (ASE) (2012)
42. Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn
defect predictors. In: Transactions on Software Engineering (TSE) (2007)
43. Menzies, T., Milton, Z., Turhan, B., Cukic, B., Jiang, Y., Bener, A.: Defect pre-
diction from static code features: current results, limitations, new approaches. In:
Automated Software Engineering (ASE) (2010)
44. Menzies, T., Zimmermann, T.: Software analytics: So what? In: IEEE Software
(2013)
45. Menzies, T., Distefano, J., Orrego, A., Chapman, R.: Assessing predictors of
software defects. In: Workshop on Predictive Software Models
(PROMISE) (2004)
46. Mori, T., Uchihira, N.: Balancing the trade-off between accuracy and interpretabil-
ity in software defect prediction. In: Empirical Software Engineering (EMSE)
(2018)
47. Nagappan, N., Ball, T., Zeller, A.: Mining metrics to predict component failures.
In: International Conference on Software Engineering (ICSE) (2006)
48. Oizumi, W., Sousa, L., Oliveira, A., Garcia, A., Agbachi, A.B., Oliveira, R., Lu-
cena, C.: On the identification of design problems in stinky code: experiences and
tool support. In: Journal of the Brazilian Computer Society (JBCS) (2018)
49. Olbrich, S.M., Cruzes, D.S., Sjøberg, D.I.K.: Are all code smells harmful? a study
of god classes and brain classes in the evolution of three open source systems. In:
IEEE International Conference on Software Maintenance (ICSM) (2010)
50. Openja, M., Morovati, M.M., An, L., Khomh, F., Abidi, M.: Technical debts and
faults in open-source quantum software systems: An empirical study. Journal of
Systems and Software (JSS) (2022)
51. Palomba, F., Bavota, G., Di Penta, M., Fasano, F., Oliveto, R., De Lucia, A.:
On the diffuseness and the impact on maintainability of code smells: A large scale
empirical investigation. In: IEEE/ACM 40th International Conference on Software
Engineering (ICSE) (2018)
52. Palomba, F., Bavota, G., Di Penta, M., Oliveto, R., De Lucia, A., Poshyvanyk,
D.: Detecting bad smells in source code using change history information. In: 28th
IEEE/ACM International Conference on Automated Software Engineering (ASE)
(2013)
A Study on Model’s Similarities for Defect and Code Smells 303
53. Palomba, F., Bavota, G., Penta, M.D., Oliveto, R., Lucia, A.D.: Do they really
smell bad? a study on developers’ perception of bad code smells. In: IEEE Inter-
national Conference on Software Maintenance and Evolution (ICSME) (2014)
54. Palomba, F., Panichella, A., De Lucia, A., Oliveto, R., Zaidman, A.: A textual-
based technique for smell detection. In: 2016 IEEE 24th international conference
on program comprehension (ICPC) (2016)
55. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine
learning in Python. Journal of Machine Learning Research (JMLR) (2011)
56. Petrić, J., Bowes, D., Hall, T., Christianson, B., Baddoo, N.: The jinx on the
nasa software defect data sets. In: International Conference on Evaluation and
Assessment in Software Engineering (EASE) (2016)
57. PMD: Pmd source code analyser, https://pmd.github.io/
58. Pornprasit, C., Tantithamthavorn, C., Jiarpakdee, J., Fu, M., Thongtanunam, P.:
Pyexplainer: Explaining the predictions of just-in-time defect models. In: Interna-
tional Conference on Automated Software Engineering (ASE) (2021)
59. Raju, V.N.G., Lakshmi, K.P., Jain, V.M., Kalidindi, A., Padma, V.: Study the
influence of normalization/transformation process on the accuracy of supervised
classification. In: 2020 Third International Conference on Smart Systems and In-
ventive Technology (ICSSIT) (2020)
60. Ribeiro, M.T., Singh, S., Guestrin, C.: ”why should i trust you?”: Explaining the
predictions of any classifier. In: International Conference on Knowledge Discovery
and Data Mining (KDD) (2016)
61. Riel, A.: Object Oriented Design Heuristics. Addison-Wesley Professional (1996)
62. Santana, A., Cruz, D., Figueiredo, E.: An exploratory study on the identification
and evaluation of bad smell agglomerations. In: Proceedings of the 36th Annual
ACM Symposium on Applied Computing (SAC) (2021)
63. Santos, G., Figueiredo, E., Veloso, A., Viggiato, M., Ziviani, N.: Understanding
machine learning software defect predictions. In: Automated Software Engineering
Journal (ASEJ) (2020)
64. Santos, G.: gesteves91/artifact-fase-santos-23: FASE Artifact Evaluation 2023 (Jan
2023), https://doi.org/10.5281/zenodo.7502546
65. Sayyad S., J., Menzies, T.: The PROMISE Repository of Software Engineering
Databases. (2005), http://promise.site.uottawa.ca/SERepository
66. Schumacher, J., Zazworka, N., Shull, F., Seaman, C.B., Shaw, M.A.: Building em-
pirical support for automated code smell detection. In: International Symposium
on Empirical Software Engineering and Measurement (ESEM) (2010)
67. Sjøberg, D.I.K., Yamashita, A., Anda, B.C.D., Mockus, A., Dybå, T.: Quantifying
the effect of code smells on maintenance effort. In: IEEE Transactions on Software
Engineering (TSE) (2013)
68. Stroulia, E., Kapoor, R.: Metrics of refactoring-based development: An experi-
ence report. 7th International Conference on Object Oriented Information Systems
(OOIS) (2001)
69. Tantithamthavorn, C., Hassan, A.E.: An experience report on defect modelling in
practice: Pitfalls and challenges. In: International Conference on Software Engi-
neering: Software Engineering in Practice (ICSE-SEIP) (2018)
70. Tantithamthavorn, C., Hassan, A.E., Matsumoto, K.: The impact of class rebalanc-
ing techniques on the performance and interpretation of defect prediction models.
In: Transactions on Software Engineering (TSE) (2019)
304 G. Santos et al.
71. Tantithamthavorn, C., McIntosh, S., Hassan, A.E., Ihara, A., Matsumoto, K.: The
impact of mislabelling on the performance and interpretation of defect prediction
models. In: International Conference on Software Engineering (ICSE) (2015)
72. Tantithamthavorn, C., McIntosh, S., Hassan, A.E., Matsumoto, K.: An empirical
comparison of model validation techniques for defect prediction models. In: IEEE
Transactions on Software Engineering (TSE) (2017)
73. Tantithamthavorn, C., McIntosh, S., Hassan, A.E., Matsumoto, K.: The impact of
automated parameter optimization on defect prediction models. In: Transactions
on Software Engineering (TSE) (2019)
74. Tóth, Z., Gyimesi, P., Ferenc, R.: A public bug database of github projects and
its application in bug prediction. In: Computational Science and Its Applications
(ICCSA) (2016)
75. Turhan, B., Menzies, T., Bener, A.B., Di Stefano, J.: On the relative value of
cross-company and within-company data for defect prediction. Empirical Software
Engineering (EMSE) (2009)
76. Turhan, B., Tosun, A., Bener, A.: Empirical evaluation of mixed-project defect
prediction models. In: Proceedings of the 37th Conference on Software Engineering
and Advanced Applications (SEAA) (2011)
77. Vale, G., Hunsen, C., Figueiredo, E., Apel, S.: Challenges of resolving merge con-
flicts: A mining and survey study. In: Transactions on Software Engineering (TSE)
(2021)
78. Wang, S., Liu, T., Tan, L.: Automatically learning semantic features for defect
prediction. In: International Conference of Software Engineering (ICSE) (2016)
79. Wohlin, C., Runeson, P., Hst, M., Ohlsson, M.C., Regnell, B., Wessln, A.: Exper-
imentation in Software Engineering. Springer (2012)
80. Xuan, X., Lo, D., Xia, X., Tian, Y.: Evaluating defect prediction approaches using
a massive set of metrics: An empirical study. In: Proceedings of the 30th Annual
ACM Symposium on Applied Computing (SAC) (2015)
81. Yamashita, A., Moonen, L.: Do developers care about code smells? an exploratory
survey. In: 20th Working Conference on Reverse Engineering (WCRE) (2013)
82. Yamashita, A., Counsell, S.: Code smells as system-level indicators of maintain-
ability: An empirical study. In: Journal of Systems and Software (JSS) (2013)
83. Yatish, S., Jiarpakdee, J., Thongtanunam, P., Tantithamthavorn, C.: Mining soft-
ware defects: Should we consider affected releases? In: International Conference on
Software Engineering (ICSE) (2019)
84. Zimmermann, T., Premraj, R., Zeller, A.: Predicting defects for eclipse. In: In-
ternational Workshop on Predictor Models in Software Engineering (PROMISE)
(2007)
A Study on Model’s Similarities for Defect and Code Smells 305
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Competition Contributions
Software Testing: 5th Comparative Evaluation:
Test-Comp 2023
Dirk Beyer
1 Introduction
• Establish standards for software test generation. This means, most prominently,
developing a standard for marking input values in programs, defining an exchange
format for test suites, agreeing on a specification language for test-coverage
criteria, and defining how to validate the resulting test suites.
This report extends previous reports on Test-Comp [7,8,9,10,11].
Reproduction packages are available on Zenodo (see Table 3).
(B) [email protected]
Fig. 1: Flow of the Test-Comp execution for one test generator (taken from [8]): the test generator receives the program under test and a test specification and produces a test suite; the test-suite validator executes the suite and outputs a bug report and coverage statistics.
The test-suite validator receives the test suite from
the test generator and validates it by executing the program on all test cases:
for bug finding, it checks whether the bug is exposed, and for coverage, it reports the
achieved coverage. We use the tool TestCov [23] 2 as the test-suite validator; a sketch
of a test case in the exchange format follows.
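For concreteness, a single test case in the exchange format consumed by TestCov is a small XML file along the following lines; the input values are invented for illustration, and the real format additionally requires a DOCTYPE declaration and a metadata file inside the test-suite archive (a hedged sketch, not a normative example):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical test case: each <input> element supplies the value
     returned by the next call to an input function during execution. -->
<testcase>
  <input>42</input>
  <input>-7</input>
</testcase>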
Task-Definition Format 2.0. Test-Comp 2023 again used the task-definition format
in version 2.0 (a sketch of the format follows below).
2 https://gitlab.com/sosy-lab/software/test-suite-validator
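For illustration, a task definition in version 2.0 of the format is a YAML file roughly as follows; the file names are invented, and the property files named here correspond to the two Test-Comp coverage properties (a minimal sketch under these assumptions, not a verbatim benchmark entry):

format_version: '2.0'
# the C program that constitutes the program under test
input_files: 'example.c'
properties:
  # Cover-Error: find a test case that reaches the error function
  - property_file: ../properties/coverage-error-call.prp
  # Cover-Branches: maximize branch coverage
  - property_file: ../properties/coverage-branches.prp
options:
  language: C
  data_model: ILP32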
Benchmark Programs. The input programs were taken from the largest and
most diverse open-source repository of software-verification and test-generation
tasks 3, which is also used by SV-COMP [13]. As in 2020 and 2021, we selected
all programs for which the following properties were satisfied (see issue on
GitLab 4 and report [9]; a minimal example follows the list):
1. compiles with gcc, if a harness for the special methods 5 is provided,
2. contains at least one call to a nondeterministic function,
3. does not rely on nondeterministic pointers,
4. does not have expected result ‘false’ for property ‘termination’, and
5. has expected result ‘false’ for property ‘unreach-call’ (only for category Error
Coverage).
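The following is a minimal sketch of a program that satisfies these criteria; the function names follow the SV-COMP/Test-Comp conventions, while the program itself is invented for illustration:

/* Hypothetical test-generation task: it calls a nondeterministic input
   function and contains a bug that a generated test case can expose. */
extern int __VERIFIER_nondet_int(void);
extern void reach_error(void);

int main(void) {
  int x = __VERIFIER_nondet_int();  /* input value supplied by a test case */
  if (2 * x == 84) {
    reach_error();  /* reached exactly when a test case provides x == 42 */
  }
  return 0;
}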
This selection yielded a total of 4 106 test-generation tasks, namely 1 173 tasks
for category Error Coverage and 2 933 tasks for category Code Coverage. The
test-generation tasks are partitioned into categories, which are listed in
Tables 6 and 7 and described in detail on the competition web site. 6 Figure 2
illustrates the category composition.
Category Error-Coverage. The first category evaluates the ability to discover
bugs. The benchmark set consists of programs that each contain a bug. For every
tool and every test-generation task, we produce one of the following scores: 1 point
if the validator succeeds in executing the program under test on a generated test
case that exposes the bug (i.e., the specified function was called), and 0 points otherwise.
Category Branch-Coverage. The second category evaluates how many branches
of the program the generated test suites cover. This coverage criterion was chosen
because many test generators support it by default; other coverage criteria
can be reduced to branch coverage by transformation [35]. For every
tool and every test-generation task, we produce the branch coverage achieved
by the generated test cases (as reported by TestCov [23]; a value between 0
and 1). The score is the returned coverage.
Ranking. The ranking was decided based on the sum of points (normalized for
meta categories). In case of a tie, the ranking was decided based on the run time,
which is the total CPU time over all test-generation tasks. Opt-out from categories
was possible and scores for categories were normalized based on the number of
tasks per category (see competition report of SV-COMP 2013 [6], page 597).
3 https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks
4 https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks/-/merge_requests/774
5 https://test-comp.sosy-lab.org/2023/rules.php
6 https://test-comp.sosy-lab.org/2023/benchmarks.php
Fig. 2: Category composition: category Cover-Error comprises the sub-categories Arrays, BitVectors, ControlFlow, ECA, Floats, Heap, Loops, ProductLines, Recursive, Sequentialized, XCSP, Hardware, BusyBox-MemSafety, and DeviceDriversLinux64-ReachSafety; category Cover-Branches comprises Arrays, BitVectors, ControlFlow, ECA, Floats, Heap, Loops, ProductLines, Recursive, Sequentialized, XCSP, Combinations, BusyBox-MemSafety, DeviceDriversLinux64-ReachSafety, SQLite-MemSafety, and MainHeap; both categories together form C-Overall.
4 Reproducibility
For reproducibility, four kinds of artifacts are published: (a) the test-generation tasks, (b) the benchmark definitions, (c) the tool-info modules, and (d) the tester archives.
Table 4: Competition candidates with tool references and representing jury members;
‘new’ indicates first-time participants, ‘∅’ indicates hors-concours participation
Tester Ref. Jury member Affiliation
CoVeriTest [19,39] Marie-Christine Jakobs TU Darmstadt, Germany
ESBMC-kind (new) [33,32] Rafael Sá Menezes U. of Manchester, UK
FuSeBMC [3,4] Kaled Alshmrany U. of Manchester, UK
FuSeBMC_IA (new) [1,2] Mohannad Aldughaim U. of Manchester, UK
HybridTiger (∅) [26,47] (hors concours) –
KLEE (∅) [27,28] (hors concours) –
Legion (∅) [42,43] (hors concours) –
Legion/SymCC [43] Gidon Ernst LMU Munich, Germany
PRTest [22,41] Thomas Lemberger QAware GmbH, Germany
Symbiotic [29,30] Marek Trtík Masaryk U., Brno, Czechia
TracerX [37,38] Joxan Jaffar National U. of Singapore, Singapore
VeriFuzz [45] Raveendra Kumar M. Tata Consultancy Services, India
WASP-C (new) [44] Filipe Marques INESC-ID, Lisbon, Portugal
5 Results and Discussion
This section presents the results of the competition experiments. The report
is intended to help in understanding the state of the art and the advances in fully
automatic test generation for whole C programs, in terms of effectiveness (test
coverage, as accumulated in the score) and efficiency (resource consumption
in terms of CPU time). All results mentioned in this article were inspected
and approved by the participants.
Participating Test-Suite Generators. Table 4 provides an overview of the
participating test generators and references to publications, as well as the team
representatives of the jury of Test-Comp 2023. (The competition jury consists
of the chair and one member of each participating team.) An online table with
information about all participating systems is provided on the competition web
site.8 Table 5 lists the features and technologies that are used in the test generators.
There are test generators that did not actively participate (e.g., tester archives
taken from last year) and that are not included in the rankings. Those are called
hors-concours participations, and their tool names are labeled with the symbol ‘∅’.
Computing Resources. The computing environment and the resource limits
were the same as for Test-Comp 2020 [8], except for the upgraded operating system:
Each test run was limited to 8 processing units (cores), 15 GB of memory, and
15 min of CPU time. The test-suite validation was limited to 2 processing units,
7 GB of memory, and 5 min of CPU time.
8 https://test-comp.sosy-lab.org/2023/systems.php

Table 5: Features and technologies used by the participating test generators; the feature columns are Evolutionary Algorithms, Explicit-Value Analysis, Predicate Abstraction, Algorithm Selection, Symbolic Execution, Random Execution, Portfolio, and CEGAR.

The machines for running the experiments
are part of a compute cluster that consists of 168 machines; each test-generation
run was executed on an otherwise completely unloaded, dedicated machine, in
order to achieve precise measurements. Each machine had one Intel Xeon E3-
1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of
RAM, and a GNU/Linux operating system (x86_64-linux, Ubuntu 22.04 with
Linux kernel 5.15). We used BenchExec [24] to measure and control computing
resources (CPU time, memory, CPU energy) and VerifierCloud9 to distribute,
install, run, and clean-up test-case generation runs, and to collect the results. The
values for time and energy are accumulated over all cores of the CPU. To measure
the CPU energy, we use CPU Energy Meter [25] (integrated in BenchExec [24]).
Further technical parameters of the competition machines are available in the
repository which also contains the benchmark definitions. 10
9 https://vcloud.sosy-lab.org
10 https://gitlab.com/sosy-lab/test-comp/bench-defs/tree/testcomp22
Table 6: Quantitative overview over all results (columns: Tester; Cover-Error, 1173 tasks; Cover-Branches, 2933 tasks; Overall, 4106 tasks); empty cells mark opt-outs; ‘new’ indicates first-time participants, ‘∅’ indicates hors-concours participation
Table 7: Overview of the top-three test generators for each category (measurement
values for CPU time and energy rounded to two significant digits)
11 https://test-comp.sosy-lab.org/2023/results
Fig. 4: Number of evaluated test generators for each year (top: number of first-time
participants; bottom: previous year’s participants)
Fig. 5: Quantile functions for category Overall (x-axis: cumulative score; y-axis:
min. number of test tasks; one graph per tester: CoVeriTest, FuSeBMC, FuSeBMC-IA,
HybridTiger, KLEE, Legion/SymCC, PRTest, Symbiotic, VeriFuzz, WASP-C). Each
quantile function illustrates the quantile (x-coordinate) of the scores obtained by
test-generation runs below a certain number of test-generation tasks (y-coordinate).
More details were given previously [9]. The graphs are decorated with symbols to
make them better distinguishable without color.
6 Conclusion
The Competition on Software Testing took place for the 5th time and provides
an overview of fully automatic test-generation tools for C programs. A total of
13 test-suite generators were compared (see Fig. 4 for the participation numbers and
Table 4 for the details). This off-site competition uses a benchmark infrastructure
that makes the execution of the experiments fully automatic and reproducible.
Transparency is ensured by making all components available in public repositories
and by having a jury (consisting of members from each team) oversee the process.
All test suites were validated by the test-suite validator TestCov [23] to measure
the coverage. The results of the competition are presented at the 26th International
Conference on Fundamental Approaches to Software Engineering at ETAPS 2023.
Data-Availability Statement. The test-generation tasks and results of the com-
petition are published at Zenodo, as described in Table 3. All components and data
that are necessary for reproducing the competition are available in public version
repositories, as specified in Table 2. For easy access, the results are presented also
online on the competition web site https://test-comp.sosy-lab.org/2023/results.
Funding Statement. This project was funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 418257054 (Coop).
References
1. Aldughaim, M., Alshmrany, K.M., Gadelha, M.R., de Freitas, R., Cordeiro, L.C.:
FuSeBMC_IA: Interval analysis and methods for test-case generation (competition
contribution). In: Proc. FASE. LNCS 13991, Springer (2023)
2. Aldughaim, M., Alshmrany, K.M., Mustafa, M., Cordeiro, L.C., Stancu, A.: Bounded
model checking of software using interval methods via contractors. arXiv/CoRR
2012(11245) (December 2020). https://doi.org/10.48550/arXiv.2012.11245
3. Alshmrany, K., Aldughaim, M., Cordeiro, L., Bhayat, A.: FuSeBMC v.4: Smart seed
generation for hybrid fuzzing (competition contribution). In: Proc. FASE. pp. 336–
340. LNCS 13241, Springer (2022). https://doi.org/10.1007/978-3-030-99429-7_19
4. Alshmrany, K.M., Aldughaim, M., Bhayat, A., Cordeiro, L.C.: FuSeBMC:
An energy-efficient test generator for finding security vulnerabili-
ties in C programs. In: Proc. TAP. pp. 85–105. Springer (2021).
https://doi.org/10.1007/978-3-030-79379-1_6
5. Bartocci, E., Beyer, D., Black, P.E., Fedyukovich, G., Garavel, H., Hartmanns, A.,
Huisman, M., Kordon, F., Nagele, J., Sighireanu, M., Steffen, B., Suda, M., Sutcliffe,
G., Weber, T., Yamada, A.: TOOLympics 2019: An overview of competitions in
formal methods. In: Proc. TACAS (3). pp. 3–24. LNCS 11429, Springer (2019).
https://doi.org/10.1007/978-3-030-17502-3_1
6. Beyer, D.: Second competition on software verification (Summary of SV-
COMP 2013). In: Proc. TACAS. pp. 594–609. LNCS 7795, Springer (2013).
https://doi.org/10.1007/978-3-642-36742-7_43
7. Beyer, D.: Competition on software testing (Test-Comp). In:
Proc. TACAS (3). pp. 167–175. LNCS 11429, Springer (2019).
https://doi.org/10.1007/978-3-030-17502-3_11
26. Bürdek, J., Lochau, M., Bauregger, S., Holzer, A., von Rhein, A., Apel,
S., Beyer, D.: Facilitating reuse in multi-goal test-suite generation for soft-
ware product lines. In: Proc. FASE. pp. 84–99. LNCS 9033, Springer (2015).
https://doi.org/10.1007/978-3-662-46675-9_6
27. Cadar, C., Dunbar, D., Engler, D.R.: Klee: Unassisted and automatic generation
of high-coverage tests for complex systems programs. In: Proc. OSDI. pp. 209–224.
USENIX Association (2008)
28. Cadar, C., Nowack, M.: Klee symbolic execution engine in 2019 (competition
contribution). Int. J. Softw. Tools Technol. Transf. 23(6), 867–873 (December
2021). https://doi.org/10.1007/s10009-020-00570-3
29. Chalupa, M., Novák, J., Strejček, J.: Symbiotic 8: Parallel and targeted test
generation (competition contribution). In: Proc. FASE. pp. 368–372. LNCS 12649,
Springer (2021). https://doi.org/10.1007/978-3-030-71500-7_20
30. Chalupa, M., Strejček, J., Vitovská, M.: Joint forces for mem-
ory safety checking. In: Proc. SPIN. pp. 115–132. Springer (2018).
https://doi.org/10.1007/978-3-319-94111-0_7
31. Cok, D.R., Déharbe, D., Weber, T.: The 2014 SMT competition. JSAT 9, 207–242
(2016)
32. Gadelha, M.Y.R., Monteiro, F.R., Cordeiro, L.C., Nicole, D.A.: Esbmc v6.0: Ver-
ifying C programs using k -induction and invariant inference (competition con-
tribution). In: Proc. TACAS (3). pp. 209–213. LNCS 11429, Springer (2019).
https://doi.org/10.1007/978-3-030-17502-3_15
33. Gadelha, M.Y., Ismail, H.I., Cordeiro, L.C.: Handling loops in bounded model
checking of C programs via k -induction. Int. J. Softw. Tools Technol. Transf. 19(1),
97–114 (February 2017). https://doi.org/10.1007/s10009-015-0407-9
34. Godefroid, P., Sen, K.: Combining model checking and testing.
In: Handbook of Model Checking, pp. 613–649. Springer (2018).
https://doi.org/10.1007/978-3-319-10575-8_19
35. Harman, M., Hu, L., Hierons, R.M., Wegener, J., Sthamer, H., Baresel, A., Roper,
M.: Testability transformation. IEEE Trans. Software Eng. 30(1), 3–16 (2004).
https://doi.org/10.1109/TSE.2004.1265732
36. Holzer, A., Schallhart, C., Tautschnig, M., Veith, H.: How did you
specify your test suite. In: Proc. ASE. pp. 407–416. ACM (2010).
https://doi.org/10.1145/1858996.1859084
37. Jaffar, J., Maghareh, R., Godboley, S., Ha, X.L.: TracerX: Dynamic symbolic
execution with interpolation (competition contribution). In: Proc. FASE. pp. 530–534.
LNCS 12076, Springer (2020). https://doi.org/10.1007/978-3-030-45234-6_28
38. Jaffar, J., Murali, V., Navas, J.A., Santosa, A.E.: Tracer: A symbolic execution
tool for verification. In: Proc. CAV. pp. 758–766. LNCS 7358, Springer (2012).
https://doi.org/10.1007/978-3-642-31424-7_61
39. Jakobs, M.C., Richter, C.: CoVeriTest with adaptive time scheduling (compe-
tition contribution). In: Proc. FASE. pp. 358–362. LNCS 12649, Springer (2021).
https://doi.org/10.1007/978-3-030-71500-7_18
40. King, J.C.: Symbolic execution and program testing. Commun. ACM 19(7), 385–394
(1976). https://doi.org/10.1145/360248.360252
41. Lemberger, T.: Plain random test generation with PRTest (competition contri-
bution). Int. J. Softw. Tools Technol. Transf. 23(6), 871–873 (December 2021).
https://doi.org/10.1007/s10009-020-00568-x
42. Liu, D., Ernst, G., Murray, T., Rubinstein, B.: Legion: Best-first concolic testing
(competition contribution). In: Proc. FASE. pp. 545–549. LNCS 12076, Springer
(2020). https://doi.org/10.1007/978-3-030-45234-6_31
Software Testing: 5th Comparative Evaluation: Test-Comp 2023 323
43. Liu, D., Ernst, G., Murray, T., Rubinstein, B.I.P.: Legion: Best-first concolic testing.
In: Proc. ASE. pp. 54–65. IEEE (2020). https://doi.org/10.1145/3324884.3416629
44. Marques, F., Santos, J.F., Santos, N., Adão, P.: Concolic execution for
webassembly (artifact). Dagstuhl Artifacts Series 8(2), 20:1–20:3 (2022).
https://doi.org/10.4230/DARTS.8.2.20
45. Metta, R., Medicherla, R.K., Karmarkar, H.: VeriFuzz: Fuzz centric test generation
tool (competition contribution). In: Proc. FASE. pp. 341–346. LNCS 13241, Springer
(2022). https://doi.org/10.1007/978-3-030-99429-7_20
46. Panichella, S., Gambi, A., Zampetti, F., Riccio, V.: SBST tool competition 2021. In:
Proc. SBST. pp. 20–27. IEEE (2021). https://doi.org/10.1109/SBST52555.2021.00011
47. Ruland, S., Lochau, M., Jakobs, M.C.: HybridTiger: Hybrid model checking
and domination-based partitioning for efficient multi-goal test-suite generation
(competition contribution). In: Proc. FASE. pp. 520–524. LNCS 12076, Springer
(2020). https://doi.org/10.1007/978-3-030-45234-6_26
48. Song, J., Alves-Foss, J.: The DARPA cyber grand challenge: A competi-
tor’s perspective, part 2. IEEE Security and Privacy 14(1), 76–81 (2016).
https://doi.org/10.1109/MSP.2016.14
49. Stump, A., Sutcliffe, G., Tinelli, C.: StarExec: A cross-community infrastructure
for logic solving. In: Proc. IJCAR, pp. 367–373. LNCS 8562, Springer (2014).
https://doi.org/10.1007/978-3-319-08587-6_28
50. Sutcliffe, G.: The CADE ATP system competition: CASC. AI Magazine 37(2),
99–101 (2016)
51. Visser, W., Păsăreanu, C.S., Khurshid, S.: Test-input generation
with Java PathFinder. In: Proc. ISSTA. pp. 97–107. ACM (2004).
https://doi.org/10.1145/1007512.1007526
52. Wendler, P., Beyer, D.: sosy-lab/benchexec: Release 3.16. Zenodo (2023).
https://doi.org/10.5281/zenodo.7612021
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
FuSeBMC_IA: Interval Analysis and Methods for Test Case Generation
(Competition Contribution)
Mohannad Aldughaim, Kaled M. Alshmrany, Mikhail R. Gadelha, Rosiane de Freitas, and Lucas C. Cordeiro
1 Introduction
In Test-Comp 2022 [1], cooperative verification tools showed their strength by being
the best tools in each category. FuSeBMC [9,10] is a test-generation tool that employs
cooperative verification using fuzzing and BMC. FuSeBMC starts with an analysis to
instrument the Program Under Test (PUT); then, based on the results from BMC/AFL,
it generates the initial seeds for the fuzzer. Finally, FuSeBMC keeps track of the goals
covered and updates the seeds, while producing test cases using BMC, fuzzing, and the
selective fuzzer. This year, we introduce abstract interpretation to FuSeBMC to improve
test-case generation. In particular, we use interval methods to support our instrumentation
and fuzzing by providing intervals that help reach (instrumented) goals faster. The selective
fuzzer is a crucial component of FuSeBMC, which generates test cases for uncovered
goals based on information obtained from test cases produced by BMC/fuzzer [9].
This work is based on our previous study, where contractor techniques for CSP/CP are
applied to prune the state-space search [12]. Our approach also uses Frama-C [4,8] to
obtain variable intervals, further pruning the state-space exploration. Our original
contributions are: (1) improved instrumentation that allows abstract interpretation to
provide information about variable intervals; (2) interval methods that improve fuzzing
and produce higher-impact test cases by pruning the search-space exploration; and
(3) reduced resource usage (incl. memory and CPU time).
Fig. 1: FuSeBMC_IA’s architecture (components: FuSeBMC analysis, seed generation via BMC/AFL, test-generation engines, selective fuzzer, and tracer). The changes introduced in FuSeBMC_IA for Test-Comp 2023 are highlighted in green. The new Interval Analysis & Methods component generates intervals to be used by the selective fuzzer.
A contractor narrows the variable domains of a given CSP [5]. We use the
forward-backward contractor, which is applied to a CSP/CP with a single
constraint [3] and is implemented in the IBEX library [6]. IBEX is a C++ library
for constraint processing over real numbers that implements contractors. More
details regarding contractors can be found in our current work-in-progress [12].
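To make the forward-backward idea concrete, the following self-contained C sketch (our illustration, not FuSeBMC_IA’s implementation, which relies on IBEX) contracts the domains of x and y under the single constraint x + y = 10:

#include <stdio.h>

typedef struct { double lo, hi; } Interval;

/* Intersection of two intervals; empty if lo > hi. */
static Interval meet(Interval a, Interval b) {
  Interval r = { a.lo > b.lo ? a.lo : b.lo, a.hi < b.hi ? a.hi : b.hi };
  return r;
}

/* Interval subtraction: [a.lo - b.hi, a.hi - b.lo]. */
static Interval sub(Interval a, Interval b) {
  Interval r = { a.lo - b.hi, a.hi - b.lo };
  return r;
}

/* One forward-backward contraction for the constraint x + y = c:
   any value of x outside c - Y (and of y outside c - X) cannot
   satisfy the constraint, so it is pruned from the domain. */
static void contract_sum(Interval *x, Interval *y, Interval c) {
  *x = meet(*x, sub(c, *y));
  *y = meet(*y, sub(c, *x));
}

int main(void) {
  Interval x = { -100, 100 }, y = { 0, 30 }, c = { 10, 10 };
  contract_sum(&x, &y, c);
  /* Prints x = [-20, 10], y = [0, 30]: x cannot exceed 10 - 0. */
  printf("x = [%g, %g], y = [%g, %g]\n", x.lo, x.hi, y.lo, y.hi);
  return 0;
}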
Parsing Conditions and CSP/CP creation for each goal. While traversing the PUT’s
Clang AST [2], we consider each statement’s conditions that lead to an injected goal:
the conditions are parsed and converted from Clang expressions [2] to IBEX expressions [6].
The converted expressions are used as the constraints of a CSP/CP to create a contractor.
After parsing the goals, we have a CSP/CP for each goal. If a goal does not have
a CSP/CP, the intervals for its variables are left unchanged. In case of multiple
conditions, we create a constraint for each condition and take the intersection/union.
At the end of this phase, we have a list of each goal and its contractor, as well as a
list of variables for each contractor, which will be used to instrument the Frama-C
file in the next phase.
Domains reduction. In this step, we attempt to reduce the domains (primarily starting
from (−∞, ∞)) to a smaller range. This is done via the Frama-C EVA plugin (evolved
value analysis) [7]. First, during the instrumentation, we produce an instrumented file
for Frama-C that uses its intrinsic function Frama_C_show_each() (cf. Fig. 2).
This function allows us to add custom text to identify goals and how many variables are
in each call. Second, we run Frama-C to obtain the new variable intervals. Finally, we
update the domains of the corresponding CSP/CP.
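As an illustration of this instrumentation (the goal label, conditions, and variables are invented for this sketch), the file given to Frama-C contains calls such as the following, analyzed with frama-c -eva:

extern int __VERIFIER_nondet_int(void);

/* EVA treats functions whose names start with Frama_C_show_each as
   intrinsics and prints the abstract value of each argument. */
void Frama_C_show_each_goal_1(int x, int y);

int main(void) {
  int x = __VERIFIER_nondet_int();
  int y = __VERIFIER_nondet_int();
  if (x > 0 && x < 1000 && y > x) {
    /* EVA reports here, e.g., x in [1..999] and y in [2..2147483647],
       which is used to update the CSP/CP domains for this goal. */
    Frama_C_show_each_goal_1(x, y);
  }
  return 0;
}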
Applying contractors. Contractors prune the domains of the variables by removing
a subset of the domain that is guaranteed not to satisfy the constraints. With all
the components of a CSP/CP available, we now apply the contractor for each goal and
produce the output file shown in Figure 2. The result is split per goal into two categories.
The first category lists each variable and the possible intervals (lower bound followed
by upper bound) to enter the given condition. The second category contains unreachable
goals, i.e., goals for which the contractor result is an empty vector.
Selective Fuzzer. The selective fuzzer parses the file produced by the analyzer,
extracts all the intervals, applies these intervals to each goal, and starts fuzzing within
the given interval, thus pruning the search space from random intervals to informed
intervals. The selective fuzzer also prioritizes the goals with smaller intervals and assigns
low priority to goals with unreachable results.
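The following C sketch captures this prioritization; all names and data structures are invented, and, as a simplification, unreachable goals are skipped entirely rather than merely deprioritized:

#include <stdlib.h>

typedef struct {
  int id;
  long lo, hi;       /* contracted interval for the goal's input */
  int unreachable;   /* contractor returned an empty vector      */
} Goal;

/* Order goals by interval width so that tightly constrained
   (cheapest to hit) goals are fuzzed first. */
static int by_width(const void *a, const void *b) {
  const Goal *g = a, *h = b;
  long wg = g->hi - g->lo, wh = h->hi - h->lo;
  return (wg > wh) - (wg < wh);
}

/* Placeholder for running the PUT on one input and checking the goal. */
extern int run_test_case(int goal_id, long input);

static void selective_fuzz(Goal *goals, int n, int tries) {
  qsort(goals, n, sizeof *goals, by_width);
  for (int i = 0; i < n; i++) {
    if (goals[i].unreachable)
      continue;  /* simplification: skip goals proven unreachable */
    for (int t = 0; t < tries; t++) {
      long span = goals[i].hi - goals[i].lo + 1;
      long input = goals[i].lo + (rand() % span);  /* sample in interval */
      if (run_test_case(goals[i].id, input))
        break;   /* goal covered; move on to the next goal */
    }
  }
}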
5 Software Project
FuSeBMC_IA is publicly available on GitHub 1 under the terms of the MIT License.
FuSeBMC_IA is implemented using a combination of Python and C++. Build
instructions and dependencies are available in the README.md file. FuSeBMC_IA is a
fork of the main FuSeBMC project, which is available on GitHub 2.
1 https://github.com/Mohannad-Aldughaim/FuSeBMC_IA
2 https://github.com/kaled-alshmrany/FuSeBMC
6 Data-Availability Statement
All files necessary to run the tool are available on Zenodo [13].
Acknowledgment
King Saud University, Saudi Arabia 3 supports the FuSeBMC_IA development. The
work in this paper is also partially funded by the UKRI/IAA project entitled “Using
Artificial Intelligence/Machine Learning to assess source code in Escrow”.
References
1. Beyer, D.: Advances in automatic software testing: Test-Comp 2022. In: FASE. pp. 321–335 (2022). DOI: https://doi.org/10.1007/978-3-030-99429-7_18
2. The Clang Team: Clang documentation (2022), https://clang.llvm.org/docs/UsersManual.html, accessed: 19-12-2022
3. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer London. pp. 11–100 (2001). DOI: https://doi.org/10.1007/978-1-4471-0249-6_2
4. Cuoq, P., Kirchner, F., Kosmatov, N., Prevosto, V., Signoles, J., Yakobowski, B.: Frama-C. In: International Conference on Software Engineering and Formal Methods. pp. 233–247 (2012). DOI: https://doi.org/10.1007/978-3-642-33826-7_16
5. Mustafa, M., Stancu, A., Delanoue, N., Codres, E.: Guaranteed SLAM—an interval approach. Robotics and Autonomous Systems 100, 160–170 (2018). DOI: https://doi.org/10.1016/j.robot.2017.11.009
6. Chabert, G.: ibex-lib.org, http://www.ibex-lib.org/, accessed: 19-12-2022
7. Bühler, D.: EVA, an evolved value analysis for Frama-C: structuring an abstract interpreter through value and state abstractions. PhD thesis, University of Rennes 1 (2017). DOI: https://doi.org/10.1007/978-3-319-52234-0_7
8. Baudin, P., Bobot, F., Bühler, D., Correnson, L., Kirchner, F., Kosmatov, N., Maroneze, A., Perrelle, V., Prevosto, V., Signoles, J., et al.: The dogged pursuit of bug-free C programs: the Frama-C software analysis platform. Communications of the ACM 64, 56–68 (2021). DOI: https://doi.org/10.1145/3470569
9. Alshmrany, K., Aldughaim, M., Bhayat, A., Cordeiro, L.: FuSeBMC: An energy-efficient test generator for finding security vulnerabilities in C programs. In: International Conference on Tests and Proofs. pp. 85–105 (2021). DOI: https://doi.org/10.1007/978-3-030-79379-1_6
10. Alshmrany, K., Aldughaim, M., Bhayat, A., Cordeiro, L.: FuSeBMC v4: Smart seed generation for hybrid fuzzing. In: International Conference on Fundamental Approaches to Software Engineering. pp. 336–340 (2022). DOI: https://doi.org/10.1007/978-3-030-99429-7_19
11. Gadelha, M., Monteiro, F., Morse, J., Cordeiro, L., Fischer, B., Nicole, D.: ESBMC 5.0: An industrial-strength C model checker. In: ASE. pp. 888–891 (2018). DOI: https://doi.org/10.1145/3238147.3240481
12. Aldughaim, M., Alshmrany, K., Menezes, R., Stancu, A., Cordeiro, L.: Incremental symbolic bounded model checking of software using interval methods via contractors.
13. Aldughaim, M., Alshmrany, K., Gadelha, M., Freitas, R., Cordeiro, L.: FuSeBMC v.5: Interval analysis and methods for test case generation. Zenodo (December 2022). DOI: https://doi.org/10.5281/zenodo.7473124
3 https://ksu.edu.sa/en/
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use,
sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Com-
mons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder.
Author Index
A
Aguirre, Nazareno 3, 111
Aldughaim, Mohannad 324
Alshmrany, Kaled M. 324
Ansari, Saba Gholizadeh 151
B
Baunach, Marcel 26
Bengolea, Valeria 111
Beyer, Dirk 309
Bianculli, Domenico 249
Bliudze, Simon 143
Brizzio, Matías 3
Burholt, Charlie 241
C
Calinescu, Radu 241
Carvalho, Luiz 3
Cavalcanti, Ana 241
Chalupa, Marek 260
Cordeiro, Lucas C. 324
Cordy, Maxime 3
D
d’Aloisio, Giordano 88
Dastani, Mehdi 151
Dawes, Joshua Heneage 249
de Freitas, Rosiane 324
Degiovanni, Renzo 3
Di Marco, Antinisca 88
Dignum, Frank 151
Din, Crystal Chang 220
E
El-Hokayem, Antoine 173
F
Falcone, Yliès 173
Figueiredo, Eduardo 282
Frias, Marcelo F. 111
G
Gadelha, Mikhail R. 324
Gopinath, Divya 133
H
Haltermann, Jan 195
Henzinger, Thomas A. 260
Huisman, Marieke 143
J
Jakobs, Marie-Christine 195
Jones, Maddie 241
K
Kamburjan, Eduard 220
Keller, Gabriele 151
Kifetew, Fitsum Meshesha 151
L
Larsen, Kim Guldstrand 26
Lei, Stefanie Muroya 260
Li, Zhe 67
Lorber, Florian 26
Lungeanu, Luca 133
M
Mangal, Ravi 133
Molina, Facundo 111
Muehlboeck, Fabian 260
N
Neele, Thomas 47
Nyman, Ulrik 26