
An Empirical Study Assessing Software Modeling in Alloy


2023 IEEE/ACM 11th International Conference on Formal Methods in Software Engineering (FormaliSE)

979-8-3503-1263-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/FormaliSE58978.2023.00013

Niloofar Mansoor, School of Computing, University of Nebraska - Lincoln, Lincoln, NE, USA
Hamid Bagheri, School of Computing, University of Nebraska - Lincoln, Lincoln, NE, USA
Eunsuk Kang, Institute of Software Research, Carnegie Mellon University, Pittsburgh, PA, USA
Bonita Sharif, School of Computing, University of Nebraska - Lincoln, Lincoln, NE, USA

Abstract: Alloy is a declarative formal modeling language with syntax derived from notations common to object-oriented design and first-order relational logic semantics. To better understand the usability of Alloy, the paper presents the results of an empirical study with 30 participants assessing two types of modeling tasks: bug fixing and model building based on natural language specifications. The participants consisted of both novices and non-novices. Besides accuracy and time to complete tasks, we also examined the correlation between the performance of two cognitive tasks and task performance. Results indicate that overall, non-novices completed the tasks with significantly higher accuracy (54% more accurate) than novices. In the novice group, performing more actions using the Alloy Analyzer led to more edits and, eventually, higher scores in the bug fixing tasks. We found that participants of all levels had much difficulty writing a model from scratch, and they did not utilize the Analyzer to improve their models. On average, non-novices completed all the tasks 32 minutes faster than novices. Non-novices who performed better on the Alloy tasks had higher mental rotation scores, which indicates the importance of spatial cognition ability in solving Alloy tasks. Overall, we find that there is a definite need to improve the usability of the visualizations in the Alloy Analyzer.

Index Terms: Alloy Specification Language, Bug Fixing, Empirical Study, Software Modeling, Usability

I. INTRODUCTION

The Alloy formal specification language [1] aids in constructing models in the software design phase and checking whether specific properties of systems hold. Its back-end tool, the Alloy Analyzer [2], performs automated analysis on models, checks assertions, and generates counterexamples to those assertions if they do not hold. Alloy has been used in a wide range of applications, such as test case generation [3], security analysis of Android [4], [5], IoT devices [6], and verification of critical properties of real systems [7], [8].

Traditionally, working with formal specification languages has required in-depth mathematical knowledge due to their complexity, and the learning curve is very steep for non-mathematicians. In dealing with formal methods and designing formal specification languages, the most important factors have been soundness and correctness, and factors like readability, usability, and comprehension were often overlooked [9]–[13].

Alloy's design seeks to alleviate this problem with its easy-to-understand syntax and use of familiar mathematical concepts such as sets. However, there is very little research on the usability and comprehension of Alloy as a language from the user perspective and how it can be best taught to novices. Krishnamurthi et al. make a case for paying more attention to human factors in formal methods and state that performing more user-focused research can be beneficial for building better tools and encouraging more people to learn formal methods [10]. There has been some work on how students use the Alloy Analyzer in different contexts [14], [15]. However, none of this work is focused on the comprehension of the Alloy language. This is important to study because even though the Alloy Analyzer is helpful, its use does not indicate comprehension of the actual Alloy model specifications. Moreover, almost all of the prior work is done with non-novices. There is a clear gap in human factors studies in the literature on how expertise plays a role in the comprehension of Alloy specifications. It has been shown that including varied expertise in program comprehension studies can give interesting insights on how developers think about problem solving [16].

To better understand the usability aspect of Alloy, the paper presents an empirical study on the comprehension of the Alloy language in two contexts: fixing bugs and building models. Both novices and non-novices participated in the experiment. For the bug fixing tasks, participants were given natural language specifications for problems and their corresponding Alloy models, which included buggy statements. The participants were tasked with fixing the problems within the models so they matched their specifications. To the best of our knowledge, this is the first study to assess the impact of experience on the ability to fix syntactic and semantic bugs in

Alloy models to match specifications. For the model building task, we presented the participants with a natural language specification and a blueprint of an Alloy specification for them to complete. Research has shown that cognitive skills such as spatial cognition and working memory capacity are correlated with mathematical ability [17], [18]. In software engineering studies, a moderate correlation between working memory capacity and fixing bugs was found by Baum et al. [19]. Sharafi et al. found that spatial ability and data structure manipulation are correlated [20]. We wanted to test whether such correlations exist between cognitive skills and fixing bugs/building models in Alloy. To do this, we performed two sets of cognitive tests (mental rotation [21] and operation span [22]) to explore the correlation between these cognitive skills and solving Alloy problems. The contributions of this paper are as follows:
• An empirical study that explores comprehension of the Alloy specification language in bug fixing tasks (syntactic and semantic) and a model building task.
• Comprehension pattern differences between novices and non-novices in Alloy, which have not been studied before.
• A detailed analysis of Alloy Analyzer usage patterns during the tasks.
• Cognitive tests to investigate the relationship between working memory capacity and spatial cognition ability with software modeling, also not studied before.
• Usability guidelines to improve future Alloy releases.
• A complete replication package for verifiability and replication purposes.

Results indicate that non-novices find and fix Alloy bugs with significantly higher accuracy (54% more on average) and complete the bug fixing tasks 32 minutes faster than novices. These results show that even a few months of familiarity and working with the Alloy language can make a big difference in levels of comprehension. We found that building a model from scratch is difficult for both novices and non-novices. Many non-novices were not successful in adding the specified dynamic properties to their model. For the bug fixing tasks, we found that the number of Alloy Analyzer actions and model edits correlated, which predicted the accuracy score of novices. We found that novices and non-novices make incremental changes before running the model to check whether they can see a correct instance or fix the issues that generate counterexamples. On average, novices make more changes to the Alloy models to get to the correct specification (an average of 12 more edits for novices compared to non-novices over all the tasks). We found that spatial cognition and Alloy bug fixing ability correlated, indicating the importance of this cognitive skill in understanding Alloy's underlying mathematical concepts.

II. RESEARCH QUESTIONS

The paper addresses the following research questions:
• RQ1: What is the difference in accuracy and speed between novices and non-novices for bug fixing tasks in Alloy models?
• RQ2: What is the difference in accuracy and speed between novices and non-novices for building Alloy models from a requirements specification?
• RQ3: What patterns do we observe in user behavior during bug fixing and model building tasks?
• RQ4: How do working memory and spatial cognition ability relate to task correctness?

The results from RQ1 help us explore the differences between the novices' and non-novices' understanding of the Alloy tasks regarding accuracy and speed in fixing bugs. RQ2 helps us understand how novices and non-novices work on the model building task. Both these questions can help us understand how prior exposure to Alloy makes a difference in problem solving. RQ3 helps us understand detailed patterns of bug finding and model building in both groups via recorded snapshots of the specification every time a participant performed an action using the Analyzer. Finally, RQ4 explores the relationship between performance on the Alloy tasks, spatial cognition ability, and working memory capacity, which are two different cognitive skills related to mathematical and programming abilities.

III. RELATED WORK

The two studies most related to ours are Li et al. [15] and Danas et al. [14], who performed empirical studies on the Alloy language in novices. Li et al. [15] explored how the Alloy tool is used in practice by beginners by logging some of the user interaction with the Alloy tool when students were building Alloy models. The students were asked to build Alloy models, which can indicate their language comprehension. In contrast, our study is focused on exploring comprehension by using both bug fixing and model building tasks, which can give us more detailed information about the participants' comprehension. Unlike their study, we focus on both novices and non-novices.

Danas et al. [14] performed studies on both students and Mechanical Turk participants to explore how different types of outputs of the Alloy Analyzer model finder are used in practice. They explored principled output forms (such as minimal and maximal forms), provenance, and unsatisfiable cores. Their goal was to see how the different types of outputs help users understand and debug Alloy models.

Our work is complementary to both of these studies. We focus on how users find and fix different kinds of bugs (syntactic and semantic) in Alloy models based on their natural language specifications. This can indicate how comfortable the participants are in understanding the Alloy syntax and language and how they work with the Alloy Analyzer. This is one of the first works we know of that includes both novices (N=17) and non-novices (N=13) in an empirical study on Alloy. The comprehension model and patterns of using the Alloy Analyzer can be observed in both groups and be compared to the findings of Li et al. [15]. To our knowledge, this is also the first study to explore the relationship between specific cognitive abilities and comprehension of a lightweight formal language such as Alloy. Exploring these relationships

can give us insight into what skills are more indicative of better comprehension of specification languages.

IV. EXPERIMENTAL DESIGN

We describe the experimental design of the study, including participants, tasks, study instrumentation, and measures. A complete replication package [23] is available.

A. Experiment Overview

The goal of our controlled experiment [24], [25] is to explore comprehension of the Alloy specification language in novices and non-novices in evaluating syntactic and semantic bug fixing and model building. We designed two different task categories: cognitive tasks and Alloy tasks. The cognitive tasks measure working memory capacity and spatial cognition ability (RQ4) to determine if these abilities play a role in bug fixing and model building performance. The Alloy tasks are designed for the participants to locate and fix syntactic and semantic bugs in the Alloy models and build models based on a specification. We measure comprehension using accuracy, speed, number of Analyzer actions, and number of edits. An overview of the experiment is shown in Table I. The university's institutional review board approved the study.

TABLE I: Experiment Overview
Goal                   Study Alloy specification language comprehension
Independent Variable   Experience (novice and non-novice)
Tasks                  Cognitive: Operation Span, Mental Rotation
                       Alloy Tasks: Syntactic Alloy Error, Semantic Alloy Bug, Model Building
Dependent Variables    Accuracy, Speed, Usage of Analyzer, Number of Analyzer Actions, Number of Edits

B. Participants and Experience

We recruited 30 participants from different universities and institutions worldwide. Each potential participant was sent an email inviting them to participate in the study. If they accepted the invitation, they were assigned an ID for the pre and post questionnaires and were sent the study package and the consent form via email. When the participants submitted the study, they were compensated with a $10 Amazon e-gift card.

Our participants had different levels of expertise in Alloy, ranging from beginners who had recently started learning the language to experts who had been working with the language for years. Participants were recruited through emails to class mailing lists, posts on Alloy messaging boards, and professional contacts. They were asked to fill out a demographic questionnaire before the start of the study, which asked them about their age, gender, affiliation, degree, native language, and proficiency in English. There were 19 male participants and 11 female participants. Eleven participants were between the ages of 21-25, nine were between 26-30, seven were between 31-35, two were 36-40, and one was over 40 years old. Twenty-nine participants were either pursuing or had Computer Science, Computer Engineering, or Software Engineering degrees, and one was pursuing an Industrial and Labor Relations degree. Three participants were either pursuing a Bachelor's degree or held one. The rest of the participants were either pursuing a graduate degree or held Master's or Doctorate degrees.

We asked the participants to self-report their experience level, as it has been established [26] that self-estimation is a reliable measurement for programming experience. Participants completed a post-questionnaire (after they completed the study to avoid any imposter syndrome bias) that asked them to rate their programming skills, design skills, knowledge of set theory, first-order logic, and object-oriented programming skills. They were also asked to rate their comprehension level of Alloy syntax and their level of comfort using the Alloy Analyzer. The post-questionnaire showed us that some participants were more familiar with Alloy despite only using it for less than a year, specifically participants who were using Alloy for research. We decided not to group these participants with novices, as they had a deeper understanding of Alloy due to extensive use. Thus, we defined non-novices in Alloy as having more than one year of experience, or having less than one year of experience but having familiarity with the language and rating their comfort level in understanding Alloy syntax higher than or equal to 3 out of 5. With these criteria, our novice group consisted of N=17 and non-novices of N=13.

C. Tasks

The first category of tasks consists of the cognitive tasks. The two cognitive tasks were the Operation Span Task [22], [27] and the 3D Mental Rotation Task [21], which measure working memory and spatial cognition ability, respectively. Prior research [19], [20] has shown a correlation between cognitive tasks and software comprehension tasks. We aimed to explore whether these correlations exist for bug fixing and model building tasks in Alloy.

We used a Python version of the Operation Span task [28] for our study. This task shows a number of letters to the participant, with a distractor math task between each letter, and asks them to recall all the letters they have seen in order. The final score calculated by the application (partial-credit unit score) is between 0 and 1, with a score of 1 indicating that the participant has recalled all the letters correctly. Each participant completed four practice trials and 12 task trials.

We implemented a Java desktop application for the 3D Mental Rotation task [29]. The task shows an image of a 3D object to the participant and asks them to choose the correct rotations of the 3D object from four different images presented to them. For this task, each participant completed five practice trials and 20 task trials. Since the participants had to choose two correct rotations of the 3D object, we gave them 1 point for each correct choice they made. With this scoring criterion, the maximum score for this task is 40.

The second category of tasks consists of the Alloy tasks. The first set of Alloy tasks were bug fixing tasks contained in three models. The second set consisted of a single task asking participants to build an Alloy model according to a natural

TABLE II: Semantic bugs in Alloy models' predicates (task: fix a predicate)

Model: grade.als
  Original specification:
    s !in a.assigned_to
  Altered specification:
    s in a.assigned_to

Model: balancedBST.als
  Original specification:
    all nl: n.left.*(left + right) | nl.elem < n.elem
    all nr: n.right.*(left + right) | nr.elem > n.elem
  Altered specification:
    some n.left => n.left.elem < n.elem
    some n.right => n.right.elem > n.elem

Model: balancedBST.als
  Original specification:
    (HasAtMostOneChild[n1] && HasAtMostOneChild[n2]) =>
      (let diff = minus[Depth[n1], Depth[n2]] | -1 <= diff && diff <= 1)
  Altered specification:
    (HasAtMostOneChild[n1] && HasAtMostOneChild[n2]) =>
      (let diff = minus[Depth[n1], Depth[n2]] | -1 <= diff || diff <= 1)

Model: farmer.als
  Original specification:
    (one item : from - Farmer | {
      from' = from - Farmer - item - from'.eats
      to' = to + Farmer + item })
  Altered specification:
    (one item : from - Farmer | {
      from' = from - Farmer - item
      to' = to - to.eats + Farmer + item })
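
To make the effect of the two altered balancedBST.als predicates in Table II concrete, the following Python sketch mimics them outside of Alloy. It is an illustration written for this text, not part of the study materials or the replication package; the `Node` class and all function names are hypothetical stand-ins for the Alloy expressions in the table.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    elem: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def subtree(n: Optional[Node]) -> Iterator[Node]:
    """Yield n and all of its descendants, like n.*(left + right) in Alloy."""
    if n is not None:
        yield n
        yield from subtree(n.left)
        yield from subtree(n.right)


def sorted_original(n: Node) -> bool:
    """Original constraint: every node in each subtree is ordered relative to n."""
    return (all(d.elem < n.elem for d in subtree(n.left))
            and all(d.elem > n.elem for d in subtree(n.right)))


def sorted_altered(n: Node) -> bool:
    """Altered constraint: only the direct children are compared with n."""
    return ((n.left is None or n.left.elem < n.elem)
            and (n.right is None or n.right.elem > n.elem))


def balanced_original(diff: int) -> bool:
    """Original depth condition: -1 <= diff && diff <= 1."""
    return -1 <= diff and diff <= 1


def balanced_altered(diff: int) -> bool:
    """Altered depth condition: -1 <= diff || diff <= 1 (true for any integer)."""
    return -1 <= diff or diff <= 1


# Hypothetical tree: 5 has left child 3, and 3 has right child 9.
# The value 9 sits in the left subtree of 5, so this is not a search tree.
root = Node(5, left=Node(3, right=Node(9)))

original_ok = all(sorted_original(n) for n in subtree(root))
altered_ok = all(sorted_altered(n) for n in subtree(root))
print("original accepts tree:", original_ok)
print("altered accepts tree:", altered_ok)
print("altered depth check for diff = 7:", balanced_altered(7))
```

Under these assumptions, the altered ordering constraint accepts a tree that the original rejects, and the altered depth condition holds for every depth difference, which is the kind of incorrect instance the Analyzer would then be able to show.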

language specification. We chose three Alloy models from the GitHub repository by Wang et al. [30], located at [31], for the bug fixing tasks. The three models we chose are grade.als, balancedBST.als, and farmer.als. The Grade model describes a gradebook designed to include constraints about the graders and classes, the BST model specifies a balanced binary search tree, and the Farmer model seeks to solve the classic River Crossing Puzzle [32]. Due to their various levels of complexity, we label these models as easy, medium, and difficult, respectively. We also provide natural language specifications explaining what problems these Alloy specifications are modeling. The participant was instructed to fix the model in a way that would adhere to the natural language specification.

We used some of the bugs in ARepair's [30] buggy models but introduced some other bugs to fit our task types: syntactic bugs and semantic bugs. In introducing each syntactic bug, one line in an Alloy model was changed. These bugs elicit errors from the Alloy Analyzer that would help the user in detecting and fixing them, and fixing the bugs requires a level of understanding of Alloy syntax. We also introduced semantic bugs into the models. We changed either facts or constraints within a signature to modify the constraints of the model. Finally, we also changed some predicates in the models to change their meanings. The semantic changes resulted in the Alloy Analyzer showing incorrect instances or counterexamples to assertions, and the participants were expected to find and correct these bugs. Table II shows the semantic bugs in Alloy models' predicates. The rest of the tasks can be found in the replication package.

The second set of Alloy tasks was on model building. After working on the bug fixing tasks, the participants were asked to build an Alloy model to describe a Linked List. The natural language specification described the structure and constraints of the linked list. The specification and the partial model containing a blueprint of a linked list and its functions provided to the participants can be found in the replication package [23].

The participants were always asked to do the cognitive tasks first and then work on the two sets of Alloy tasks, but the order of the tasks was randomized within each category (Alloy or Cognitive). Since the models had different levels of difficulty, we permuted the order of the tasks to control for order effects (3! Alloy tasks × 2! Cognitive tasks), ending up with 12 different variations of the study. Each participant had to complete two cognitive tasks and fix 10 Alloy bugs in total, and a subset (who volunteered) were asked to work on the model building task. The participants were not aware of how many Alloy bugs there were. The only instructions given to them were to make sure the Alloy model conforms to the natural language specification.

D. Dependent Variables

We modified the Alloy Analyzer and instructed the participants to use it while working on the tasks. The modified Analyzer logged snapshots of the open Alloy specification file every time the user executed a command. The logged timestamped user actions are as follows.
• Execute: Runs the most recent or the first written command if no command has been executed so far. The commands can be either "assert" for generating counterexamples to an assertion or "run" for generating instances of a predicate. The command "run" can be combined with "show" to show an instance of the model.
• Show Instance: Displays the most recent instance or counterexample.
• Show MetaModel: Displays a meta model of the currently open Alloy specification [1], which shows the relationships between different elements (e.g., signatures) of the specification as an object model.

We derived the following dependent variables.
• Accuracy: Accuracy for bug fixing tasks is calculated by assigning 1 point to each correct bug fix and half a point for localizing a bug. There were cases where a participant would localize the line of the bug but could not correct it. The maximum score a participant could receive from all the tasks was 10 points. Tasks for models Grade and Farmer had a maximum of 3 points, and tasks for the Balanced BST model had a maximum of 4 points. Accuracy for the model building task is determined by how participants solved the subproblems of the tasks. In the model building task, we asked the participants to build an Alloy model of a linked list, which included specifying the properties of connectivity of all nodes and ensuring that no node points to the linked list head. We also asked the participants to write predicates for adding and removing nodes from a linked list and different assertions to ensure the aforementioned properties hold. We also

gave points to the participants for building the correct signatures and relationships in the model.
• Speed: Speed is calculated by looking at the start time and finish time of the tasks.
• Number of Analyzer Actions: We used the information from the logs to calculate how many times they performed an action (Execute, Show Instance, etc.) on each model to see instances or check assertions.
• Number of Edits: We used the log information to calculate the number of times the participants edited each model in the bug fixing task.

E. Study Instrumentation

We sent emails to potential participants inviting them to participate in the study. Once they accepted the invitation, another email was sent, which contained the link to the study package. The participants did the study remotely in the location of their choice. The study package included the cognitive and Alloy tasks, a modified version of the Alloy Analyzer, a tutorial on Alloy for participants to remember the language syntax and structure, a sample task on Alloy with the correct answers, and a ReadMe file detailing the steps to do the study. We also asked the participants to fill out pre and post questionnaires to gather demographic and self-reported experience data. Participants were instructed to read through the ReadMe file that walked them through the steps of the study. The participants were not given a time limit to complete the tasks, but they were asked to do the study in one sitting and without interruption. Finally, they had to submit the study package, containing all their changed files and the generated logs, back to the researchers. Note that the study was not conducted via a web browser since we wanted to take full advantage of all the Alloy Analyzer functionalities (not available on the web).

V. EXPERIMENTAL RESULTS

A. Pre-processing

We created a master file that included each participant's cognitive and Alloy tasks data. For the operation span task, we gathered the automatically graded scores of the working memory tasks from the generated files. For the mental rotation task, a Python script was written to grade the result files. The automated nature of grading these two tasks eliminates the possibility of errors in grading. For the Alloy bug fixing task data, a Python script was written to show the differences between the Alloy logs generated by each action by creating HTML files highlighting the differences between each run. We used the highlighted differences to grade each submission by looking at the submitted version of the model next to the original version that included the bugs. One of the authors ran each of the submissions to make sure they passed all the checks. We did not automatically grade the Alloy tasks via a script since there were multiple ways of fixing a bug in some cases. Running every submission manually shows whether each bug was fixed or not, leaving no room for subjectivity in the grading. When looking at the differences between the original and submitted files, if a change was made to a buggy line, we consider this as bug localization. Some participants also commented on the lines with "bug detected". Section IV-D lists the scoring criteria.

Aside from the submitted models from each participant, a number of files were generated by the Analyzer if the participant performed an action. These files include the snapshot of the Alloy model at the time of action, the type of action performed (execute, show instance, show metamodel), and the timestamp of the snapshot. We refer to these snapshots as "logs". Due to technical difficulties, we could not process 3 participants' log files. We also wrote a Python script to extract the action sequence and time spent between each action. We used the difference finder to go through all the logs submitted by the user for each model to show the differences between the snapshots after each action was performed. For every set of logs we had (for different participants and models), we generated an HTML file highlighting the steps they took to localize and fix the Alloy bugs. From these HTML files, we acquired information about the number of edits. We used JASP [33] to run the statistical tests.

B. RQ1 Results: Accuracy and Speed in Bug Fixing Tasks

Research question 1 asks about the accuracy and speed of novices and non-novices when fixing Alloy bugs. The null and alternate hypotheses are as follows.
AH0: Having experience (non-novices) working with the Alloy specification language does not have an effect on the accuracy of solving the tasks.
AHA: Having experience (non-novices) working with the Alloy specification language has an effect on the accuracy of solving the tasks.
TH0: Having experience (non-novices) working with the Alloy specification language does not have an effect on the speed of solving the tasks.
THA: Having experience (non-novices) working with the Alloy specification language has an effect on the speed of solving the tasks.

To test the AH hypothesis, we used the participants' accuracy score on the Alloy bug fixing tasks as a measure of program comprehension. For each syntactic or semantic bug, if the participant changed the buggy line, they received a score of 0.5 for bug localization. If the participant changed the buggy line and corrected the bug, they received a score of 1 for that task. There were overall three syntactic bugs and seven semantic bugs across the three Alloy models, and the maximum score a participant could receive was 10.

Table III presents the descriptive statistics. Overall, non-novices performed better on the tasks, and the average accuracy score for non-novices (M = 8.615 ± 1.024, N = 13) is 54.17% higher than the average accuracy score for novices (M = 5.588 ± 2.386, N = 17). We observe the same pattern in individual models and on different types of tasks as well. Figure 1 shows the box plots of the overall accuracy score in both groups. We can see that the scores are widely dispersed in the novice group, ranging from 1.5 to 8.5, whereas the

TABLE III: Descriptive Statistics for Accuracy Across the Tasks and Models

                     Novice (N = 17)          Non-novice (N = 13)
Measure              Mean     Std. Deviation  Mean     Std. Deviation
AccuracySyntactic    2.382    0.740           2.923    0.188
AccuracySemantic     3.206    1.937           5.692    0.925
AccuracyGrade        2.147    0.880           2.731    0.599
AccuracyBST          1.735    0.970           3.308    0.855
AccuracyFarmer       1.706    0.902           2.577    0.277
AccuracyScore        5.588    2.386           8.615    1.024

Note. AccuracySyntactic and AccuracySemantic are the scores received in syntactic and semantic task types, respectively. AccuracyGrade, AccuracyBST, and AccuracyFarmer are the scores received from solving the tasks in each model. AccuracyScore is the overall score the participants received for all the tasks in all the models.

TABLE IV: Descriptive Statistics for Speed (in minutes) for Each Model and Overall

           Novice (N = 17)          Non-novice (N = 13)
Model      Mean     Std. Deviation  Mean     Std. Deviation
Grade      16.471   10.026          12.769   5.761
BST        23.471   15.399          20.923   8.067
Farmer     41.294   61.531          15.385   7.911
Overall    81.235   66.390          49.077   15.787
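
The headline comparisons quoted in the text can be re-derived from the group means in Tables III and IV. The short Python sketch below does that arithmetic; it is only a sanity check on the reported numbers, the helper names are hypothetical, and it does not reproduce the statistical tests, which the authors ran in JASP.

```python
# Group sizes and means copied from Tables III and IV.
N_NOVICE, N_NON_NOVICE = 17, 13
ACC_NOVICE, ACC_NON_NOVICE = 5.588, 8.615        # AccuracyScore means (max 10)
TIME_NOVICE, TIME_NON_NOVICE = 81.235, 49.077    # Overall speed means (minutes)
SEM_NOVICE, SEM_NON_NOVICE = 3.206, 5.692        # AccuracySemantic means (max 7)


def relative_gain(baseline: float, other: float) -> float:
    """Percentage by which `other` exceeds `baseline`."""
    return (other - baseline) / baseline * 100


def pooled_mean(mean_a: float, n_a: int, mean_b: float, n_b: int) -> float:
    """Mean over both groups combined, weighted by group size."""
    return (mean_a * n_a + mean_b * n_b) / (n_a + n_b)


accuracy_gain = relative_gain(ACC_NOVICE, ACC_NON_NOVICE)
time_saved = TIME_NOVICE - TIME_NON_NOVICE
semantic_all = pooled_mean(SEM_NOVICE, N_NOVICE, SEM_NON_NOVICE, N_NON_NOVICE)

print(f"Non-novice accuracy gain over novices: {accuracy_gain:.2f}%")
print(f"Mean time saved by non-novices: {time_saved:.1f} minutes")
print(f"Pooled semantic score over all 30 participants: {semantic_all:.3f}")
```

With the table values as input, the relative accuracy gain rounds to the 54.17% stated in the text, the difference in overall mean time is about 32 minutes, and the pooled semantic mean matches the reported 4.283.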

non-novice group's scores range from 6 to 9.5, with the minimum score of 6 being an outlier in this group. We were also interested in the differences between the scores of all participants in syntactic and semantic task types. We observe that the participants performed better in syntactic bug fixing tasks in general, with an average of M = 2.61 ± 0.625 (maximum score of 3), compared to semantic tasks, with an average score of M = 4.283 ± 1.99 (maximum score of 7).

Fig. 1: Box plot of overall accuracy across Alloy tasks.

For the statistical tests, we first tested normality with the Shapiro-Wilk test. We found the data was not normal in all the groups. For the non-normal data, we chose to perform the Mann–Whitney U test, a non-parametric test, to compare the accuracy scores of novice and non-novice groups. Table V shows that there are significant differences between the two groups in total accuracy scores (p < .001), Grade tasks (p = .032), BST tasks (p < .001), and Farmer tasks (p = .005). We also observed a significant difference between novice and non-novice groups in fixing syntactic (p = 0.006) and semantic (p < .001) bugs. The significant differences between the two groups give us evidence to reject the null hypothesis (AH0), meaning that experience makes a difference in solving Alloy tasks.

To test the TH hypothesis, we measured participants' speed in completing the bug fixing tasks in each model. Table IV shows the descriptive statistics for the speed for each specification and overall for both groups. On average, novices (Overall column, M = 81.235 ± 66.39, N = 17) took 32 more minutes to finish the Alloy tasks compared to non-novices (Overall column, M = 49.077 ± 15.78, N = 13). We can observe the same pattern in individual models as well. We ran the Shapiro-Wilk normality test for this data. The Grade model duration was the only normally distributed measure, so we used the t-test for it. For the rest of the tasks and the overall speed, we used the Mann-Whitney U test to check for significant differences between the groups, but the test did not show any significant differences. This indicates a lack of evidence to reject the null hypothesis (Grade t-test p = 0.24, BST Mann-Whitney U p = 0.85, Farmer Mann-Whitney U p = 0.18, Overall Mann-Whitney U p = 0.28).

RQ1 Finding: Non-novices performed significantly better in all task types than novices. Participants received higher scores on syntactic tasks compared to semantic tasks. We found that, on average, non-novices finished the bug fixing tasks 32 minutes faster than novices, although this difference in speed was not statistically significant.

TABLE V: Mann-Whitney U Test Results for Accuracy in Two Groups

Measure         W       p       Rank-Biserial Correlation
AccuracyScore   18.500  < .001  −0.833
AccuracyGrade   63.000  0.032   −0.430
AccuracyBST     23.500  < .001  −0.787
AccuracyFarmer  46.500  0.005   −0.579
Syntactic       52.000  0.006   −0.529
Semantic        18.000  < .001  −0.837

Note. For the Mann-Whitney test, effect size is given by the rank biserial correlation. W, the Mann-Whitney statistic (W-Value), is the sum of the ranks of the first sample.

C. RQ2 Results: Accuracy and Speed in Model Building

Of the thirty participants who participated in the study, only sixteen received a model building task to create an Alloy specification for a Linked List. We did not send the model building task to the rest of the participants because we received feedback that building models from scratch is complex and very time consuming for the participants. To answer the research question about accuracy in model building, we checked each submitted model to see whether it satisfied the requirements of a linked list and whether it showed a correct instance. We gave the participants a partial Alloy model, which included the blueprint of two predicates (add and remove) and some signatures that contained incomplete relations (Listing 1). We graded the accuracy of the subproblems we expected the

TABLE VI: Participants' Performance in the Model Building Exercise. (0=incorrect, 0.5=partially correct, 1=correct.)

Participant  Signatures  Insert  Remove  Acyclic   Connectivity  Show      Acyclic    No Pointer to   Number of
ID                       Node    Node    Property                Instance  Assertion  Head Assertion  Actions
N3           0           0       0       0         0             0         0          0               -
N4           1           0       0       1         1             1         1          1               71
N5           0           0       0       0         0             0         0          0               -
N6           1           0       0       0         1             1         1          1               37
N7           0.5         0       0       1         0             0         0          0               1
N12          0           0       0       0         0             0         0          0               6
N15          0.5         0       0       0         1             0         0          1               32
N17          0.5         0       0       1         0             0         0          0               31
E4           1           1       1       0         0             1         1          1               1
E5           1           0       0       1         0             1         1          0               -
E6           1           0       0       1         1             1         1          1               36
E7           0           0       0       0         0             0         0          0               10
E8           1           0       1       1         1             1         1          1               13
E9           1           0       0       1         1             1         1          1               -
E10          1           0       1       1         0             1         1          1               75
E11          1           1       0       1         1             1         1          1               68
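The paper later sums the subproblem scores into a single model building score per participant before correlating it with the cognitive task scores. A minimal sketch of that summation over the Table VI rows is shown below. The dictionary layout and function name are illustrative, and the per-group means it prints are derived here for illustration only; they are not values reported in the paper.

```python
from statistics import mean

# Subproblem scores copied from Table VI, in column order:
# Signatures, Insert Node, Remove Node, Acyclic Property, Connectivity,
# Show Instance, Acyclic Assertion, No Pointer to Head Assertion.
SCORES = {
    "N3":  [0, 0, 0, 0, 0, 0, 0, 0],
    "N4":  [1, 0, 0, 1, 1, 1, 1, 1],
    "N5":  [0, 0, 0, 0, 0, 0, 0, 0],
    "N6":  [1, 0, 0, 0, 1, 1, 1, 1],
    "N7":  [0.5, 0, 0, 1, 0, 0, 0, 0],
    "N12": [0, 0, 0, 0, 0, 0, 0, 0],
    "N15": [0.5, 0, 0, 0, 1, 0, 0, 1],
    "N17": [0.5, 0, 0, 1, 0, 0, 0, 0],
    "E4":  [1, 1, 1, 0, 0, 1, 1, 1],
    "E5":  [1, 0, 0, 1, 0, 1, 1, 0],
    "E6":  [1, 0, 0, 1, 1, 1, 1, 1],
    "E7":  [0, 0, 0, 0, 0, 0, 0, 0],
    "E8":  [1, 0, 1, 1, 1, 1, 1, 1],
    "E9":  [1, 0, 0, 1, 1, 1, 1, 1],
    "E10": [1, 0, 1, 1, 0, 1, 1, 1],
    "E11": [1, 1, 0, 1, 1, 1, 1, 1],
}


def group_totals(prefix):
    """Total score (out of 8) for each participant whose ID starts with prefix."""
    return {pid: sum(row) for pid, row in SCORES.items() if pid.startswith(prefix)}


novice_totals = group_totals("N")      # N3 to N17
non_novice_totals = group_totals("E")  # E4 to E11

print("novice totals:    ", novice_totals)
print("non-novice totals:", non_novice_totals)
print("novice mean:      ", mean(list(novice_totals.values())))
print("non-novice mean:  ", mean(list(non_novice_totals.values())))
```

Keeping the rows in table order makes it easy to spot-check a total against Table VI by eye, for example E8, who missed only the Insert Node predicate.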

participants to solve. Table VI describes the accuracy scores of the novice and non-novice group participants (Novice: N3-N17, Non-novice: E4-E11). We gave each participant a score of 1 for each subproblem if it was entirely correct, a score of 0.5 for the signatures if they were partially correct, or a score of 0 for incomplete or incorrect answers. We ran the models to see the generated instances of linked lists to confirm their correctness, and two of the authors graded each subproblem and met to resolve disagreements on the scores.

Out of all the non-novice participants, only one could not complete the signatures. Three novice participants could not complete the signatures and relations inside of them correctly, three completed half of the relations correctly, and two completed the signatures. None of the novice participants could successfully write the Insert Node or Remove Node predicates. In contrast, two non-novices completed the Insert Node predicate, and three non-novices correctly wrote Remove Node. Furthermore, three novices and four non-novices correctly ensured the connectivity property. Seven non-novices and two novices wrote the correct predicate to show an instance of a linked list. We also checked how the novice and non-novice participants wrote assertions to verify the properties in their models (detailed in Table VI).

The difference between the speed of the two groups was not statistically significant, but on average, novices spent less time working on the model building task (M = 37.12 ± 20.87, N = 8) compared to non-novices (M = 45 ± 24, N = 8).

RQ2 Finding: Overall, we observed that the model building task was difficult for even the non-novices, especially writing predicates to create dynamic properties such as adding and removing a node from the linked list. Non-novices did better in ensuring the acyclic property, connectivity, showing the instance, and the assertions to verify the model.

    sig LinkedList {
      head: lone Node
    }
    sig Object {}
    sig Node {
      data: ...
      next: ...
    }
    // This predicate should insert a valid item to the list
    pred add (l: LinkedList, l': LinkedList, new: Node) {}

    pred remove (l: LinkedList, l': LinkedList, new: Node) {}

    // Acyclic property
    // Connectivity between all nodes
    // show instances of the linked list
    // assert whether the acyclic property holds
    // assert that no node points to the list head

Listing 1: Blueprint for Model Building Task

D. RQ3 Results: Behavior Patterns

The null and alternate hypotheses for RQ3 are as follows.

PH0 Having experience (non-novices) working with the Alloy specification language does not have an effect on the behavioral patterns of problem solving in Alloy.

PHA Having experience (non-novices) working with the Alloy specification language has an effect on the behavioral patterns of problem solving in Alloy.

To address RQ3, we explored the data we gathered from the participant logs to find patterns in the number of actions and the number of edits in the bug fixing tasks, as well as for the model building task. We believe that the numbers of actions and edits are useful quantitative measures of how participants use the Analyzer and of their work patterns. The Analyzer provides an interactive environment for the participants and gives us data on the number of actions (execute, show model, show metamodel) performed. Table VII shows the average number of actions and edits performed by the participants. The data shows that on average (rounded down to the nearest whole number), novices performed more actions (M = 58 ± 51, N = 16) compared to non-novices (M = 31 ± 23, N = 11). The same pattern can be observed in the Grade, Farmer, and BST logs as well. Since the data was not normally distributed, we used the Mann-Whitney U test but did not see a statistically significant difference between the number of actions across each model and overall.

Figures 2a, 2b, 2c show the categorical scatter plots of the sequence of actions performed by participants. We can see that as the tasks get more difficult (Difficulty: Grade < BST < Farmer), the number of actions performed by non-novices is reduced compared to the number of actions by novices. We can also see that “Execute” is the most popular action overall between both groups. Interestingly, participants used “Show Instance” while working on the BST model the most, to see valid instances of the balanced binary search tree. We

TABLE VII: Descriptive Statistics For Number of Logs (Number of Performed Actions) and Edits

Measure              Group       Valid  Missing  Mean    Std. Deviation
GradeLogsNum         Novice      16     1        9.375   8.921
                     Non-novice  10     3        8.800   6.909
BSTLogsNum           Novice      15     2        26.333  26.351
                     Non-novice  11     2        15.818  15.823
FarmerLogsNum        Novice      16     1        24.750  32.460
                     Non-novice  11     2        7.909   5.467
TotalLogs            Novice      16     1        58.813  51.763
                     Non-novice  11     2        31.727  23.946

GradeNumberOfEdits   Novice      16     1        6.500   5.633
                     Non-novice  10     3        9.200   7.525
BSTNumberofEdits     Novice      15     2        17.600  18.212
                     Non-novice  11     2        16.091  16.802
FarmerNumberofEdits  Novice      16     1        20.875  30.785
                     Non-novice  11     2        7.273   6.389
NumberOfEdits        Novice      16     1        43.875  39.673
                     Non-novice  11     2        31.727  24.816

observe that most of the actions are performed close to the one before. We also noticed that changes were mostly incremental: participants only changed one line and performed an action. We could not find any specific patterns in the differences between the number of edits in different models. Overall, we observe that novices make more edits than non-novices on average (Novice: M = 43.87 ± 39.67, N = 16, Non-novice: M = 31.72 ± 24.81, N = 11), but the Mann-Whitney U test did not show any significant difference between them (p = 0.4).

Additionally, we wanted to know whether performing more actions correlated with the number of edits in both of the groups. We looked at the correlation between the number of edits for each model, the number of actions on each model, and the overall number of actions and edits. We found that for both groups, and for each individual model and overall, the number of actions correlated positively with the number of edits. This indicates that seeing an instance of the model helped the participants make edits. Finally, we ran the linear regression model with bug fixing accuracy as our dependent variable and the number of actions and edits as our independent variables in both novice and non-novice groups. The regression model was statistically significant in predicting the outcome variable, meaning that the number of actions and edits had a positive effect on the novice group's overall score (p = 0.046, regression equation: Accuracy = 4.192 − 0.07(NumberOfEdits) + 0.032(TotalLogs)). We could not find this relationship in the non-novice group. We also examined whether experience had an effect on the number of actions (Table VI) participants performed in the model building task, but we did not find any significant differences in the number of actions between the two groups (Mann-Whitney U p = 0.916).

RQ3 Finding: We found that on average non-novices make fewer edits than novices in bug fixing tasks. The number of actions performed by novices is, on average, higher than that performed by non-novices, though neither difference was statistically significant. We observed that participants used the “Execute” action the most, and they made small and incremental edits before executing the commands again. The number of actions and edits correlated with bug fixing accuracy for novices.

E. RQ4 Results: Working Memory and Spatial Cognition

Research question 4 asks how working memory and spatial cognition ability relate to task correctness. The null and alternate hypotheses are as follows.

CogH0 There is no relationship between working memory and accuracy, and no relationship between mental rotation skills and accuracy.

CogHA There exists a relationship between working memory and accuracy, and mental rotation skills and accuracy.

The data was not normally distributed, so we ran Spearman's correlation to assess the relationship between the operation span task score and the overall accuracy score in Alloy bug fixing tasks. The correlation was not statistically significant (rs = 0.103, p = 0.588). Next, Spearman's correlation was run to assess the relationship between mental rotation task score and overall accuracy score in Alloy tasks. There was a positive correlation between the two variables, which was statistically significant (rs = 0.367, p = 0.046). This finding allows us to reject the null hypothesis and accept the alternate hypothesis that there is indeed a relationship between mental rotation task and bug fixing task accuracy. We also examined the relationship between these cognitive skills and the participants' model building scores. We first added all the scores of the subproblems together to get one single model building score for each participant. We could not find statistically significant correlations between the overall model building score and cognitive task scores. We examined the relationship between the cognitive task scores and the score of each of the subproblems in model building, and we only found one significant correlation between building Signatures and the Mental Rotation Task (rs = 0.555, p = 0.026).

RQ4 Finding: The statistically significant positive correlation between the bug fixing task accuracy and mental rotation score suggests that people with such skills might be better suited to understand Alloy models.

VI. THREATS TO VALIDITY

Internal Validity: The Alloy community is a relatively small community. The models that were presented in this study can be considered educational Alloy models. It is possible that some of the non-novices who have had experience with Alloy might have seen these models before while learning or teaching Alloy. We injected 1 to 2 line bugs into the models ourselves. Hence the models used were sufficiently different from models found on the web. To have everyone at the same baseline to start, we asked all the participants to go through

(a) Grade Action Sequence (b) BST Action Sequence (c) Farmer Action Sequence
Fig. 2: Analyzer Action Sequences across two groups (E1-E13: Non-novices, N1-N17: Novices)

the Alloy tutorial we sent them. They were also presented with comments on the models that could help them find the bugs. We also asked the participants not to look online for answers, but since the study was remote we did not have any control over this factor. To mitigate this threat, we took every precaution to make sure the instructions given to the participants were clear.

External Validity: The Alloy user population is smaller than the general developer population, making it extremely difficult to recruit participants. A few users dropped from the study because they did not understand Alloy and could not solve the tasks. We did not include their data in our analysis. Finding non-novices was also challenging because Alloy is mainly used in academic settings, and finding experts who were willing to partake in the study was difficult. Despite this, we secured 13 participants who knew and used Alloy before through extensive advertising and 17 who were willing to read the tutorial and learn the language before completing the study.

Construct Validity: All dependent variables were chosen carefully to ensure they represented what we sought to measure. Even though we automated most of the log analysis, we manually validated the results to mitigate any errors in calculation.

Conclusion Validity: The unpaired Mann-Whitney test was used to compare averages of two independent groups, which is suitable for small samples that are not normally distributed.

VII. DISCUSSION AND IMPLICATIONS

Our findings for RQ1 present clear differences between novices and non-novices in accuracy and speed of working on bug fixing tasks. It implies that prior exposure to and experience with the language is important in completing Alloy tasks. Despite Alloy being more readable and easier to understand in comparison to other formal languages, it is still challenging for novices to work on Alloy tasks without much background on formal methods and the language itself. RQ2 results show that despite the differences in experience, all participants found it difficult to build an Alloy model from scratch by looking at the natural language specification. Novices had a very difficult time completing the easier subproblems. In contrast, non-novices had difficulty in completing the harder task of completing the Insert Node and Remove Node predicates. By examining the logs, we found only two participants (N17 and E10) ran the Insert Node (add) predicate to see what instances the Analyzer created. Despite running add and the analyzer showing an incorrect instance, the participant could not recognize the issue and could not correct the predicate. An example of an incorrect instance of add is shown in Figure 3. The instance shows that the participant did not ensure that the difference between the linked lists used in the predicate is only the new node (Node0). They also did not notice that LinkedList2, which is the first argument of the predicate, does not contain any nodes. Another common mistake in writing the Add predicate was that participants did not specify to the analyzer that the two linked lists in the argument list of the predicate cannot be the same, which resulted in wrong instances (included in the replication package). Our observations highlight the importance of understanding the instances and visualization in Alloy and that the participants either did not know how to get information out of the instances or they were not able to understand their mistakes and fix them. RQ3 results show that the novices rely more on the Analyzer to find issues with the model and make more edits to fix bugs. We also observe that overall and in each task, the number of actions and edits were correlated, indicating that the instance generated by the analyzer can help the participant in deciding about their edits. Furthermore, in the novice population, we observe that accuracy is affected by the number of actions and edits. This implies that seeing the instances and interacting with the analyzer helps the novice participants solve the tasks more accurately. RQ4 results show that the comprehension of Alloy language is more related to spatial cognition ability and not as much related to working memory capacity. We can rationalize this difference by pointing out that Alloy tasks are not memory intensive tasks, and they are more related to the mathematical abilities of a person, which research shows is correlated with spatial recognition abilities [17].

The post-questionnaire results indicate that 16 users had trouble completing tasks. Out of the fourteen that said they

did not have trouble with the tasks, eight of them were non-novices. The findings about the patterns in solving tasks, specifically the fact that the participants make incremental and small changes, are consistent with the observations of Li et al. [15], who found that users perform consecutive actions on models that are only slightly different. The other study [14] is not comparable to ours because their tasks were different and dealt with which model outputs (minimality/maximality, UNSAT cores) people used the most.

Fig. 3: An incorrect instance of the Add predicate in model building

Suggested Usability Improvements: Based on the outcome of this study, we recommend the following improvements to the Alloy Analyzer as well as the areas of focus for teaching Alloy. Given that novice users struggled more with semantic tasks (while they performed relatively well on syntactic ones), teaching systematic methods for debugging an Alloy model to locate and fix semantic bugs more quickly (e.g., identifying an over-constraint that results in unsatisfiability of a model or a missing constraint that causes an assertion failure) is recommended. An extension to the Analyzer that automates this debugging process would also be valuable. In addition, given that both novices and non-novices tend to work with the Alloy models in an incremental manner, tool enhancements that further facilitate this incremental process would also be helpful (e.g., automated compilation and execution of the model given a change; generating suggestions for which part of the model the user should inspect next). Finally, the results of RQ2 suggest that even non-novice users of Alloy struggle with inspecting the generated instances to build models. In our experience, navigating visual instance diagrams is a non-trivial task that imposes a significant cognitive load, especially for models with complex relations. An alternative way of visualizing Alloy instances (e.g., one that supports domain-specific visualization [34]) may help overcome this challenge. For example, looking at Figure 3, we notice that the default visualization could highlight the involved nodes better, perhaps by changing their color, so users have an easier time understanding that it is not a correct instance of add. Such details are easy to miss for novices, especially in bigger instances. The same concept applies to visualizing counterexamples, where the affected nodes should be highlighted for easier comprehension. Overall, users rated their confidence as low/medium for all tasks. Perhaps these suggested changes will boost their overall confidence in finding incorrect instances.

Implications: Educators can make use of patterns we find in novices to better teach Alloy and help them avoid common mistakes while fixing bugs. In practice, one can choose developers in the industry who have stronger spatial skills to help with Alloy debugging. Formal specification languages such as Alloy are used for many safety critical applications [7], [8], [35]. With the amount of day to day activities that depend on software running safely and securely, it is important to study how developers interact with modeling software so that we can improve it to support novice modelers by learning how the experts/non-novices behave. We still have a long way to go in this area, as is clearly evidenced by the literature. We strongly believe that more studies on formal specification languages are needed for different types of tasks. The tasks used in this paper are only the beginning of paving the way for more studies that can be conducted in this space. One way we can improve usability, tool support, and adoption of software modeling tools such as Alloy is by learning (via studies such as this one) how modelers (users) interact with them. This behavior can then be used in conjunction with patterns found for novices and experts via static profiling [36].

VIII. CONCLUSIONS AND FUTURE WORK

The paper investigates how novices and non-novices perform bug fixing and model building tasks in Alloy. The results indicate that non-novices perform 54% better than novices on average and that participants perform better on syntactic tasks compared to semantic tasks. Non-novices spend less time working on the bug fixing tasks, and the participants in both groups use the action “Execute” most frequently while working on the Alloy models. The study results also show that small incremental changes are made before re-executing the model commands. The number of edits and actions performed is smaller with non-novices and predicts accuracy in the novice group. Results also show that the model building task was difficult even for non-novices. Several usability improvements in Alloy Analyzer visualizations are presented based on the study results. This study has taken the critical first step towards digesting a practice that software designers have always engaged in, leading to an understanding that promises to enable researchers, practitioners, and educators to improve rigorous software modeling. In future work, we plan to qualitatively explore the participants' patterns of problem solving and perform in-person studies to monitor closely the participants while working on Alloy tasks.

ACKNOWLEDGMENTS

This work is supported in part by the US National Science Foundation under Grant Numbers CNS 18-55753, CCF 18-55756, CCF 17-55890, and CCF 16-18132.

REFERENCES

[1] D. Jackson, Software Abstractions: logic, language, and analysis, 2012.
[2] “Alloy analyzer,” https://alloytools.org/download.html.
[3] N. Mirzaei, J. Garcia, H. Bagheri, A. Sadeghi, and S. Malek, “Reducing combinatorics in gui testing of android applications,” pp. 559–570, 2016.
[4] H. Bagheri, A. Sadeghi, J. Garcia, and S. Malek, “Covert: Compositional analysis of android inter-app permission leakage,” IEEE transactions on Software Engineering, vol. 41, no. 9, pp. 866–886, 2015.
[5] H. Bagheri, E. Kang, S. Malek, and D. Jackson, “A formal approach for detection of security flaws in the android permission system,” Formal Aspects of Computing, vol. 30, no. 5, pp. 525–544, 2018.
[6] M. Alhanahnah, C. Stevens, and H. Bagheri, “Scalable analysis of interaction threats in iot systems,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2020, pp. 272–285.
[7] J. P. Near, A. Milicevic, E. Kang, and D. Jackson, “A lightweight code analysis and its role in evaluation of a dependability case,” in Proceedings of the 33rd International Conference on Software Engineering, 2011, pp. 31–40.
[8] N. Mansoor, J. A. Saddler, B. Silva, H. Bagheri, M. B. Cohen, and S. Farritor, “Modeling and testing a family of surgical robots: an experience report,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 785–790.
[9] M. Spichkova and A. Zamansky, “Teaching of formal methods for software engineering,” in ENASE, 2016, pp. 370–376.
[10] S. Krishnamurthi and T. Nelson, “The human in formal methods,” in Formal Methods – The Next 30 Years, M. H. ter Beek, A. McIver, and J. N. Oliveira, Eds. Cham: Springer International Publishing, 2019, pp. 3–10.
[11] T. Weber, A. Zoitl, and H. Hußmann, “Usability of development tools: A case-study,” in 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), 2019, pp. 228–235.
[12] R. Kalantari and T. C. Lethbridge, “Characterizing ux evaluation in software modeling tools: A literature review,” IEEE Access, vol. 10, pp. 131 509–131 527, 2022.
[13] S. Abrahao, F. Bourdeleau, B. Cheng, S. Kokaly, R. Paige, H. Stoerrle, and J. Whittle, “User experience for model-driven engineering: Challenges and future directions,” in 2017 ACM/IEEE 20th International Conference on Model Driven Engineering Languages and Systems (MODELS). Los Alamitos, CA, USA: IEEE Computer Society, sep 2017, pp. 229–236. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/MODELS.2017.5
[14] N. Danas, T. Nelson, L. Harrison, S. Krishnamurthi, and D. J. Dougherty, “User studies of principled model finder output,” in International Conference on Software Engineering and Formal Methods. Springer, 2017, pp. 168–184.
[15] X. Li, D. Shannon, J. Walker, S. Khurshid, and D. Marinov, “Analyzing the uses of a software modeling tool,” Electronic Notes in Theoretical Computer Science, vol. 164, no. 2, pp. 3–18, 2006.
[16] N. J. Abid, J. I. Maletic, and B. Sharif, “Using developer eye movements to externalize the mental model used in code summarization tasks,” in Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, 2019, pp. 1–9.
[17] M. Hegarty and M. Kozhevnikov, “Types of visual–spatial representations and mathematical problem solving.” Journal of educational psychology, vol. 91, no. 4, p. 684, 1999.
[18] K. P. Raghubar, M. A. Barnes, and S. A. Hecht, “Working memory and mathematics: A review of developmental, individual difference, and cognitive approaches,” Learning and individual differences, vol. 20, no. 2, pp. 110–122, 2010.
[19] T. Baum, K. Schneider, and A. Bacchelli, “Associating working memory capacity and code change ordering with code review performance,” Empirical Software Engineering, vol. 24, no. 4, pp. 1762–1798, 2019.
[20] Z. Sharafi, Y. Huang, K. Leach, and W. Weimer, “Toward an objective measure of developers’ cognitive activities,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 30, no. 3, pp. 1–40, 2021.
[21] S. G. Vandenberg and A. R. Kuse, “Mental rotations, a group test of three-dimensional spatial visualization,” Perceptual and Motor Skills, vol. 47, no. 2, pp. 599–604, 1978, pMID: 724398. [Online]. Available: https://doi.org/10.2466/pms.1978.47.2.599
[22] N. Unsworth, R. P. Heitz, J. C. Schrock, and R. W. Engle, “An automated version of the operation span task,” Behavior research methods, vol. 37, no. 3, pp. 498–505, 2005.
[23] N. Mansoor and B. Sharif, “An empirical study assessing software modeling in alloy-replication package.” [Online]. Available: https://osf.io/5p6e4/
[24] N. Juristo and A. M. Moreno, Basics of Software Engineering Experimentation, 1st ed. Springer Publishing Company, Incorporated, 2010.
[25] K. Stol and B. Fitzgerald, “The ABC of software engineering research,” ACM Trans. Softw. Eng. Methodol., vol. 27, no. 3, pp. 11:1–11:51, 2018. [Online]. Available: https://doi.org/10.1145/3241743
[26] J. Siegmund, C. Kästner, J. Liebig, S. Apel, and S. Hanenberg, “Measuring and modeling programming experience,” Empirical Software Engineering, vol. 19, no. 5, pp. 1299–1334, 2014.
[27] A. R. Conway, M. J. Kane, M. F. Bunting, D. Z. Hambrick, O. Wilhelm, and R. W. Engle, “Working memory span tasks: A methodological review and user’s guide,” Psychonomic bulletin & review, vol. 12, no. 5, pp. 769–786, 2005.
[28] T. von der Malsburg, “Py-Span-Task – A Software for Testing Working Memory Span,” Jun. 2015. [Online]. Available: http://dx.doi.org/10.5281/zenodo.18238
[29] N. Mansoor, “3d mental rotation task in java.” [Online]. Available: https://github.com/niloofarmansoor/3DMentalRotation
[30] K. Wang, A. Sullivan, and S. Khurshid, “Arepair: a repair framework for alloy,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 2019, pp. 103–106.
[31] K. Wang, “Arepair: A repair framework for alloy - experiment repository.” [Online]. Available: https://github.com/kaiyuanw/ARepair/tree/master/experiments/models
[32] Alloy, “River Crossing Puzzle,” 2022. [Online]. Available: https://alloytools.org/tutorials/online/sidenote-RC-puzzle.html
[33] JASP Team, “JASP (Version 0.14.1)[Computer software],” 2022. [Online]. Available: https://jasp-stats.org/
[34] T. Dyer and J. W. B. Jr., “Sterling: A web-based visualizer for relational modeling languages,” in 8th International Conference on Rigorous State-Based Methods ABZ, 2021, pp. 99–104.
[35] S. Pernsteiner, C. Loncaric, E. Torlak, Z. Tatlock, X. Wang, M. D. Ernst, and J. Jacky, “Investigating safety of a radiotherapy machine using system models with pluggable checkers,” in International Conference on Computer Aided Verification. Springer, 2016, pp. 23–41.
[36] E. Eid and N. A. Day, “Static profiling of alloy models,” IEEE Transactions on Software Engineering, vol. 49, no. 2, pp. 743–759, 2023.
