
2015 IEEE/ACM 37th IEEE International Conference on Software Engineering

Views on Internal and External Validity
in Empirical Software Engineering
Janet Siegmund, Norbert Siegmund, and Sven Apel
University of Passau, Germany

Abstract—Empirical methods have grown common in software engineering, but there is no consensus on how to apply them properly. Is practical relevance key? Do internally valid studies have any value? Should we replicate more to address the tradeoff between internal and external validity? We asked the community how empirical research should take place in software engineering, with a focus on the tradeoff between internal and external validity and replication, complemented with a literature review about the status of empirical research in software engineering. We found that the opinions differ considerably, and that there is no consensus in the community on when to focus on internal or external validity and how to conduct and review replications.

[Fig. 1. Preferences for internal vs. external validity among program-committee and editorial-board members: internal 20%, balance 29%, external 51%. Sample participant statements: "[With internal validity] you might get a more 'reliable' result, but the result could not be used to explain anything about the real world"; "[...] we first need to clearly control [confounding factors] before eventually being able to generalise"; "include two studies [in a paper], one maximizing internal validity and the other maximizing external validity"; "Without internal validity, the results cannot be trusted"; "[internal] would show no value to [the] SE community".]
978-1-4799-1934-5/15 $31.00 © 2015 IEEE    DOI 10.1109/ICSE.2015.24    ICSE 2015, Florence, Italy

I. INTRODUCTION

Empirical research in software engineering has come a long way. From being received as a niche science, the awareness of its importance has increased. In 2005, empirical studies were found in about 2% of papers of major venues and conferences [31], while in recent years, almost all papers of ICSE, ESEC/FSE, and EMSE reported some kind of empirical evaluation (see Section III). Thus, the amount of empirically investigated claims has increased considerably.

With the rising awareness and usage of empirical studies, the question of where to go with empirical software-engineering research is also emerging. New programming languages, techniques, and paradigms, new tool support to improve debugging and testing, and new visualizations to present information emerge almost daily, and claims regarding their merits need to be evaluated—otherwise, they remain claims. But how should new approaches be evaluated? Do researchers focus on internal validity and control every aspect of the experiment setting, so that differences in the outcome can only be caused by the newly introduced technique? Or do they focus on external validity and observe their technique in the wild, showing a real-world effect, but without knowing which factors actually caused the observed difference?

Both options, maximizing internal or maximizing external validity, have their benefits and drawbacks, which we illustrate by the example of evaluating the influence of using a new tool on the performance of beginning programmers: The first option (maximizing internal validity) allows researchers to exclude almost all influencing factors, so that they can observe in a highly controlled setting whether the new tool improves one aspect of the every-day work of beginning programmers. This way, researchers can draw sound conclusions about the reasons for improvement or degradation, but at the cost of generalizability. With the second option (maximizing external validity), researchers can observe whether the tool has any effect on different types of developers in an every-day setting, but at the cost of not being able to unambiguously understand why the new tool affects the work flow—maybe it is just because it is new.

There is an inherent tradeoff in empirical research: Do we want observations that we can fully explain, but with limited generalizability, or do we want results that are applicable to a variety of circumstances, but where we cannot reliably explain underlying factors and relationships? Due to the options' different objectives, we cannot choose both. Deciding for one of these options is not easy, and existing general guidelines, for example by Wohlin [36] or Juristo and Moreno [12], are too general to assist in making this decision.

With our work, we want to raise the awareness of this problem: Should we focus on internal or external validity? Should we focus on one first and then on the other? Should we balance both kinds of validity, not maximizing either? In the end, every time we plan an experiment, we must ask ourselves: Do we ask the right questions? For example, is it better to ask principal questions, such as whether static type systems ease program comprehension compared to dynamic type systems, or is it better to ask broadly which commonly used programming languages are superior in which circumstances? Do we want pure, ground research, or applied research with immediate practical relevance? Is there even a way to design studies such that we can answer both kinds of questions at the same time, or is there no way around replications (i.e., exactly repeated studies or studies that deviate from the original study design only in a few, well-selected factors) in software-engineering research?

In the remainder of this paper, we present the results of a literature review to evaluate the kind and extent of empirical methods used in software engineering, and to get an impression of the role of internal and external validity and replications (Sec. III), followed by example studies maximizing one or the other (Sec. IV). Thereupon, as a main contribution of this paper, we present the results of an online survey among 79 program-committee and editorial-board members—"key players" in their field—of 11 major software-engineering venues regarding their perception and opinion on how to address the tradeoff between internal and external validity (Sec. V to VII).

In a nutshell, we found large differences in the opinions regarding the importance of internally and externally valid studies and a lack of awareness of the tradeoff between the two, which we illustrate in Figure 1. Furthermore, many survey participants are aware of the need for replication, but there is substantial disagreement about the kind and extent of the delta that is necessary for a proper replication. Thus, reviewers have different expectations of a paper, and there are no proper guidelines for reviewing a paper in this regard.

Our research is meant to stimulate researchers across multiple areas to rethink their expectations and standards of empirical research, including educating (young) software-engineering researchers, assisting researchers in evaluating their work, helping reviewers judge the soundness of a research paper, and providing guidelines for planning empirical research.

In summary, we make the following contributions:
• An overview of the state of the art of empirical software engineering in three major (empirical) software-engineering venues, with a focus on the role of internal and external validity and replication.
• An overview of the opinions of the "key players" of the software-engineering community, based on a survey among 79 program-committee and editorial-board members of 11 major software-engineering venues.
• Suggestions on how to conduct empirical research in software engineering.
• A discussion of open issues, meant to initiate a discussion in the community.

All data from the literature review and survey are available on a supplementary Web site: http://www.infosun.fim.uni-passau.de/spl/janet/ese/. As an overarching theme of our work, let us quote David Parnas on software-engineering research [22]:

"It is time to stop 'exploring' and start experimenting."

II. RELATED WORK

There is considerable work concerned with the status of empirical research or guidelines on how to conduct empirical research in the area of software engineering. However, we are not aware of any work surveying program-committee or editorial-board members to assess their opinions and suggestions for addressing the tradeoff between internal and external validity in empirical software engineering.

Guidelines: There is a long history of advocating and evaluating empirical research in software engineering. As early as 1986, Basili and others published guidelines on empirical research [3], comprising a framework to describe experimental work. Furthermore, Basili proposed the goal-question-metric approach to guide researchers in defining their research goals, such that the context of an experimental setting is clearly described [1]. Kitchenham and Charters proposed guidelines on how to conduct systematic surveys in software engineering, following guidelines from medical research [18]. Kitchenham and others complement this research with a study on systematic literature reviews [16] and on the repeatability of systematic literature reviews [17]. Ko and others present guidelines for conducting controlled experiments to evaluate software-engineering tools with human participants [19]. These guidelines arrange research activities along ten steps, including recruitment and training of participants as well as task design. Siegmund and Schumann provide an overview of confounding parameters that influence the outcome of an experiment and that need to be controlled for [28]. Sjøberg and others make the case for more realistic settings in software-engineering research, stressing the role of funding to pay professional developers [29]. Furthermore, they report on current problems of empirical research, among which the lack of practical relevance is still an issue [30]. As solutions, they suggest giving more competence to empirical researchers (e.g., by training) or improving the collaboration between industry and academia. Their vision of empirical research in 5 to 10 years strives for more practical relevance, more synthesis of knowledge, and more theory building. Tichy and others reported on the status of experimental research in software engineering compared to optical engineering and neural computation, concluding that there is only little empirical research in software engineering [34]. Consequently, Tichy stated that computer scientists should experiment more [35]. He also provided guidelines for reviewing empirical research, which describe common arguments that reviewers use to reject a paper, and explanations for why these are not valid reasons for rejection [33].

Replication: There is considerable work in the direction of replication (i.e., a repetition of an experiment under similar conditions, but with specified variation, such as a new sample [36]). Basili and others stated that "too many studies tend to be isolated and are not replicated, either by the same researchers or by others" [2]. They describe a framework for categorizing related studies, which can then be viewed in context, rather than viewing each study in isolation. Shull and others describe the role of exact and conceptual replication in software engineering, both of which are standard in behavioral science [27], but not in software-engineering research, as our literature review and our survey show. Juristo and Vegas describe the role of non-exact replications, explaining that exact replications are almost impossible to conduct in software-engineering research, because the context is so complex (e.g., how techniques were applied as well as the knowledge of participants and how they were trained) [13]. Thus, many researchers give up (e.g., [20]) or do not publish their efforts because of contradicting results. To improve this situation, Juristo and Vegas suggest loosening the restrictions on the exactness of replication studies, so that some obstacles of replication studies can be removed.

Status of empirical research: Do all these guidelines and insights affect the status of empirical research? There is evidence that the amount of empirical research has increased: While Sjøberg and others found that in major software-engineering venues from 1992 to 2002, only 1.9% of the papers reported a controlled experiment [31], this fraction has increased in recent years, for example, as observed by Ivarsson


and Gorschek in the domain of requirements engineering [9]. In our literature review, we found a large number of papers that conducted some sort of empirical evaluation (Section III).

But does that also hold for the quality? Ivarsson and Gorschek found that, in requirements engineering, the rigor of empirical studies has improved, but practical relevance has not [9]. Nagappan and others found that selected subject systems cover a wide range of different dimensions, such as team size and project size, which positively affects external validity [21]. Sjøberg and others [31] as well as Dybå and others [5] noted, among others, that reports of empirical studies often lack important details. For example, threats to validity are often discussed vaguely and unsystematically, despite the numerous guidelines on how to describe empirical studies [4], [10], [11], [15], [36]. Kampenes and others analyzed the conduct of quasi-experiments and found that the design, analysis, and reporting can be improved [14].

Thus, despite the long history of advocating empirical research in software engineering, there is still much room for improvement, which Zannier and others nicely phrased [37]:

"[G]iven the numerous clear and repeated messages of [numerous researchers], which date back almost 20 years and provide results that date even further in history, we must ask ourselves, at what point will the message become clear?"

Our work—in particular, the analysis of the survey results—strives to make this message clearer.

III. STATE OF THE ART: A LITERATURE REVIEW

As this has not been addressed by previous work (cf. Section II), we conducted a literature review of three of the major (empirical) software-engineering venues, to get an overview of the current status of empirical research in software engineering. Our sample consisted of all 405 full technical papers of ICSE (2012, 2013), ESEC/FSE (2011 to 2013), and EMSE (2011 to 2013), the major venues in (empirical) software engineering. While this selection is limited, it still gives a good impression of the state of the art. We manually examined each paper regarding the use of empirical methods, recruitment of human participants (students or professionals), replication, and presentation of validity. To this end, we skimmed each paper and searched with a set of keywords¹. In Figure 2, we provide an overview of the process and findings of the literature review.

[Fig. 2. Fraction of empirical studies that meet certain criteria. Numbers in circles represent the absolute number of papers, to which the circle area is proportional. Gray numbers refer to the paragraphs in the text.]

First, we determined whether an empirical method (e.g., case study, controlled experiment) was applied, which was the case in an overwhelming 381 (94%) papers. This seems like a large increase compared to the 1.9% that Sjøberg and others found about 10 years earlier [31]; but, to be fair, they only included controlled experiments with human participants.

Second, we also determined whether a study was conducted with or without human participants. Of all 405 papers, 87 (21%) recruited human participants, and 294 (73%) had no human participants, but evaluated other properties, such as performance or test coverage. Thus, we can draw a closer comparison with the Sjøberg study, indicating that the human factor is nowadays considered more important.

Third, there is diversity in the selection of human participants. Of the 87 studies involving human participants, 38 recruited students, 31 professionals, and 10 both. In 7 papers, the participants were not specified more closely, and in 1 study, researchers used Mechanical Turk. Thus, relying on professional programmers is not the exception.

Fourth, we determined whether a paper reported on a replication: Of the 381 papers, 347 (91%) did not conduct a replication; 32 (8%) did. Of these, 24 reported an internal replication (i.e., by the same group) and 7 an external replication (i.e., by another group); one paper was not clear on the kind of replication. This result suggests that replication studies, especially external ones, are underrepresented in software engineering.

Fifth, we looked at the discussion of validity. Of the 381 papers using an empirical method, surprisingly, 177 (46%) did not explicitly mention threats to validity at all. In 108 (28%) papers, the authors discussed threats to validity, but did not differentiate between internal or external (or other kinds of) validity. In a few papers (5), the discussion was not explicit, but hidden in a paragraph of the discussion or conclusion. The remaining 88 (23%) papers differentiated between different kinds of validity (mostly internal, external, construct, and conclusion validity). While this result may be biased by the selection of venues (i.e., 2 conferences and 1 journal; conferences impose a strict page limit, to which the discussion of validity is often sacrificed), it nevertheless suggests that there is considerable room for improvement in the discussion and documentation of threats to validity.

To summarize, empirical research seems to be an integral part of software-engineering research nowadays. However, from the methodological point of view, the individual standards differ considerably.

¹Keywords: empirical, student, profession, developer, subject, participant, human, repeat, replicat, further.
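As a sanity check on the arithmetic, the percentages reported in this section can be recomputed from the absolute counts. A minimal Python sketch (counts copied from the text above; the `pct` helper is ours, not from the paper):

```python
# Absolute paper counts from the literature review (Section III).
TOTAL = 405              # full technical papers examined
EMPIRICAL = 381          # papers applying an empirical method
HUMAN = 87               # papers recruiting human participants
NON_HUMAN = 294          # papers without human participants
NO_REPLICATION = 347     # empirical papers without a replication
REPLICATION = 32         # empirical papers reporting a replication
NO_THREATS = 177         # no threats-to-validity discussion at all
UNDIFFERENTIATED = 108   # threats discussed, kinds not differentiated
DIFFERENTIATED = 88      # kinds of validity differentiated

def pct(part, whole):
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)

print(pct(EMPIRICAL, TOTAL))           # 94, as reported
print(pct(HUMAN, TOTAL))               # 21
print(pct(NON_HUMAN, TOTAL))           # 73
print(pct(NO_REPLICATION, EMPIRICAL))  # 91
print(pct(REPLICATION, EMPIRICAL))     # 8
print(pct(NO_THREATS, EMPIRICAL))      # 46
print(pct(UNDIFFERENTIATED, EMPIRICAL))# 28
print(pct(DIFFERENTIATED, EMPIRICAL))  # 23
```

Note that the percentage bases differ: the first three fractions are relative to all 405 papers, while the replication and validity fractions are relative to the 381 empirical papers.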



IV. MAXIMIZING INTERNAL OR EXTERNAL VALIDITY

To illustrate the merits of internal and external validity as well as to provide a foundation for the survey, we introduce two studies as running examples, one maximizing internal, the other maximizing external validity.

A study setup that maximizes internal validity was designed by Hanenberg [6]. He evaluated whether static type systems, as compared to dynamic type systems, influence development time. To control for confounding parameters—which might influence the result besides the merits of the two kinds of type systems—he developed a language and a corresponding IDE solely for the purpose of this experiment. The language and IDE differed only in the type system used, nothing else. Furthermore, Hanenberg recruited students with similar programming experience as participants and let them implement two small tasks. In this highly controlled setting, he was able to conclude that a difference in the performance of student programmers is caused only by the type system, nothing else.

However, how can one generalize the result of such a controlled experiment? Should developers switch to another type system? Obviously, giving a recommendation is difficult, because the setting of Hanenberg's experiment was artificial. So, how about another setting, in which several different professional developers from different companies work with different programming languages on every-day tasks? Röhm and others used such a setting to observe how professional programmers work with source code [23]. While being realistic, this setup has a lot of confounding parameters that have not been controlled for, such as the complexity of the task, the programming language, and the programming experience of the developers. Thus, while such a general setting produces general, potentially practically relevant results, it is unclear how the results emerged—many factors could have affected them.

Both kinds of study setting are viable and lead to interesting results, but which one is preferable, and in which situation? We conducted an online survey among 79 program-committee and editorial-board members to provide answers to this and related questions.

V. SURVEY SETUP

In this section, we give a detailed overview of our survey, following the guidelines provided by Jedlitschka and others [10].

A. Objective

With our survey, we targeted several research objectives:²

RO1 Assess the awareness of the community of the tradeoff between external and internal validity.
RO2 Assess the opinion of the community regarding how to address this tradeoff.
RO3 Assess the opinion of the community regarding the role of replication.

These objectives emerged from discussions with researchers at different conferences and workshops as well as from reviews of empirical research papers. We experienced that, sometimes, there is a lack of appreciation for internally valid studies, and that external validity or practical relevance of a study is seen as most important. Thus, we assess the awareness of the community (RO1) as well as its suggestions on how to address this tradeoff (RO2). Furthermore, in other disciplines, replicating studies is a commonly accepted way to address this tradeoff—in medicine or physics, only replicated results are accepted. Thus, we asked the community what they think of replication to address this tradeoff in software-engineering research (RO3).

B. Participants

As participants, we contacted the program-committee and editorial-board members of major (empirical) software-engineering venues. We decided to balance venues with an empirical focus and venues with a general software-engineering focus to reduce the bias toward empirically interested researchers. This way, we can assess the opinion of renowned researchers and experts of their area (empirical and not empirical). Clearly, the "key players" shape the future of software-engineering research by deciding on the acceptance of research papers, guiding young researchers, and advising funding agencies.

To ensure that the participants have been reviewing current papers, we extracted the e-mail addresses of members active in the years 2010 to 2013 from the following venues:
• ASE (Automated Software Engineering)
• EASE (Evaluation and Assessment in Software Engineering)
• ECOOP (Object-Oriented Programming)
• EMSE (Empirical Software Engineering)
• ESEC/FSE (Foundations of Software Engineering)
• ESEM (Empirical Software Engineering and Measurement)
• GPCE (Generative Programming)
• ICPC (Program Comprehension)
• ICSE (Software Engineering)
• ICSM (Software Maintenance)
• OOPSLA (Object-Oriented Programming)
• TOSEM (Software Engineering and Methodology)
• TSE (Software Engineering)

On average, a participant was on the program committee or editorial board of 3.6 (± 2) different venues, with a minimum of 1 and a maximum of 9 different venues.

C. Questionnaire and Conduct

We designed a questionnaire that covers several aspects of empirical research, in particular focusing on internal and external validity and replication. We included several closed questions, for each of which we additionally asked the participants to elaborate on their decision. Furthermore, we included several open questions asking for suggestions, for example, "Do you have any suggestions on how empirical researchers can solve the dilemma of internal vs. external validity of empirical work in computer science?". All questions were optional.

To ensure that the participants knew what a highly internally and a highly externally valid study looks like, we described a research question inspired by Hanenberg's study (Sec. IV) and two settings to evaluate the corresponding research question, one maximizing internal validity and one maximizing external validity. In Table I, we list all survey questions and map them to our research objectives.

²We refer to "objectives" rather than "questions" to avoid any confusion with the actual questions of the survey.



TABLE I
QUESTIONS OF THE SURVEY TO ANSWER THE RESEARCH OBJECTIVES (RO). BEFORE THE QUESTIONS, WE DESCRIBED A RESEARCH QUESTION AND TWO SCENARIOS FOR ITS EVALUATION, ONE MAXIMIZING INTERNAL AND THE OTHER EXTERNAL VALIDITY.

RO Questions Answer options


1, 2 Which option would you prefer for an evaluation?  Max. internal validity,  Max. external validity
[We asked this question two times, for human and non-human studies]  No preference
1 Would it be a reason to reject a paper that does not choose your favorite option?  Yes,  No
1, 2 In your opinion, what is the ideal way to address research questions like the one outlined above? Open
1 Did you recommend to reject a paper in the past mainly for the following reasons?  Int. validity too low,  Ext. validity too low
1, 2 For research questions like the one presented above (FP vs. OOP), do you prefer more practically  Applied,  Basic,  No preference
relevant research or more theoretical (ground) research?
1 Have you changed how you judged a paper regarding internal and external validity?  Yes,  No
1, 3 What do you think about a reviewing format with several rounds, but with publication guarantees? Open
1, 2 Do you have any suggestions on how empirical researchers can solve the dilemma of internal vs. Open
external validity of empirical work in computer science?
3 During your activity as a reviewer, how often have you reviewed a replicated study?  Never,  Sometimes,  Regularly
3 In general, how were the replications rated by you... by your fellow reviewers?  Accept,  Borderline,  Reject
3 During your activity as a reviewer, did you notice a change in the number of replicated studies?  Yes, increase,  Yes, decrease,  No
3 Do you think we need to publish more experimental replications in computer science?  Yes,  No
3 As a reviewer of a top-ranked conference, would you accept a paper that, as the main contribution,...
...exactly replicates a previously published experiment of the same group?  Yes,  No,  I do not know
...exactly replicates a previously published experiment of another group?  Yes,  No,  I do not know
...replicates a previously published experiment of the same group, but increases external validity?  Yes,  No,  I do not know
...replicates a previously published experiment of another group, but increases external validity?  Yes,  No,  I do not know
...replicates a previously published experiment of the same group, but increases internal validity?  Yes,  No,  I do not know
...replicates a previously published experiment of another group, but increases internal validity?  Yes,  No,  I do not know
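For analysis, the RO-to-question mapping of Table I can be expressed directly as data. A minimal illustrative sketch (question texts abbreviated; the `QUESTIONS` structure and `questions_for` helper are our own encoding, not an artifact of the survey):

```python
# Each survey question from Table I, tagged with the research
# objectives (ROs) it contributes to. Texts are abbreviated.
QUESTIONS = [
    ({1, 2}, "Which option would you prefer for an evaluation?"),
    ({1},    "Would it be a reason to reject a paper that does not choose your favorite option?"),
    ({1, 2}, "What is the ideal way to address research questions like the one outlined above?"),
    ({1},    "Did you recommend to reject a paper in the past mainly for these reasons?"),
    ({1, 2}, "Do you prefer more practically relevant or more theoretical (ground) research?"),
    ({1},    "Have you changed how you judged a paper regarding internal and external validity?"),
    ({1, 3}, "What do you think about a reviewing format with several rounds?"),
    ({1, 2}, "Do you have suggestions on how to solve the internal vs. external validity dilemma?"),
    ({3},    "How often have you reviewed a replicated study?"),
    ({3},    "How were the replications rated by you and your fellow reviewers?"),
    ({3},    "Did you notice a change in the number of replicated studies?"),
    ({3},    "Do you think we need to publish more experimental replications?"),
    ({3},    "Would you accept a paper whose main contribution is a replication (six variants)?"),
]

def questions_for(ro):
    """All question texts that contribute to a given research objective."""
    return [text for ros, text in QUESTIONS if ro in ros]

print(len(questions_for(1)))  # 8 questions feed RO1
print(len(questions_for(2)))  # 4 questions feed RO2
print(len(questions_for(3)))  # 6 questions feed RO3
```

Grouping responses this way mirrors how Section VI structures the discussion along research objectives rather than along individual questions.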

We used SurveyGizmo for our survey. In May 2014, we e-mailed each program-committee and editorial-board member and asked them to complete the survey within three weeks. Of the 807 people we contacted, 94 completed the questionnaire, leading to the typical 10% response rate. Some members preferred to have all questions on one page, so we created a corresponding version for them.

VI. RESULTS AND DISCUSSION

To analyze the answers of the survey, we used an open card-sorting technique [8]. To this end, we looked for higher-order themes in the open answers of the participants for each question. Overall, we spent 19 (open questions) × 2 hours (per question) = 38 hours on categorizing 776 answers. We identified several categories per question, several of which occurred across questions.

Instead of discussing all identified categories, we structure this section along our research objectives. For each objective, we present descriptive statistics of the closed questions (if applicable), followed by a summary of the categories we found, with as minimal interpretation as possible (to separate data analysis from interpretation). On the supplementary Web site, we provide all identified categories per question, including their frequency of occurrence. We conclude this section with an interpretation and the insights we gained.

A. RO1: Awareness of the tradeoff between external and internal validity

For this objective, there are no closed questions, so we directly start with the categories we identified in the free-text responses of the participants.

1) Categories: The responses show a mixed picture. In particular, we found answers indicating that participants are aware of this issue, but also statements lacking this awareness.

Awareness of tradeoff: Participants stated that both kinds of validity should be balanced, which we found 14 times across all questions related to RO1.

Unawareness of tradeoff: By contrast, we also found a profound lack of awareness regarding the tradeoff. One reviewer stated s/he would reject a paper that describes a study that maximizes internal validity, because it

"[w]ould show no value at all to SE community".

Another participant stated that his/her opinion regarding the kind of validity changed, such that s/he now can appreciate studies with external validity more, and that s/he has "come to loathe ivory tower toy examples".

Other interesting insights: We also often found that reviewers stated that "it depends" (35) on different aspects, for example, on the research question, on the study subjects, or on the claims, indicating that the kind of validity plays a minor role in judging the merits of a study.

Human and non-human studies: There is disagreement on whether, for human and non-human studies, the same (6) or different (11) criteria regarding validity should be applied. The reasons for different criteria lie, among others, in the effort of human studies:

"Non-human experiments are be able to scale up to realistic situations at reasonable cost, in contrast to human experiments.",

they lie in the bias caused by human studies:

"Removing humans from the exercise reduces the challenges for internal validity. In that context, knowing how general the approach was would seem a more important issue to address.",

or in the view that researchers should maximize internal validity for non-human studies (because this is possible in the first place):



“[...] systems, unlike humans, can be inspected and explained fully. We can produce extremely precise theories about the behavior of software that we create and we should.”.

Arguments in favor of applying the same criteria for human and non-human studies arise, among others, from the fact that adoption for industry is the key point of software-engineering research:

“[...] assess the potential for industrial adoption.”;

or that, independent of the kind of study, both kinds of validity are necessary to get a thorough understanding:

“[...] we need both studies (and possibly more) to get a thorough understanding”.

Interestingly, one even stated the equality of human and non-human studies as ground truth:

“[...] It makes no difference with or without humans! We are talking about software technologies...”.

2) Consequences: The magnitude of difference in the opinions surprised us, ranging from the view that internally valid studies would have no value to software-engineering research to the view that only a combination of internally and externally valid studies lets us understand a problem in detail. What can be learned from this result is that researchers should be aware that there is a tradeoff and that both kinds of validity add valuable information to our body of knowledge.

Furthermore, we would like to point researchers to the fact that there are strong differences in the opinions of key players in software-engineering research. If there is no consensus—some might not even be aware of this situation—it is difficult to properly shape the future of software-engineering research. Currently, getting a paper on a study published seems like a game of chance: If authors get a reviewer who is not open to the kind of study that the authors report on, chances are that the reviewer will argue strongly against the paper, possibly leading to the rejection of a methodologically sound study.

Generally speaking, there are no transparent community standards on empirical research. On the contrary: Different program-committee and editorial-board members have strongly different opinions about internal and external validity without even knowing it. This is partly reflected in the large number of participants stating that the kind of study depends on several factors. One participant even stated that it also depends on the resources of the authors of the paper:

“...what resources did the authors have? What I expect from a paper out of Cisco is different from a paper out of a university. [...]”

Exaggerating this statement, it could mean that it is ok to recruit students as participants in studies conducted by researchers at universities, because they lack the resources to recruit professionals; studies conducted by or in companies, such as Cisco, however, should recruit professionals, because they have the according resources. Clearly, knowing the authors would help in understanding certain tradeoffs regarding resources, but it would also prohibit conducting double-blind reviews, which is current practice for several conferences, such as SIGCSE or ECOOP (2014).

Overall, these different opinions show the fundamental need for a community-agreed standard on how to conduct empirical research in software engineering.

Key insights:
• There is a mixed degree of awareness of the tradeoff between internal and external validity.
• The opinions on how to handle the tradeoff differ to a large extent.
• There are different points of view on whether the same or different criteria regarding internal and external validity should be applied for human and non-human studies.
• There are no transparent community standards for handling the tradeoff between internal and external validity.

Fig. 3. Frequency distribution of answers. (a)/(b): Which option would you prefer for an evaluation? (a): human studies, (b): non-human studies. [I]nternal validity, [E]xternal validity, [N]o preference. (c): Do you prefer more practically relevant research or more theoretical (ground) research? [G]round research, [P]ractical research, [N]o preference.

B. RO2: Opinion of the community regarding how to address this tradeoff.

1) Descriptives: In Figure 3, we show the answers to the three closed questions for RO2. They indicate a tendency toward externally valid studies with practical relevance.

2) Categories: Again, we got a mixed picture of which questions researchers should ask, but with a clear preference for externally valid studies. We also found that several reviewers prefer balancing internal and external validity. The most important reason to favor external validity is practical relevance:

“[...] external validity is very important since it provides indications about the potential for industrial adoption.”

“Leave the ivory tower. If actual insights for people’s lives are supposed to be the outcome of research, it better be applied to such problems.”

“[...] experience from professional developers seems more relevant.”

These and further statements indicate that external validity and practical relevance are seen as equivalent. However, this is not entirely true, as we will discuss shortly.

In addition to focusing on one study, some participants stated that researchers should replicate studies or conduct multiple studies on the same topic, as inspired by other sciences. For example, to declare the discovery of the Higgs boson, many replications had to be conducted.
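As an aside, the descriptives behind frequency distributions such as those in Figure 3 are plain tallies of the closed-question answers. The following sketch illustrates this only in principle; the response data are made up, not the survey's actual data (which are on the supplementary Web site):

```python
from collections import Counter

# Hypothetical closed-question answers; the option codes mirror
# Fig. 3 (I = internal validity, E = external validity, N = no
# preference), but the data themselves are invented for illustration.
responses = ["E", "I", "N", "E", "E", "I", "N", "E", "I", "E"]

# The frequency distribution is a plain per-option tally.
frequencies = Counter(responses)

for option in ("I", "E", "N"):
    count = frequencies[option]
    share = 100 * count / len(responses)
    print(f"{option}: {count} ({share:.0f} %)")
```

Applied to the actual responses, such a tally yields exactly the bar heights plotted in the figures.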



3) Consequences:

External validity vs. practical relevance: Several participants equated external validity with practical relevance, leading us to two interpretations:

• First, external validity describes how the results obtained in one experimental setting can be applied to different settings [25], for example, to different programming languages, tasks, or participants. Many answers indicate that a study conducted with professional programmers automatically has higher external validity than a study with students. However, if researchers use professional programmers in their every-day work, the results cannot necessarily be applied to students in a university context. Thus, a practically relevant study does not necessarily have high external validity. Instead, practical relevance is described by the term ecological validity [25]. Admittedly, we might have slightly influenced our participants by the way we asked the questions, as we discuss in Section VIII.

• Second, studies involving students are not seen as practically relevant, because the results are applicable to professionals only to a limited extent. While it is true that much research is conducted to improve the life of the professional programmer, students (or beginning programmers) are also an important population to be studied, especially when they have considerable programming experience. Furthermore, there are studies showing that, in certain scenarios, students are comparable to professionals [6], [7], [32].

Practical impact of studies: Second, some participants stated that studies should have an immediate practical impact:

“My preference towards external validity is only slight. I am worried that maximizing internal validity easily creates overly academic papers that provide little impact. [...]”.

Thus, a single study is not seen as a piece of the puzzle, but each study needs to immediately lead to general conclusions. Some reviewers suggested to look at the standards in other sciences, specifically referring to replications being common:

“[studies in medicine or biology] have hundreds/thousands of participants, over several years, and address very narrow issues (e.g. is medicine X better than Y). We don’t see there studies that use 20 participants, are done in 2 months, and attempt to answer questions of the caliber ‘is CT better than MRI’.”.

Looking at other sciences, it is certainly advisable to get away from the view that a single study must provide a definite answer to a substantial research question. Instead, combining different kinds of studies, for example, a case study to explore hypotheses and controlled experiments to evaluate these hypotheses, is a feasible strategy to address the tradeoff.

Fig. 4. Frequency distribution of answers. (a): How often have you reviewed a replication? [N]ever, [S]ometimes, [R]egularly. (b)/(c): How were the replications rated... (b): ...by you? (c): ...by others? [A]ccept, [B]orderline, [R]eject. (d): Did you notice a change in the number of replicated studies? [I]ncrease, [D]ecrease, [N]o. (e): Do you think we need to publish more experimental replications in computer science? [Y]es, [N]o.

Key insights:
• There is a misconception of the relation between external validity and practical relevance.
• A single study is not seen as a piece of the puzzle, but requires immediate practical impact; this is in contrast to the view that studies provide incremental insights into a complete big picture.
• Replication studies have proved successful in other sciences and should be considered more in software-engineering research.

C. RO3: Opinion of the community regarding the role of replication

1) Descriptives: In Figure 4, we show the answers to the closed questions regarding RO3. In essence, many participants think that there are too few replications in our field.

2) Categories: Even though replications are common in other sciences to increase the credibility of results, they are not as accepted in software engineering. For example, some participants stated that there should always be something novel in a study; one even stated:

“Getting a publication accepted that doesn’t contribute anything but a new experiment while assessing the same question (not even adding artifacts) is a good example of hunting for publications just for the sake of publishing. Come on.”

However, the majority of the participants stated that we need more replication in software engineering, showing awareness of this issue. They gave several reasons for the need and the lack of replication, as we discuss next.

Delta of a replication: Participants who appreciate replication studies said that replications are useful, as long as they add information to the body of knowledge. However, participants do not agree on what “add information” means. In general, there are many different points of view regarding how to conduct a replication. Many participants say that a replication should add something new or improve an aspect of the original study, for example, not to make the same methodological mistakes again. Some say that a replication should increase



external validity of a study, while others state that internal validity should be increased, or that, at least, the replication has to be done by a different group. So, apparently, there is no agreement in the community on the delta of a replication compared to its original study.

Reasons for lack of replications: The participants mentioned several reasons for why we do not see many replications in software-engineering research, including that they are difficult to publish and that incentives, platforms, guidelines, standards, as well as replication packages are missing:

“I have seen few replications (and perform myself a few) because they are too difficult to publish: there will always be a (dumb) reviewer to say ‘this is not novel!’...”

“It seems that replication is rarely done since it is costly, hard to do (often not all details, tools, software, or datasets involved in an earlier study are available), and it carries a low-impact factor (at least, in certain venues).”

“I am not sure though [replication studies] would be appropriate for conferences. A replication study is appropriate for conference if new findings arise. I think that journals are the right outlets for replication studies.”

Interestingly, one participant in our survey stated that “[r]eplications are common”.

Several rounds of reviewing: One question in our survey was: “What do you think about a reviewing format with several rounds, but with publication guarantees? That is, the paper is guaranteed to be published (independent of the results), if the authors conduct a further, sound empirical evaluation that improves either internal or external validity.” We got mixed reactions to this suggestion, some stating that only the quality of the conducted study counts:

“Multiple rounds is a good idea, but approving publication must be based on the quality of the research and presentation. It should not be related to the outcome of the study.”,

Others fear a degradation of quality and that authors will abuse this publication guarantee:

“Regrettably, my experience is that some authors will undermine this process. It isn’t viable.”

Nevertheless, while the participants fear that the process will be misused, decreasing research quality, they are not against it in general. Suggestions of our participants to implement this process include providing templates for authors, reviewers, papers, and the process itself.

3) Consequences:

Mismatch: The participants mostly agree that there should be more replication in our field, but they also argued that this is unlikely to happen. A possible summary is that doing or reading a replication is boring and that there is no payoff for either the authors or the reviewers. It seems that there is a certain hypocrisy in that everybody agrees that replications are important, but not many researchers want to conduct, read, and accept them, as two participants nicely stated:

“I think that this is a big problem in our discipline. However, in my experience, people are inclined to say that replications are important but then reject replication studies for not presenting “new” problems/questions.”

“I say “yes” [to accepting replication studies] but, like everyone else I know, I wouldn’t actually like to do so. So it probably won’t happen, even though we pay lip service to it.”

Interestingly, we could observe a related pattern in our survey (see Figure 4, (b) and (c)): The number of participants stating that they would accept a replication (b) is as high as the number of participants stating that fellow reviewers tend to reject a replication (c). We see three possible explanations: First, this contradiction might be caused by the possible selection bias in our sample, in that mostly participants with an affinity to empirical research responded to our survey, who may tend to accept a replication, while the majority who did not respond may tend to reject it. Second, the distribution of answers might indicate a certain mismatch of views or even hypocrisy, in that the participants believe they tend to accept more replications than their fellow reviewers, which, however, might be a biased view. Third, participants have different expectations about how to conduct a replication, leading to disagreement in the review process. In any case, to increase the appreciation of replications, we need to encourage and motivate reviewers to rate them more positively, independently of the novelty of the results.

Incentives for replication: Several participants made suggestions on how to change the current situation to support replication: In essence, we need to create incentives for authors and reviewers. Many participants have the impression that authors do not want to do replications, because they expect difficulties getting them accepted. Also, reviewers do not want to review replications, because there is nothing new to learn. Furthermore, reviewing a replication also means more work for reviewers, who would also have to look at the original study to give an informed recommendation about the quality and delta of the replication. However, program-committee and editorial-board work is already a rather thankless job, and increasing the workload for reviewers is unlikely to improve the situation. Thus, without incentives for authors and reviewers, it is unlikely that we will see more replications.

A suggestion of some participants is to have a special platform for replication. There is already a workshop series specifically for replication: Replication in Empirical Software Engineering Research (RESER). But this does not appear to be sufficient, because our participants stated that we need more replication, and workshops typically have limited impact. Furthermore, a designated workshop series gives reviewers the opportunity to reject a replication that does not have novel or contradicting results, based on the argument that it is out of scope and better fits RESER. This way, unpopular replications are banished from mainstream venues, so that they still face a niche existence. To make replications more accepted, there needs to be a place for them in renowned conferences and journals. This could mean a special track, issue, or paper categories. However, the community is certainly



well advised to be honest with itself: Who wants to attend a session about replication studies, of which the results may be well known? Thus, accepted replications may face a difficult role in conferences. Some of our participants mentioned that replications should be published in journals, whereas conferences are for presenting novel results. A special track at renowned software-engineering venues, such as ICSE, could raise the awareness for the value of replications.

A second suggestion was that there should be standards or guidelines for reviewers and authors on how to rate replications. For example, for a replication, the methodological soundness should have more influence on acceptance than the novelty of the results (i.e., it would be ok to confirm a previous result). This way, we can counteract the expectation of exciting results:

“It depends [...] whether the findings contradict the previous ones [...]”.

Authors could instead focus on the soundness of the study design, and they should provide replication packages, so that a lack of information does not hinder researchers from replicating a study. One of the participants suggested to consult a member of the original team when conducting the replication, because

“[...] many details of the experiment are not properly described or not published.”

In fact, there is an experience report of a group of researchers who originally planned to replicate a study, but, due to the difficulties they encountered (despite conversing with the original team), they could not conduct the replication [20]. Instead, they published their experience with exact replication. Standards on which information to share in which way—also learning from other disciplines—will help authors when replicating other studies.

Third, there are considerable differences in the expectations of the delta a replication must provide. Should a replication study count as such only when conducted by a group different from the original group, or only when it adds new information or improves the methodology? Is it enough to change the sample to different students/subjects or the room/daytime of the conduct? We cannot give an answer to these questions based on the survey, and we do not think that there is a general answer, because software engineering incorporates numerous subfields of different maturity and with different requirements. For example, measuring performance is different from a human-based study on program comprehension. Thus, each subcommunity needs to define its own standards, which need to be communicated clearly to the authors and reviewers.

Finally, our suggestion of multiple reviewing rounds with publication guarantees—given sufficient quality of a study—received mixed reactions, mostly because of the fear of degrading the quality of research and of undermining the whole process. With a well-defined review and publication process, we might mitigate this problem, as one participant stated:

“Sounds interesting but has to be outlined and studied in detail.”

Key insights:
• There is a certain mismatch in the participants’ view on replication studies: Most participants appreciate replications, but see that they are hard to conduct and publish.
• Neither researchers nor reviewers seem to like to conduct, read, or accept replications.
• There is disagreement on the delta a replication must provide.
• Suggestions to improve the situation include setting up special platforms and guidelines for reviewers and authors, which need to be defined and communicated by the respective subcommunity.

VII. FURTHER INSIGHTS

In addition to answering our research objectives, we gained several further insights we want to share with the community.

A. Paper = Experiment?

An interesting point that came up across all questions was whether a study should map 1-to-1 to a paper, or whether there should be an n-to-m mapping, in particular, multiple (replication) studies making up a single, substantial paper. During our analysis, we learned that we and others tend to think of a study and a paper as interchangeable concepts. However, is this really the way to go? One participant asked:

“Excuse me, but are we discussing science and the way it should be done, or how to prepare papers to be accepted?”

This issue indicates that empirical research in software engineering has come to a point where, when designing a study, researchers also think in terms of getting a corresponding paper published. But this is not just a problem in software engineering, but in many more disciplines, as the slogan “publish or perish” describes. A possible solution to this dilemma is exercised by PLOS ONE (http://www.plosone.org/static/publication), in which the evaluation of the worthiness of a result is left to the reader; the purpose of the review process is quality assurance, such that the conclusions drawn from a study are justified.

B. Internal/External Validity vs. Artificiality/Practicality

There seems to be a misconception about the relationship of validity and practicality. We believe that the reasons lie in the close relationship of both concepts: An internally valid study is only rarely realistic, because many confounding parameters need to be controlled for, which easily results in artificial experiment settings. Externally valid studies often are realistic, because the lack of control of confounding parameters can lead to several different values for them (e.g., novice to expert programmers). However, internally valid studies can also be realistic and produce generalizable results, for example, if the selected programming language shares similar properties with other often-used programming languages. Thus, we should avoid equating external validity with practicality, and internal validity with artificiality: If internally valid studies are only seen as artificial, toy examples, or ivory-tower research, it is hard to raise the appreciation for internally valid studies, which
detail.”
hard to raise the appreciation for internally valid studies, which



are an important way to understand effects in depth. Likewise, if only externally valid studies are accepted, how can we ever pinpoint the precise factors causing an observed effect?

C. Software Engineering = Engineering Discipline?

We were surprised by the number of answers stating that participants expect a practical impact of each study, because software engineering is an engineering discipline rather than a science—practically relevant studies are inherent to software engineering. However, there is no reason why software-engineering research should not follow the standards of natural science, where internally valid studies can add valuable information to our knowledge base. Maybe the view of software engineering “solely” as an engineering discipline (which is still discussed [26]) is one reason for the lack of appreciation of internally valid and replication studies?

D. Empirical Research not for its Own Sake

Several participants expressed their concern not to do empirical research only for its own sake. In a world where publishing papers decides over careers, there is certainly the danger that people start conducting (replication) studies just to increase their publication count. Given that replication studies become more accepted in the future, one could imagine that it is quite easy to grab “low-hanging fruit” by replicating existing studies. Which degree of replication is healthy? In some sense, we trade the confidence in our results gained by conducting replication studies against the danger of being swamped by studies that have been conducted only for their own sake.

VIII. THREATS TO VALIDITY

A. Internal Validity

There is a possible selection bias, as it may be that only those program-committee and editorial-board members responded who have experience with empirical research. This could mean that there is a bias toward the awareness of the tradeoff between internal and external validity and toward the appreciation for internally valid and replication studies. Thus, one could assume that the software-engineering community as a whole is less aware and appreciative of these issues. While the selection bias is relevant, it does not affect the big picture that there are many different opinions, about which the researchers are not necessarily aware.

Another threat arises from the Rosenthal effect [24]: The wording of the questions might have influenced the participants, for example, regarding the misconception of external/internal validity vs. artificiality/practicality, and the relation between studies and papers (1:1 or n:m). Hence, both insights might not follow to this extent from our sample, but we believe they would have occurred anyway: Only one participant mentioned that external validity and practicality are not the same, and only one other stated that we should not design studies to get papers accepted.

B. External Validity

Reflecting on the insight about the mapping from studies to papers, we revisited our design decisions for the survey. Admittedly, we also thought about how these decisions would affect acceptance chances, especially contacting only program-committee and editorial-board members, which threatens external validity. We cannot say whether the big picture would change when including further researchers. But, as our goal was to get insights from the “key players”, we sufficiently controlled this threat with respect to the scope of our study.

IX. CONCLUSION

As empirical research has grown common in software engineering, it is time to agree on how it should be conducted and how to address the tradeoff between internal and external validity as well as the role of replications. Reviewing papers that were recently published in major software-engineering venues, we found that 91 % presented an empirical study, but only 54 % discussed threats to validity, and only 23 % differentiated between different kinds of validity. Given that we included EMSE as a major empirical software-engineering journal, this is an alarmingly high number of authors who do not seem to be aware of the threats to the validity of their study.

To get a deeper understanding of the view of the community’s “key players”, we asked program-committee and editorial-board members of major software-engineering venues about their opinions on these and related issues. We found that many reviewers are not aware of the tradeoff between internal and external validity, but, at the same time, have strong opinions on maximizing one kind of validity, which indicates a lack of community standards on conducting and reviewing empirical studies. This leads to the situation that getting a paper accepted is a game of chance rather than a decision based on quality or value added to the community. Interestingly, a considerable number of participants stated that only externally valid studies, best with immediate practical impact, have value.

Regarding the role of replication, we also found a mismatch: Most participants wish to see more replications but, at the same time, are reluctant to conduct, read, or accept them. Apparently, in software engineering, there is a lack of incentives for conducting replication studies (e.g., low impact, low acceptance chance, high effort) and a lack of standards on how to design and review replications (in particular, on the delta of a replication). Thus, software engineering does not seem to be comparable to other engineering or social disciplines. So, we must ask ourselves: How can we shape and promote empirical software engineering if we cannot agree on what it should look like?

Having made these different points of view explicit, we hope that they initiate a discussion in the community and provide a starting point for guidelines and standards of empirical software engineering, both for authors and reviewers.

Finally, we would like to stress that our goal is not to judge or offend any reviewers or authors. On the contrary, we highly appreciate the time and effort the participants took to answer our questions, which documents their interest in this issue.

ACKNOWLEDGMENTS

Thanks to Lutz Prechelt for fruitful comments on this paper. Also thanks to all program-committee and editorial-board members who shared their opinion with us. This work has been supported by the DFG grants AP 206/4, AP 206/5, and AP 206/6.



REFERENCES

[1] V. Basili. Software Modeling and Measurement: The Goal/Question/Metric Paradigm. Technical Report CS-TR-2956 (UMIACS-TR-92-96), University of Maryland at College Park, 1992.
[2] V. Basili, F. Shull, and F. Lanubile. Building Knowledge through Families of Experiments. IEEE Trans. Softw. Eng., 25(4):456–473, 1999.
[3] V. R. Basili, R. W. Selby, and D. Hutchens. Experimentation in Software Engineering. IEEE Trans. Softw. Eng., 12(7):733–743, 1986.
[4] D. Budgen, B. A. Kitchenham, S. M. Charters, M. Turner, P. Brereton, and S. G. Linkman. Presenting Software Engineering Results Using Structured Abstracts: A Randomised Experiment. Empirical Softw. Eng., 13(4):435–468, 2008.
[5] T. Dybå, V. B. Kampenes, and D. Sjøberg. A Systematic Review of Statistical Power in Software Engineering Experiments. J. Information and Software Technology, 48(8):745–755, 2006.
[6] S. Hanenberg. An Experiment about Static and Dynamic Type Systems: Doubts about the Positive Impact of Static Type Systems on Development Time. In Proc. Int’l Conf. Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), pages 22–35. ACM Press, 2010.
[7] M. Höst, B. Regnell, and C. Wohlin. Using Students as Subjects: A Comparative Study of Students and Professionals in Lead-Time Impact Assessment. Empirical Softw. Eng., 5(3):201–214, 2000.
[8] W. Hudson. Card Sorting. In Guide to Advanced Empirical Software Engineering. The Interaction Design Foundation, 2013.
[9] M. Ivarsson and T. Gorschek. A Method for Evaluating Rigor and Industrial Relevance of Technology Evaluations. Empirical Softw. Eng., 16(3):365–395, 2011.
[10] A. Jedlitschka, M. Ciolkowski, and D. Pfahl. Reporting Experiments in Software Engineering. In Guide to Advanced Empirical Software Engineering, pages 201–228. Springer, 2008.
[11] A. Jedlitschka and D. Pfahl. Reporting Guidelines for Controlled Experiments in Software Engineering. In Int’l Symposium Empirical Software Engineering (ISESE), pages 95–104. IEEE CS, 2005.
[12] N. Juristo and A. Moreno. Basics of Software Engineering Experimentation. Kluwer, 2001.
[13] N. Juristo and S. Vegas. The Role of Non-exact Replications in Software Engineering Experiments. Empirical Softw. Eng., 16(3):295–324, 2011.
[14] V. Kampenes, T. Dybå, J. Hannay, and D. Sjøberg. A Systematic Review of Quasi-Experiments in Software Engineering. Information and Software Technology, 51(1):71–82, 2009.
[15] B. Kitchenham, H. Al-Khilidar, M. A. Babar, M. Berry, K. Cox, J. Keung, F. Kurniawati, M. Staples, H. Zhang, and L. Zhu. Evaluating Guidelines
[19] A. Ko, T. LaToza, and M. Burnett. A Practical Guide to Controlled Experiments of Software Engineering Tools with Human Participants. Empirical Softw. Eng., pages 1382–3256, 2013. Online first.
[20] J. Lung, J. Aranda, and S. Easterbrook. On the Difficulty of Replicating Human Subjects Studies in Software Engineering. In Proc. Int’l Conf. Software Engineering (ICSE), pages 191–200. ACM Press, 2008.
[21] M. Nagappan, T. Zimmermann, and C. Bird. Diversity in Software Engineering Research. In Proc. Europ. Software Engineering Conf./Foundations of Software Engineering (ESEC/FSE), pages 466–476. ACM Press, 2013.
[22] D. Parnas. Point: Empirical Research in Software Engineering: A Critical View. IEEE Software, 26(6):56–59, 2009.
[23] T. Roehm, R. Tiarks, R. Koschke, and W. Maalej. How Do Professional Developers Comprehend Software? In Proc. Int’l Conf. Software Engineering (ICSE), pages 255–265. IEEE CS, 2012.
[24] R. Rosenthal and L. Jacobson. Teachers’ Expectancies: Determinants of Pupils’ IQ Gains. Psychological Reports, 19(1):115–118, 1966.
[25] W. Shadish, T. Cook, and D. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin Company, 2002.
[26] M. Shaw. Research Toward an Engineering Discipline for Software. In Proc. Int’l FSE/SDP Workshop on Future of Software Engineering Research (FoSER), pages 337–342. ACM Press, 2010.
[27] F. Shull, J. Carver, S. Vegas, and N. Juristo. The Role of Replications in Empirical Software Engineering. Empirical Softw. Eng., 13(2):211–218, 2008.
[28] J. Siegmund and J. Schumann. Confounding Parameters on Program Comprehension: A Literature Survey. Empirical Softw. Eng., pages 1–34, 2014. Online first.
[29] D. Sjøberg, B. Anda, E. Arisholm, T. Dybå, M. Jørgensen, A. Karahasanovic, E. F. Koren, and M. Vokác. Conducting Realistic Experiments in Software Engineering. In Proc. Int’l Symposium Empirical Software Engineering and Measurement (ESEM), pages 17–26. IEEE CS, 2002.
[30] D. Sjøberg, T. Dybå, and M. Jørgensen. The Future of Empirical Methods in Software Engineering Research. In Future of Software Engineering, pages 358–378. IEEE CS, 2007.
[31] D. Sjøberg, J. Hannay, O. Hansen, V. B. Kampenes, A. Karahasanovic, N.-K. Liborg, and A. Rekdal. A Survey of Controlled Experiments in Software Engineering. IEEE Trans. Softw. Eng., 31(9):733–753, 2005.
[32] M. Svahnberg, A. Aurum, and C. Wohlin. Using Students as Subjects: An Empirical Evaluation. In Proc. Int’l Symposium Empirical Software Engineering and Measurement (ESEM), pages 288–290. ACM Press, 2008.
[33] W. Tichy. Hints for Reviewing Empirical Work in Software Engineering.
for Reporting Empirical Software Engineering Studies. Empirical Softw. Empirical Softw. Eng., 5(4):309–312, 2000.
Eng., 13(1):97–121, 2008. [34] W. Tichy, P. Lukowicz, L. Prechelt, and E. Heinz. Experimental
[16] B. Kitchenham, P. Brereton, D. Budgen, M. Turner, J. Bailey, and Evaluation in Computer Science: A Quantitative Study. Journal of
S. Linkman. Systematic Literature Reviews in Software Engineering: A Systems and Software, 28(1):9–18, 1995.
Systematic Literature Review. J. Information and Software Technology, [35] W. F. Tichy. Should Computer Scientists Experiment More? Computer,
51(1):7–15, 2009. 31(5):32–40, 1998.
[17] B. Kitchenham, P. Brereton, Z. Li, D. Budgen, and A. Burn. Repeatability [36] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, and A. Wesslén.
of Systematic Literature Reviews. In Proc. Int’l Conf. Evaluation and Experimentation in Software Engineering: An Introduction. Kluwer
Assessment in Software Engineering (EASE), pages 46–55. IET Software, Academic Publishers, 2000.
2011. [37] C. Zannier, G. Melnik, and F. Maurer. On the Success of Empirical
[18] B. Kitchenham and S. Charters. Guidelines for Performing Systematic Studies in the International Conference on Software Engineering. In
Literature Reviews in Software Engineering. Technical Report EBSE Proc. Int’l Conf. Software Engineering (ICSE), pages 341–350. ACM
2007-001, Keele University and Durham University Joint Report, 2007. Press, 2006.

ICSE 2015, Florence, Italy
