Guidelines for Performing Systematic Literature Reviews in Software Engineering
Version 2.3
9 July 2007
© Kitchenham, 2007
0. Document Control Section
0.1 Contents
6.3.4 Limitations of Quality Assessment
6.4 Data Extraction
6.4.1 Design of Data Extraction Forms
6.4.2 Contents of Data Collection Forms
6.4.3 Data extraction procedures
6.4.4 Multiple publications of the same data
6.4.5 Unpublished data, missing data and data requiring manipulation
6.4.6 Lessons learned about Data Extraction
6.5 Data Synthesis
6.5.1 Descriptive (Narrative) synthesis
6.5.2 Quantitative Synthesis
6.5.3 Presentation of Quantitative Results
6.5.4 Qualitative Synthesis
6.5.5 Synthesis of qualitative and quantitative studies
6.5.6 Sensitivity analysis
6.5.7 Publication bias
6.5.8 Lessons Learned about Data Synthesis
7. Reporting the review (Dissemination)
7.1 Specifying the Dissemination Strategy
7.2 Formatting the Main Systematic Review Report
7.3 Evaluating Systematic Review Reports
7.4 Lessons Learned about Reporting Systematic Literature Reviews
8 Systematic Mapping Studies
9 Final remarks
10 References
Appendix 1 Steps in a systematic review
Appendix 2 Software Engineering Systematic Literature Reviews
Appendix 3 Protocol for a Tertiary study of Systematic Literature Reviews and Evidence-based Guidelines in IT and Software Engineering
0.2 Document Version Control

… review construction process. Minor restructuring – mapping reviews and tertiary reviews moved into Section 3 to avoid interfering with the flow of the guidelines.

Further minor revisions (Version 2.2, 4 April 2007): Typos and grammatical corrections. A paragraph on how to read the guidelines included in the Introduction.

Revisions after external review (Version 2.3, 20 July 2007): Amendments after external review, including the introduction of more examples.
0.3 Document development team
0.4 Executive Summary
The guidelines presented in this report were derived from three existing guidelines
used by medical researchers, two books produced by researchers with social science
backgrounds and discussions with researchers from other disciplines who are involved
in evidence-based practice. The guidelines have been adapted to reflect the specific
problems of software engineering research.
The guidelines cover three phases of a systematic literature review: planning the
review, conducting the review and reporting the review. They provide a relatively
high-level description. They do not consider the impact of the research questions on
the review procedures, nor do they specify in detail the mechanisms needed to
perform meta-analysis.
0.5 Glossary
Secondary study. A study that reviews all the primary studies relating to a specific research question, with the aim of integrating/synthesising the evidence related to that question.
Systematic mapping study (also referred to as a scoping study). A broad review of
primary studies in a specific topic area that aims to identify what evidence is available
on the topic.
Tertiary study (also called a tertiary review). A review of secondary studies related to
the same research question.
1. Introduction
This document presents general guidelines for undertaking systematic reviews. The
goal of this document is to introduce the methodology for performing rigorous
reviews of current empirical evidence to the software engineering community. It is
aimed primarily at software engineering researchers including PhD students. It does
not cover details of meta-analysis (a statistical procedure for synthesising quantitative
results from different studies), nor does it discuss the implications that different types
of systematic review questions have on research procedures.
The original impetus for employing systematic literature review practice was to
support evidence-based medicine, and many guidelines reflect this viewpoint. This
document attempts to construct guidelines for performing systematic literature
reviews that are appropriate to the needs of software engineering researchers. It
discusses a number of issues where software engineering research differs from
medical research. In particular, software engineering includes relatively little empirical research compared with the medical domain; the research methods used by software engineers are generally not as rigorous as those used by medical researchers;
and much empirical data in software engineering is proprietary.
1.2 The Guideline Construction Process
Throughout the guidelines we have incorporated examples taken from two recently
published systematic literature reviews [21] and [17]. Kitchenham et al. [21] addressed
the issue of whether it was possible to use cross-company benchmarking datasets to
produce estimation models suitable for use in a commercial company. Jørgensen [17]
investigated the use of expert judgement, formal models and combinations of the two
approaches when estimating software development effort. In addition, Appendix 2
provides a list of published systematic literature reviews assessed as high quality by the
authors of this report. These SLRs were identified and assessed as part of a systematic
literature review of recent software engineering SLRs. The protocol for the review is
documented in Appendix 3.
These guidelines are aimed at software engineering researchers, PhD students, and
practitioners who are new to the concept of performing systematic literature reviews.
Readers who are unsure about what a systematic literature review is should start by
reading Section 2.
Readers who understand the principles of a systematic literature review can skip to
Section 4 to get an overview of the systematic literature review process. They should
then concentrate on Sections 5, 6 and 7, which describe in detail how to perform each
review phase. Sections 3 and 8 provide ancillary information that can be omitted on
first reading.
Readers who have more experience in performing systematic reviews may find the list
of tasks in Section 4, the quality checklists in Tables 5 and 6 and the reporting
structure presented in Table 7 sufficient for their needs.
Readers with detailed methodological queries are unlikely to find answers in this
document. They may find some of the references useful.
There are many reasons for undertaking a systematic literature review. The most
common reasons are:
• To summarise the existing evidence concerning a treatment or technology e.g. to
summarise the empirical evidence of the benefits and limitations of a specific
agile method.
• To identify any gaps in current research in order to suggest areas for further
investigation.
• To provide a framework/background in order to appropriately position new
research activities.
However, systematic literature reviews can also be undertaken to examine the extent
to which empirical evidence supports/contradicts theoretical hypotheses, or even to
assist the generation of new hypotheses (see for example [14]).
Most research starts with a literature review of some sort. However, unless a literature
review is thorough and fair, it is of little scientific value. This is the main rationale for
undertaking systematic reviews. A systematic review synthesises existing work in a
manner that is fair and seen to be fair. For example, systematic reviews must be
undertaken in accordance with a predefined search strategy. The search strategy must
allow the completeness of the search to be assessed. In particular, researchers
performing a systematic review must make every effort to identify and report research
that does not support their preferred research hypothesis as well as identifying and
reporting research that supports it.
Systematic literature reviews in all disciplines allow us to stand on the shoulders of giants and, in computing, allow us to get off each other's feet.
Some of the features that differentiate a systematic review from a conventional expert
literature review are:
• Systematic reviews start by defining a review protocol that specifies the research
question being addressed and the methods that will be used to perform the review.
• Systematic reviews are based on a defined search strategy that aims to detect as
much of the relevant literature as possible.
• Systematic reviews document their search strategy so that readers can assess their
rigour and the completeness and repeatability of the process (bearing in mind that
searches of digital libraries are almost impossible to replicate).
• Systematic reviews require explicit inclusion and exclusion criteria to assess each
potential primary study.
• Systematic reviews specify the information to be obtained from each primary
study including quality criteria by which to evaluate each primary study.
• A systematic review is a prerequisite for quantitative meta-analysis.
There are two other types of review that complement systematic literature reviews:
systematic mapping studies and tertiary reviews.
If the topic of interest is very broad, a systematic mapping study may be a more appropriate exercise than a systematic review.
Budgen et al. [6] interviewed practitioners in a number of domains that use evidence-based approaches to research, and compared their research practices with those of
software engineering. Table 1 shows the results of their assessment of the similarity
between software engineering research practices and those of other domains. It shows
that software engineering is much more similar to the Social Sciences than it is to
medicine. This similarity is due to experimental practices, subject types and blinding
procedures. Within Software Engineering it is difficult to conduct randomised
controlled trials or to undertake double blinding. In addition, human expertise and variability among human subjects both affect the outcomes of experiments.
These factors mean that software engineering is significantly different from the
traditional medical arena in which systematic reviews were first developed. For this
reason we have revised these guidelines to incorporate recent ideas from the area of
social science ([25], [11]). In addition, the choice of references on which to base these
guidelines was informed by our discussions with researchers in these disciplines.
This document summarises the stages in a systematic review into three main phases:
Planning the Review, Conducting the Review, Reporting the Review.
The stages listed above may appear to be sequential, but it is important to recognise
that many of the stages involve iteration. In particular, many activities are initiated
during the protocol development stage, and refined when the review proper takes
place. For example:
• The selection of primary studies is governed by inclusion and exclusion criteria.
These criteria are initially specified when the protocol is drafted but may be
refined after quality criteria are defined.
• Data extraction forms initially prepared during construction of the protocol will
be amended when quality criteria are agreed.
• Data synthesis methods defined in the protocol may be amended once data has
been collected.
The systematic reviews road map prepared by the Systematic Reviews Group at
Berkeley demonstrates the iterative nature of the systematic review process very
clearly [24].
5. Planning
Prior to undertaking a systematic review it is necessary to confirm the need for such a
review. In some circumstances systematic reviews are commissioned and in such
cases a commissioning document needs to be written. However, the most important
pre-review activities are defining the research question(s) that the systematic review
will address and producing a review protocol (i.e. plan) defining the basic review
procedures. The review protocol should also be subject to an independent evaluation
process. This is particularly important for a commissioned review.
The need for a systematic review arises from the requirement of researchers to
summarise all existing information about some phenomenon in a thorough and
unbiased manner. This may be in order to draw more general conclusions about some
phenomenon than is possible from individual studies, or may be undertaken as a
prelude to further research activities.
Examples
Kitchenham et al. [21] argued that accurate cost estimation is important for the software
industry; that accurate cost estimation models rely on past project data; that many companies
cannot collect enough data to construct their own models. Thus, it is important to know
whether models developed from data repositories can be used to predict costs in a specific
company. They noted that a number of studies have addressed that issue but have come to
different conclusions. They concluded that it is necessary to determine whether, or under
what conditions, models derived from data repositories can support estimation in a specific
company.
Jørgensen [17] pointed out that, although most software cost estimation research concentrates on formal cost estimation models and a large number of IT managers know about tools that implement formal models, most industrial cost estimation is based on expert
judgement. He argued that researchers need to know whether software professionals are
simply irrational, or whether expert judgement is just as accurate as formal models or has
other advantages that make it more acceptable than formal models.
In both cases the authors had undertaken research in the topic area and had first hand
knowledge of the research issues.
• What were the inclusion/exclusion criteria and how were they applied?
• What criteria were used to assess the quality of primary studies?
• How were quality criteria applied?
• How were the data extracted from the primary studies?
• How were the data synthesised?
• How were differences between studies investigated?
• How were the data combined?
• Was it reasonable to combine the studies?
• Do the conclusions flow from the evidence?
1. Are the review’s inclusion and exclusion criteria described and appropriate?
2. Is the literature search likely to have covered all relevant studies?
3. Did the reviewers assess the quality/validity of the included studies?
4. Were the basic data/studies adequately described?
Examples
We applied the DARE criteria both to Kitchenham et al.’s study [21] and to Jørgensen’s study
[17]. We gave Kitchenham et al.’s study a score of 4 and Jørgensen’s study a score of 3.5.
Other studies scored using the DARE criteria are listed in Appendix 2.
From a more general viewpoint, Greenhalgh [12] suggests the following questions:
• Can you find an important clinical question, which the review addressed?
(Clearly, in software engineering, this should be adapted to refer to an important
software engineering question.)
• Was a thorough search done of the appropriate databases and were other
potentially important sources explored?
• Was methodological quality assessed and the trials weighted accordingly?
• How sensitive are the results to the way that the review has been done?
• Have numerical results been interpreted with common sense and due regard to the
broader aspects of the problem?
Sometimes an organisation requires information about a specific topic but does not have the time or expertise to perform a systematic literature review itself. In such cases it will
commission researchers to perform a systematic literature review of the topic. When
this occurs the organisation must produce a commissioning document specifying the
work required.
• Project Title
• Background
• Review Questions
• Advisory/Steering Group Membership (Researchers, Practitioners, Lay
members, Policy Makers etc)
• Methods of the review
• Project Timetable
• Dissemination Strategy
• Support Infrastructure
• Budget
• References
The commissioning document can be used both to solicit tenders from research
groups willing to undertake the review and to act as a steering document for the
advisory group to ensure that the review remains focused and relevant.
The commissioning phase of a systematic review is not required for a research team
undertaking a review for their own needs or for one being undertaken by a PhD
student. If the commissioning stage is not undertaken then the dissemination strategy
should be incorporated into the review protocol. As yet, there are no examples of
commissioned SLRs in the software engineering domain.
Specifying the research questions is the most important part of any systematic review.
The review questions drive the entire systematic review methodology:
• The search process must identify primary studies that address the research
questions.
• The data extraction process must extract the data items needed to answer the
questions.
• The data analysis process must synthesise the data in such a way that the
questions can be answered.
In software engineering, it is not clear what the equivalent of a diagnostic test would
be, but the other questions can be adapted to software engineering issues as follows:
• Assessing the effect of a software engineering technology.
• Assessing the frequency or rate of a project development factor such as the
adoption of a technology, or the frequency or rate of project success or failure.
• Identifying cost and risk factors associated with a technology.
• Identifying the impact of technologies on reliability, performance and cost
models.
• Cost benefit analysis of employing specific software development technologies or
software applications.
Medical guidelines often provide different advice and procedures for different types of question. This document does not go to this level of detail.
The critical issue in any systematic review is to ask the right question. In this context,
the right question is usually one that:
• Is meaningful and important to practitioners as well as researchers. For example,
researchers might be interested in whether a specific analysis technique leads to a
significantly more accurate estimate of remaining defects after design inspections.
However, a practitioner might want to know whether adopting a specific analysis
technique to predict remaining defects is more effective than expert opinion at
identifying design documents that require re-inspection.
• Will lead either to changes in current software engineering practice or to
increased confidence in the value of current practice. For example, researchers
and practitioners would like to know under what conditions a project can safely
adopt agile technologies and under what conditions it should not.
• Will identify discrepancies between commonly held beliefs and reality.
Nonetheless, there are systematic reviews that ask questions that are primarily of
interest to researchers. Such reviews ask questions that identify and/or scope future
research activities. For example, a systematic review in a PhD thesis should identify
the existing basis for the research student’s work and make it clear where the
proposed research fits into the current body of knowledge.
Examples
In both cases, the authors were aware from previous research that results were mixed, so in
each case they added a question aimed at investigating the conditions under which different
results are obtained.
• The interventions, which are usually a comparison between two or more
alternative treatments.
• The outcomes, i.e. the clinical and economic factors that will be used to compare
the interventions.
More recently Petticrew and Roberts suggest using the PICOC (Population,
Intervention, Comparison, Outcome, Context) criteria to frame research questions
[25]. These criteria extend the original medical guidelines with:
Comparison: i.e. what is the intervention being compared with?
Context: i.e. what is the context in which the intervention is delivered?
Population
In software engineering experiments, the populations might be any of the following:
• A specific software engineering role e.g. testers, managers.
• A category of software engineer, e.g. a novice or experienced engineer.
• An application area e.g. IT systems, command and control systems.
• An industry group such as Telecommunications companies, or Small IT
companies.
A question may refer to very specific population groups e.g. novice testers, or
experienced software architects working on IT systems. In medicine the populations
are defined in order to reduce the number of prospective primary studies. In software
engineering far fewer primary studies are undertaken, thus, we may need to avoid any
restriction on the population until we come to consider the practical implications of
the systematic review.
Intervention
The intervention is the software methodology/tool/technology/procedure that
addresses a specific issue, for example, technologies to perform specific tasks such as
requirements specification, system testing, or software cost estimation.
Comparison
This is the software engineering methodology/tool/technology/procedure with which
the intervention is being compared. When the comparison technology is the
conventional or commonly-used technology, it is often referred to as the “control”
treatment. The control situation must be adequately described. In particular “not using
the intervention” is inadequate as a description of the control treatment. Software
engineering techniques usually require training. If you compare people using a
technique with people not using a technique, the effect of the technique is confounded
with the effect of training. That is, any effect might be due to the training provided rather than the specific technique. This is a particular problem if the participants are students.
Outcomes
Outcomes should relate to factors of importance to practitioners such as improved
reliability, reduced production costs, and reduced time to market. All relevant
outcomes should be specified. For example, in some cases we require interventions
that improve some aspect of software production without affecting another e.g.
improved reliability with no increase in cost.
Context
For Software Engineering, this is the context in which the comparison takes place
(e.g. academia or industry), the participants taking part in the study (e.g. practitioners,
academics, consultants, students), and the tasks being performed (e.g. small scale,
large scale). Many software experiments take place in academia using student
participants and small scale tasks. Such experiments are unlikely to be representative
of what might occur with practitioners working in industry. Some systematic reviews
might choose to exclude such experiments although in software engineering, these
may be the only type of studies available.
Experimental designs
In medical studies, researchers may be able to restrict systematic reviews to primary
studies of one particular type. For example, Cochrane reviews are usually restricted to
randomised controlled trials (RCTs). In other circumstances, the nature of the
question and the central issue being addressed may suggest that certain study designs
are more appropriate than others. However, this approach can only be taken in a
discipline where the large number of research papers is a major problem. In software
engineering, the paucity of primary studies is more likely to be the problem for
systematic reviews and we are more likely to need protocols for aggregating
information from studies of widely different types.
Examples
Kitchenham et al. [21] used the PICO criteria and defined the question elements as:
Population: software or Web project.
Intervention: cross-company project effort estimation model.
Comparison: single-company project effort estimation model.
Outcomes: prediction or estimate accuracy.
Jørgensen [17] did not use a structured version of his research questions.
A review protocol specifies the methods that will be used to undertake a specific
systematic review. A pre-defined protocol is necessary to reduce the possibility of
researcher bias. For example, without a protocol, it is possible that the selection of
individual studies or the analysis may be driven by researcher expectations. In
medicine, review protocols are usually submitted to peer review.
The components of a protocol include all the elements of the review plus some
additional planning information:
• Background. The rationale for the survey.
• The research questions that the review is intended to answer.
• The strategy that will be used to search for primary studies including search terms
and resources to be searched. Resources include digital libraries, specific journals,
and conference proceedings. An initial mapping study can help determine an
appropriate strategy.
• Study selection criteria. Study selection criteria are used to determine which
studies are included in, or excluded from, a systematic review. It is usually
helpful to pilot the selection criteria on a subset of primary studies.
• Study selection procedures. The protocol should describe how the selection
criteria will be applied e.g. how many assessors will evaluate each prospective
primary study, and how disagreements among assessors will be resolved.
• Study quality assessment checklists and procedures. The researchers should
develop quality checklists to assess the individual studies. The purpose of the
quality assessment will guide the development of checklists.
• Data extraction strategy. This defines how the information required from each
primary study will be obtained. If the data require manipulation or assumptions
and inferences to be made, the protocol should specify an appropriate validation
process.
• Synthesis of the extracted data. This defines the synthesis strategy. This should
clarify whether or not a formal meta-analysis is intended and if so what
techniques will be used.
• Dissemination strategy (if not already included in a commissioning document).
• Project timetable. This should define the review schedule.
The protocol is a critical element of any systematic review. Researchers must agree a
procedure for evaluating the protocol. If appropriate funding is available, a group of
independent experts should be asked to review the protocol. The same experts can
later be asked to review the final report.
PhD students should present their protocol to their supervisors for review and
criticism.
The basic SLR review questions discussed in Section 5.1 can be adapted to assist the
evaluation of a systematic review protocol. In addition, the internal consistency of the
protocol can be checked to confirm that:
• The search strings are appropriately derived from the research questions.
• The data to be extracted will properly address the research question(s).
• The data analysis procedure is appropriate to answer the research questions.
5.6 Lessons learned for protocol construction
Brereton et al. [5] identify a number of issues that researchers should anticipate during
protocol construction:
• A pre-review mapping study may help in scoping research questions.
• Expect to revise questions during protocol development, as understanding of the
problem increases.
• All the systematic review team members need to take an active part in
developing the review protocol, so they understand how to perform the data
extraction process.
• Piloting the research protocol is essential. It will find mistakes in the data
collection and aggregation procedures. It may also indicate the need to change
the methodology intended to address the research questions including amending
the data extraction forms and synthesis methods.
Staples and Niazi [27] recommend limiting the scope of a systematic literature review by choosing clear and narrow research questions.
Once the protocol has been agreed, the review proper can start. However, as noted
previously, researchers should expect to try out each of the steps described in this
section when they construct their research protocol.
The aim of a systematic review is to find as many primary studies relating to the
research question as possible using an unbiased search strategy. The rigour of the
search process is one factor that distinguishes systematic reviews from traditional
reviews.
A general approach is to break down the question into individual facets i.e.
population, intervention, comparison, outcomes, context, study designs as discussed
in Section 5.3.2. Then draw up a list of synonyms, abbreviations, and alternative
spellings. Other terms can be obtained by considering subject headings used in
journals and databases. Sophisticated search strings can then be constructed using
Boolean ANDs and ORs.
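To make the mechanics concrete, the facet lists can be assembled into a search string programmatically. The following is a minimal sketch, not part of the original guidelines; the facets and terms shown are invented examples, and a real review would substitute the synonym lists developed from its own research questions.

```python
# Minimal sketch: build a Boolean search string from facet synonym lists.
# The facets and terms below are invented examples, not a recommended set.

facets = {
    "population": ["software", "application", "Web project"],
    "intervention": ["cross company", "cross organisation", "cross organization"],
    "outcome": ["accuracy", "mean magnitude relative error"],
}

def or_clause(terms):
    """Quote multi-word terms and join a facet's synonyms with OR."""
    quoted = ['"{}"'.format(t) if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

# Facets are linked with AND; synonyms within a facet with OR.
search_string = " AND ".join(or_clause(terms) for terms in facets.values())
print(search_string)
# (software OR application OR "Web project") AND ("cross company" OR ...) AND ...
```

As the Kitchenham et al. example later in this section shows, strings built this way usually still need manual adaptation to each digital library's query syntax.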
Initial searches for primary studies can be undertaken using digital libraries but this is
not sufficient for a full systematic review. Other sources of evidence must also be
searched (sometimes manually) including:
• Reference lists from relevant primary studies and review articles
• Journals (including company journals such as the IBM Journal of Research and
Development), grey literature (i.e. technical reports, work in progress) and
conference proceedings
• Research registers
• The Internet.
A problem for software engineering SLRs is that there may be relatively few studies
on a particular topic. In such cases it may be a good idea to look for studies in related
disciplines for example, sociology for group working practices, and psychology for
notation design and/or problem solving approaches.
Example
Jørgensen [16] investigated when we can expect expert estimates to have acceptable
accuracy in comparison with formal models by reviewing relevant human judgement studies
(e.g. time estimation studies) and comparing their results with the results of software
engineering studies.
Publication bias can lead to systematic bias in systematic reviews unless special
efforts are made to address this problem. Many of the standard search strategies
identified above are used to address this issue including:
• Scanning the grey literature
• Scanning conference proceedings
• Contacting experts and researchers working in the area and asking them if they
know of any unpublished results.
Once reference lists have been finalised, the full articles of potentially useful studies need to be obtained. A logging system is needed to make sure all relevant studies are obtained.
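A minimal sketch of such a logging system, assuming a flat CSV file is adequate; the field names are illustrative only and should be adapted to the review protocol:

```python
# Minimal sketch: a CSV log of candidate primary studies and retrieval status.
# Field names are illustrative; adapt them to the review protocol.
import csv
import os

FIELDS = ["id", "title", "source", "search_date", "full_text_obtained", "notes"]

def append_to_log(path, record):
    """Append one candidate study to the log, writing the header on first use."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(record)

append_to_log("study_log.csv", {
    "id": "S001",
    "title": "An example candidate study",
    "source": "reference list of [21]",
    "search_date": "2007-07-09",
    "full_text_obtained": "no",
    "notes": "requested via inter-library loan",
})
```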
6.1.5 Lessons learned for Search Procedures
Brereton et al. [5] identify several issues that need to be addressed when specifying
electronic search procedures:
• There are alternative search strategies that enable you to achieve different sorts
of search completion criteria. You must select and justify a search strategy that
is appropriate for your research question. For example, knowing the publication
date of the first article on a specific topic restricts the years that need to be
searched. Also, if you are going to restrict your search to specific journals and
conference proceedings this needs to be justified.
• We need to search many different electronic sources; no single source finds all
the primary studies.
• Current software engineering search engines are not designed to support
systematic literature reviews. Unlike medical researchers, software engineering
researchers need to perform resource-dependent searches.
Examples
Kitchenham et al. [21] used their structured questions to construct search strings for use with
electronic databases. They identified synonyms and alternative spellings for each of the question elements and linked them using the Boolean OR, e.g.:
Population: software OR application OR product OR Web OR WWW OR Internet OR World-
Wide Web OR project OR development
Intervention: cross company OR cross organisation OR cross organization OR multiple-
organizational OR multiple-organisational model OR modeling OR modelling effort OR cost
OR resource estimation OR prediction OR assessment
Contrast: within-organisation OR within-organization OR within-organizational OR within-
organisational OR single company OR single organisation
Outcome: Accuracy OR Mean Magnitude Relative Error
The search strings were constructed by linking the four OR lists using the Boolean AND.
The search strings needed to be adapted to suit the specific requirements of the different databases. In addition, the researchers searched several individual journal (J) and conference proceedings (C) sources:
• Empirical Software Engineering (J)
• Information and Software Technology (J)
• Software Process Improvement and Practice (J)
• Management Science (J)
• International Software Metrics Symposium (C)
• International Conference on Software Engineering (C)
• Evaluation and Assessment in Software Engineering (manual search) (C)
These sources were chosen because they had published papers on the topic.
In addition, Kitchenham et al. checked the references of each relevant article and approached
researchers who published on the topic to ask whether they had published (or were in the
process of publishing) any other articles on the topic.
Jørgensen [17] used an existing database of journal papers that he had identified for another
review (Jørgensen and Shepperd [15]). Jørgensen and Shepperd manually searched all
volumes of over 100 journals for papers on software cost estimation. The journals were
identified by reading reference lists of cost estimation papers, searching the Internet, and the researchers' own experience. Individual papers were categorised and recorded in a publicly available database (www.simula.no\BESTweb).
For conference papers, Jørgensen searched papers identified by the INSPEC database using
the following search string:
He also contacted authors of the relevant papers and was made aware of another relevant
paper.
Kitchenham et al. used the procedure recommended by most guidelines for performing systematic reviews. However, it resulted in extremely long search strings that needed to be
adapted to specific search engines. Jørgensen [17] used a database previously constructed
for a wide survey of software cost estimation. This is an example of how valuable a mapping
study can be. He also used a fairly simple search string on the INSPEC database.
Kitchenham et al. attempted to produce a search string that was very specific to their research
question but they still found a large number of false positives. In practice, a simpler search
string might have been just as effective.
It is important to note that neither study based its search process solely on searching digital
libraries. Both studies had very specific research questions and the researchers were aware
that the number of papers addressing the topic would be small. Thus, both studies tried hard
to undertake a comprehensive search.
Once the potentially relevant primary studies have been obtained, they need to be
assessed for their actual relevance.
Inclusion and exclusion criteria should be based on the research question. They
should be piloted to ensure that they can be reliably interpreted and that they classify
studies correctly.
Examples
Kitchenham et al. used the following inclusion criteria:
• any study that compared predictions of cross-company models with within-
company models based on analysis of single company project data.
They used the following exclusion criteria:
• studies where projects were only collected from a small number of different sources
(e.g. 2 or 3 companies),
• studies where models derived from a within-company data set were compared with
predictions from a general cost estimation model.
Jørgensen [17] included papers that compare judgment-based and model-based software
development effort estimation. He also excluded one relevant paper due to “incomplete
information about how the estimates were derived”.
Issues:
• Medical standards emphasise that it is important to avoid, as far as possible,
exclusions based on the language of the primary study. This may not be so
important for Software Engineering.
• It is possible that inclusion decisions could be affected by knowledge of the
authors, institutions, journals or year of publication. Some medical researchers
have suggested reviews should be done after such information has been removed.
However, it takes time to do this and experimental evidence suggests that
masking the origin of primary studies does not improve reviews [4].
The next step is to apply inclusion/exclusion criteria based on practical issues [11]
such as:
• Language
• Journal
• Authors
• Setting
• Participants or subjects
• Research Design
• Sampling method
• Date of publication.
Staples and Niazi point out that it is sometimes necessary to consider the questions
that are not being addressed in order to refine your exclusion criteria [27].
Example
Staples and Niazi’s research question was
• Why do organizations embark on CMM-based SPI initiatives?
They also defined complementary research questions that were not being investigated:
• What motivates individuals to support the adoption of CMM-based SPI in an
organization?
• Why should organizations embark on CMM-based SPI initiatives?
• What reasons for embarking on CMM-based SPI are the most important to
organizations?
• What benefits have organizations received from CMM-based SPI initiatives?
• How do organizations decide to embark on CMM-based SPI initiatives?
• What problems do organizations have at the time that they decide to adopt CMM-based
SPI?
This clarified the boundaries of their research question of interest: for example, they were concerned with the motivations of organisations, not the motivations of individuals, and with why organisations rejected CMM, not why they adopted it. They found that this process directly improved and clarified their primary study selection and data extraction process.
Most general SLR text books recommend maintaining a list of excluded studies identifying the reason for exclusion. However, in our experience, initial electronic searches result in large numbers of totally irrelevant papers, i.e. papers that not only do not address any aspect of the research questions but do not even have anything to do with software engineering. We therefore recommend maintaining a list of excluded papers only after the totally irrelevant papers have been removed; in particular, researchers should keep a record of those candidate primary studies that are excluded as a result of applying the more detailed inclusion/exclusion criteria.
A single researcher (such as a PhD student) should consider discussing included and
excluded papers with their advisor, an expert panel or other researchers. Alternatively,
individual researchers can apply a test-retest approach, and re-evaluate a random
sample of the primary studies found after initial screening to check the consistency of
their inclusion/exclusion decisions.
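One way to quantify the consistency of such checks is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The guidelines do not prescribe a particular statistic, so the following is only a sketch of one reasonable option:

```python
# Minimal sketch: Cohen's kappa for two passes of include/exclude decisions
# over the same sample of candidate papers (test-retest or two researchers).

def cohens_kappa(decisions_a, decisions_b):
    """decisions_a, decisions_b: equal-length lists of 'include'/'exclude'."""
    assert len(decisions_a) == len(decisions_b)
    n = len(decisions_a)
    categories = set(decisions_a) | set(decisions_b)
    # Observed agreement: proportion of papers given identical decisions.
    p_o = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    # Chance agreement, computed from each pass's marginal proportions.
    p_e = sum(
        (decisions_a.count(c) / n) * (decisions_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

first_pass = ["include", "exclude", "exclude", "include", "exclude"]
second_pass = ["include", "exclude", "include", "include", "exclude"]
print(round(cohens_kappa(first_pass, second_pass), 2))  # 0.62; 1.0 = perfect
```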
• As a means of weighting the importance of individual studies when results are
being synthesised.
• To guide the interpretation of findings and determine the strength of inferences.
• To guide recommendations for further research.
Most quality checklists (see Section 6.3.2) include questions aimed at assessing the
extent to which articles have addressed bias and validity.
Recently, Petticrew and Roberts [25] have suggested that the idea of a fixed hierarchy of study designs is too simplistic.
They point out that some types of design are better than others at addressing different
types of question. For example, qualitative studies are more appropriate than
randomised experiments for assessing whether practitioners find a new technology
appropriate for the type of applications they have to build. Thus, if we want to restrict
ourselves to studies of a specific type we should restrict ourselves to studies that are
best suited to addressing our specific research questions.
This is an issue that needs to be taken seriously in software engineering, where much of our research on topics such as software cost estimation and project success factors is based on correlation studies. Good observational studies need to consider possible confounding
effects, put in place methods to measure them and adjust any analyses to allow for
their effect. In particular, they need to include sensitivity analysis to investigate the
impact of measured and unmeasured confounders.
Checklists are usually derived from a consideration of factors that could bias study
results. The CRD Guidelines [19], the Australian National Health and Medical
Research Council Guidelines [1], and the Cochrane Reviewers’ Handbook [7] all refer
to four types of bias shown in Table 4. (We have amended the definitions (slightly)
and protection mechanisms (considerably) to address software engineering rather than
medicine.) In particular, medical researchers rely on “blinding” subjects and
experimenters (i.e. making sure that neither the subject nor the researcher knows
which treatment a subject is assigned to) to address performance and measurement
bias. However, that protocol is usually impossible for software engineering
experiments.
Checklists are also developed by considering bias and validity problems that can
occur at the different stages in an empirical study:
• Design
• Conduct
• Analysis
• Conclusions.
There are many published quality checklists for different types of empirical study.
The medical guidelines all provide checklists aimed at assisting the quality
assessment undertaken during a systematic literature review as do Fink [11] and
Petticrew and Roberts [25]. In addition, Crombie [10] and Greenhalgh [12] also
provide checklists aimed at assisting a reader to evaluate a specific article. Shaddish et
al. [25] discuss quasi-experimental designs and provide an extensive summary of
validity issues affecting them. However, each source identifies a slightly different set
of questions and there is no standard agreed set of questions.
For quantitative studies we have accumulated a list of questions from [10], [11], [12],
[19] and [25] and organised them with respect to study stage and study type (see
Table 5). We do not suggest that anyone uses all the questions. Researchers should
adopt Fink’s suggestion [11] which is to review the list of questions in the context of
their own study and select those quality evaluation questions that are most appropriate
for their specific research questions. They may need to construct a measurement scale
for each item since sometimes a simple Yes/No answer may be misleading. Whatever
form the quality instrument takes, it should be assessed for reliability and usability
during the trials of the study protocol before being applied to all the selected studies.
Examples
Kitchenham et al. [21] constructed a quality questionnaire based on 5 issues affecting the
quality of the study which were scored to provide an overall measure of study quality:
1. Is the data analysis process appropriate?
1.1 Was the data investigated to identify outliers and to assess distributional properties
before analysis?
1.2 Was the result of the investigation used appropriately to transform the data and select
appropriate data points?
2. Did studies carry out a sensitivity or residual analysis?
2.1 Were the resulting estimation models subject to sensitivity or residual analysis?
2.2 Was the result of the sensitivity or residual analysis used to remove abnormal data
points if necessary?
3. Were accuracy statistics based on the raw data scale?
4. How good was the study comparison method?
4.1 Was the single company selected at random (not selected for convenience) from
several different companies?
4.2 Was the comparison based on an independent hold-out sample (0.5), random subsets (0.33), leave-one-out (0.17), or no hold-out (0)? The scores used for this item reflect the researchers' opinion regarding the stringency of each criterion.
5. The size of the within-company data set, measured according to the criteria presented
below. Whenever a study used more than one within-company data set, the average score
was used:
• Less than 10 projects: Poor quality (score = 0)
• Between 10 and 20 projects: Fair quality (score = 0.33)
• Between 21 and 40 projects: Good quality (score = 0.67)
• More than 40 projects: Excellent quality (score = 1)
They also considered the reporting quality based on 4 questions:
1. Is it clear what projects were used to construct each model?
2. Is it clear how accuracy was measured?
3. Is it clear what cross-validation method was used?
4. Were all model construction methods fully defined (tools and methods used)?
It is good practice not to combine quality-of-study and quality-of-reporting scores in a single metric; however, Kitchenham et al. proposed using a weighted measure that gives less weight to the reporting quality score.
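A sketch of how such a weighted combination might be computed is shown below. The 0.8/0.2 split is a hypothetical weighting chosen for illustration, not the weighting Kitchenham et al. actually used:

```python
# Minimal sketch: combine study-quality and reporting-quality scores into one
# weighted measure. The weights are hypothetical, purely for illustration.

STUDY_WEIGHT = 0.8    # assumed weight for study quality
REPORT_WEIGHT = 0.2   # assumed (smaller) weight for reporting quality

def weighted_quality(study_items, reporting_items):
    """Each argument is a list of item scores in [0, 1], e.g. 0, 0.33, 0.67, 1."""
    study_score = sum(study_items) / len(study_items)
    reporting_score = sum(reporting_items) / len(reporting_items)
    return STUDY_WEIGHT * study_score + REPORT_WEIGHT * reporting_score

# Five study-quality items and four reporting-quality items (invented scores).
print(round(weighted_quality([1, 0.5, 1, 0.33, 0.67], [1, 1, 0, 1]), 2))  # 0.71
```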
Kitchenham et al.’s quality questionnaire was based on the specific nature of the primary
studies (such as the method of cross-validation used) as well as more general quality issues
(such as sample size, and sensitivity analysis).
Jørgensen [17] did not undertake a specific quality assessment of the primary studies.
Table 5 Summary Quality Checklist for Quantitative Studies

The four applicability columns are, from left to right: quantitative empirical studies (no specific type), correlation (observational) studies, surveys, and experiments; an X marks the study types to which a question applies.

| Question | Applicability | Source |
|---|---|---|
| Design | | |
| Are the aims clearly stated? | X X X X | [11], [10] |
| Are the measures used in the study the most relevant ones for answering the research questions? | X X X X | [11], [19], [25] |
| Is the scope (size and length) of the study sufficient to allow for changes in the outcomes of interest to be identified? | X X X | [19], [12], [25] |
| Conduct | | |
| Did untoward events occur during the study? | X X X X | [10] |
| Was outcome assessment blind to treatment group? | X X | [19], [12], [25] |
| Are the data collection methods adequately described? | X X X X | [11] |
| If two groups are being compared, were they treated similarly within the study? | X | [12], [25] |
| If the study involves participants over time, what proportion of people who enrolled at the beginning dropped out? | X X X | [10], [11] |
| How was the randomisation carried out? | X | [10] |
| Analysis | | |
| What was the response rate? | X | [10], [25] |
| Was the denominator (i.e. the population size) reported? | X | [25] |
| Do the researchers explain the data types (continuous, ordinal, categorical)? | X X X X | [11] |
| Are the study participants or observational units adequately described? For example, SE experience, type (student, practitioner, consultant), nationality, task experience and other relevant variables. | X X X X | [12], [25] |
| Were the basic data adequately described? | X X X X | [10] |
| Have "drop outs" introduced bias? | X X X | [11], [12], [25] |
| Are reasons given for refusal to participate? | X X X | [11] |
| Are the statistical methods described? | X X X X | [10], [11], [19] |
| Is the statistical program used to analyse the data referenced? | X X X X | [11] |
| Are the statistical methods justified? | X X X X | [11] |
| Is the purpose of the analysis clear? | X X X X | [11] |
| Are scoring systems described? | X X | [11] |
| Are potential confounders adequately controlled for in the analysis? | X X X X | [11] |
| Do the numbers add up across different tables and subgroups? | X X X X | [10], [11] |
| If different groups were different at the start of the study or treated differently during the study, was any attempt made to control for these differences, either statistically or by matching? | X X X | [12], [25] |
| If yes, was it successful? | X X X | [25] |
| Was statistical significance assessed? | X X X X | [10] |
| If statistical tests are used to determine differences, is the actual p value given? | X X X X | [11] |
| If the study is concerned with differences among groups, are confidence limits given describing the magnitude of any observed differences? | X X X | [11] |
| Is there evidence of multiple statistical testing or large numbers of post hoc analysis? | X X X X | [10], [25] |
| How could selection bias arise? | X X X | [10], [25] |
| Were side-effects reported? | | [10] |
| Conclusions | | |
| Are all study questions answered? | X X X X | [11] |
| What do the main findings mean? | X X X X | [10] |
| Are negative findings presented? | X X X X | [11] |
| If statistical tests are used to determine differences, is practical significance discussed? | X X X X | [11] |
| If drop outs differ from participants, are limitations to the results discussed? | X X X | [11] |
| How are null findings interpreted? (i.e. has the possibility that the sample size is too small been considered?) | X X X X | [10], [12] |
| Are important effects overlooked? | X X X X | [10] |
| How do results compare with previous reports? | X X X X | [10] |
| How do the results add to the literature? | X X X X | [12] |
| What implications does the report have for practice? | X X X X | [10] |
| Do the researchers explain the consequences of any problems with the validity/reliability of their measures? | X X X X | [11] |
If a review includes qualitative studies, it will be necessary to assess their quality.
Table 6 provides a checklist for assessing the quality of qualitative studies.
It is of course possible to have both types of quality data in the same systematic
review.
Example
Kitchenham et al. [21] used the quality score to investigate whether the results of the primary
study were associated with study quality. They also investigated whether some of the
individual quality factors (i.e. sample size, validation method) were associated with primary
study outcome.
There is limited evidence of relationships between factors that are thought to affect
validity and actual study outcomes. Evidence suggests that inadequate concealment of
allocation and lack of double-blinding result in over-estimates of treatment effects,
but there is little empirical evidence about the impact of other quality factors.
The objective of this stage is to design data extraction forms to accurately record the
information researchers obtain from the primary studies. To reduce the opportunity
for bias, data extraction forms should be defined and piloted when the study protocol
is defined.
In most cases, data extraction will define a set of numerical values that should be
extracted for each study (e.g. number of subjects, treatment effect, confidence
intervals, etc.). Numerical data are important for any attempt to summarise the results
of a set of primary studies and are a prerequisite for meta-analysis (i.e. statistical
techniques aimed at integrating the results of the primary studies).
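Where extraction is managed electronically, the form can be mirrored by a structured record so that values feed directly into synthesis and consistency checks. A minimal sketch, with illustrative field names only:

```python
# Minimal sketch: a structured record mirroring a data extraction form.
# Field names are illustrative; a real form follows the review protocol.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ExtractionRecord:
    study_id: str                             # identifier from study selection
    extractor: str                            # who completed the form
    checker: Optional[str] = None             # who checked the extracted data
    n_subjects: Optional[int] = None          # sample size, if reported
    treatment_effect: Optional[float] = None  # effect estimate, if reported
    confidence_interval: Optional[Tuple[float, float]] = None
    notes: str = ""

record = ExtractionRecord(
    study_id="S001",
    extractor="researcher A",
    checker="researcher B",
    n_subjects=20,
    treatment_effect=0.35,
    confidence_interval=(0.10, 0.60),
)
```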
Examples
Kitchenham et al. [21] used the extraction form shown in Table 7 (note the actual form also
included the quality questions).
Cross-company model

What technique(s) was used to construct the cross-company model?
  Answer: A preliminary productivity analysis was used to identify factors for inclusion in the effort estimation model. Generalised linear models (using SAS); multiplicative and additive models were investigated. The multiplicative model is a logarithmic model.

If several techniques were used, which was most accurate?
  Answer: In all cases, accuracy assessment was based on the logarithmic models, not the additive models.
  Notes: It can be assumed that linear models did not work well.

What transformations, if any, were used?
  Answer: Not clear whether the variables were transformed or the GLM was used to construct a log-linear model.
  Notes: Not important: the log models were used and they were presented in the raw data form – thus any accuracy metrics were based on raw data predictions.

What variables were included in the cross-company model?
  Answer: KLOC, Language subset, Category subset, RELY.
  Notes: Category is the type of application. RELY is reliability as defined by Boehm (1981).

What cross-validation method was used?
  Answer: A hold-out sample of 9 projects from the single company was used to assess estimate accuracy.

Was the cross-company model compared to a baseline to check if it was better than chance?
  Answer: Yes.
  Notes: The baseline was the correlation between the estimates and the actuals for the hold-out.

What was/were the measure(s) used as benchmark?
  Answer: The correlation between the prediction and the actual for the single company was tested for statistical significance. (Note it was significantly different from zero for the 20 project data set, but not the 9 project hold-out data set.)

Within-company model

What technique(s) was used to construct the within-company model?
  Answer: A preliminary productivity analysis was used to identify factors for inclusion in the effort estimation model.

What cross-validation method was used?
  Answer: A hold-out sample from the single company was used to assess estimate accuracy.

Comparison

What was the accuracy obtained using the cross-company model?
  Answer: Accuracy on main single company data set (log model): n=11 (9 projects omitted), MMRE=50%, Pred(25)=27%, r=0.83. Accuracy on single company hold-out data set: n=4 (5 projects omitted), MMRE=36%, Pred(25)=25%, R=0.16 (n.s.).
  Notes: Using the 79 cross-company projects, Maxwell et al. identified the best model for that dataset and the best model for the single company data. The two models were identical. This data indicates that for all the single company projects: n=15, Pred(25)=26.7% (4 of 15), MMRE=46.3%.

What was the accuracy obtained using the within-company model?
  Answer: Accuracy on main single company data set (log model): n=14 (6 projects omitted), R²=0.92, MMRE=41%, Pred(25)=36%, r=0.99. Accuracy on single company hold-out data set: n=6 (3 projects omitted), MMRE=65%, Pred(25)=50% (3 of 6), r=0.96.

What measure was used to check the statistical significance of prediction accuracy (e.g. absolute residuals, MREs)?
  Answer: Estimated and actual effort.

What statistical tests were used to compare the results?
  Answer: r, the correlation between the prediction and the actual.

Data Summary

Database summary (all projects) for size and effort metrics:
  Answer: Effort min: 7.8 MM; Effort max: 4361 MM; Effort mean: 284 MM; Effort median: 93 MM; Size min: 2000 KLOC; Size max: 413000 KLOC; Size mean: 51010 KLOC; Size median: 22300 KLOC.
  Notes: KLOC: non-blank, non-comment delivered 1000 lines. For reused code Boehm's adjustments were made (Boehm, 1981). Effort was measured in man months, with 144 man hours per man month.

Within-company data summary for size and effort metrics:
  Answer: Not specified (effort min/max/mean/median and size min/max/mean/median).
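The accuracy statistics quoted in the completed form (MRE, MMRE, Pred(25)) are standard measures in the cost estimation literature rather than terms defined in these guidelines; for reference, their usual definitions are:

```latex
% Standard definitions of the accuracy measures used above.
\[
\mathrm{MRE}_i \;=\; \frac{\lvert \mathit{actual}_i - \mathit{predicted}_i \rvert}{\mathit{actual}_i},
\qquad
\mathrm{MMRE} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathrm{MRE}_i,
\]
\[
\mathrm{Pred}(25) \;=\; \frac{100}{n}\,\bigl\lvert \{\, i : \mathrm{MRE}_i \le 0.25 \,\} \bigr\rvert
\quad \text{(the percentage of predictions within 25\% of the actual).}
\]
```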
Jørgensen [17] extracted design factors and primary study results. Design factors included:
• Study design
• Estimation method selection process
• Estimation models
• Calibration level
• Model use expertise and degree of mechanical use of model
• Expert judgment process
• Expert judgement estimation expertise
• Possible motivational biases in estimation situation
• Estimation input
• Contextual information
• Estimation complexity
• Fairness limitations
• Other design issues
Study results included:
• Accuracy
• Variance
• Other results
Jørgensen’s article includes the completed extraction form for each primary study.
6.4.3 Data extraction procedures
If several researchers each review different primary studies because time or resource
constraints prevent all primary papers being assessed by at least two researchers, it is
important to employ some method of checking that researchers extract data in a
consistent manner. For example, some papers should be reviewed by all researchers
(e.g. a random sample of primary studies), so that inter-researcher consistency can be
assessed.
For single researchers, such as PhD students, other checking techniques must be used. For example, supervisors could perform data extraction on a random sample of the primary studies, and their results could then be cross-checked against the student's. Alternatively, a test-retest process can be used, in which the researcher performs a second extraction from a random selection of primary studies to check the consistency of the data extraction.
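Agreement between two extractions of the same categorical field (whether by two researchers, or by one researcher in a test-retest) can be quantified with Cohen's kappa statistic (cf. [9]). The following Python sketch is a minimal illustration using invented category labels; it implements the standard unweighted kappa and is not a procedure prescribed by these guidelines.

```python
# Sketch: Cohen's kappa for checking data extraction consistency on a
# categorical field (the labels below are invented for illustration).
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# e.g. a study-design category extracted by a PhD student and a supervisor
student    = ["experiment", "survey", "experiment", "case study", "survey"]
supervisor = ["experiment", "survey", "case study", "case study", "survey"]
print(f"kappa = {cohen_kappa(student, supervisor):.2f}")  # ~0.71 for these labels
```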
Examples
Kitchenham et al. [21] assigned one person to be the data extractor, who completed the data extraction form, and another person to be the data checker, who confirmed that the data on the extraction form were correct. Because Kitchenham and Mendes co-authored some of the primary studies, they also ensured that the data extractor was never a co-author of the primary study. Any disagreements were examined and an agreed final data value recorded.
As a single researcher, Jørgensen [17] extracted all the data himself. However, he sent the
data from each primary study to an author of the study and requested that they inform him if
any of the extracted data was incorrect.
6.4.4 Multiple publications of the same data

When there are duplicate publications of the same study, the most complete should be used. It may even be necessary to consult all versions of the report to obtain all the necessary data.
6.4.5 Unpublished data, missing data and data requiring manipulation
Reports do not always include all relevant data. They may also be poorly written and
ambiguous. Again the authors should be contacted to obtain the required information.
Sometimes primary studies do not provide all the data but it is possible to recreate the
required data by manipulating the published data. If any such manipulations are
required, data should first be reported in the way they were published. Data obtained
by manipulation should be subject to sensitivity analysis.
6.5 Data Synthesis
Data synthesis involves collating and summarising the results of the included primary
studies. Synthesis can be descriptive (non-quantitative). However, it is sometimes
possible to complement a descriptive synthesis with a quantitative summary. Using
statistical techniques to obtain a quantitative synthesis is referred to as meta-analysis.
Description of meta-analysis methods is beyond the scope of this document, although
techniques for displaying quantitative results will be described. (To learn more about
meta-analysis see [7].)
The data synthesis activities should be specified in the review protocol. However,
some issues cannot be resolved until the data is actually analysed, for example, subset
analysis to investigate heterogeneity is not required if the results show no evidence of
heterogeneity.
It is important to identify whether the results from studies are consistent with one another (i.e. homogeneous) or inconsistent (i.e. heterogeneous). Results may be tabulated to display the impact of potential sources of heterogeneity, e.g. study type, study quality, and sample size.
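Where quantitative effect estimates and their variances are available, heterogeneity can also be checked statistically, for example with Cochran's Q statistic. A minimal Python sketch follows; the effect sizes and variances are invented for illustration.

```python
# Sketch: Cochran's Q statistic for heterogeneity across study effects.
# Effect sizes and variances below are invented for illustration.
effects   = [0.42, 0.55, 0.38, 0.61]   # per-study effect estimates
variances = [0.04, 0.09, 0.05, 0.12]   # per-study sampling variances

weights = [1 / v for v in variances]            # inverse-variance weights
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Under the null hypothesis that all studies share one true effect, Q follows
# approximately a chi-square distribution with k-1 degrees of freedom.
q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1
print(f"pooled effect = {pooled:.3f}, Q = {q:.2f} on {df} df")
```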
Examples
Kitchenham et al. [21] tabulated the data from the primary studies in three separate tables based on the outcome of the primary study: no significant difference between the cross-company model and the within-company model; within-company model significantly better than the cross-company model; and no statistical tests performed. They also highlighted studies that they believed should be excluded from the synthesis because they were complete replications (in terms of both the cross-company database and the within-company database) and therefore did not offer additional independent evidence.
They concluded that small companies producing specialised (niche) software would not
benefit from using a cross-company estimation model. Large companies producing
applications of similar size range to the cross-company projects might find cross-company
models helpful.
Jørgensen [17] tabulated the studies according to the relative accuracy of the model and the
experts. Thus he considered the accuracy of the most accurate expert and least accurate
expert compared with the most accurate and least accurate models. He also considered the
average accuracy of the models and the experts. He coded the studies chronologically (as did
Kitchenham et al.), so it was possible to look for associations between study age and outcome.
He concluded that models are not systematically better than experts for software cost
estimation, possibly because experts possess more information than models or it may be
difficult to build accurate software development estimation models. Expert opinion is likely to
be useful if models are not calibrated to the company using them and/or experts have access
to important contextual information that they are able to exploit. Models (or a combination of
models and experts) may be useful when there are situational biases towards overoptimism,
experts do not have access to large amounts of contextual information, and/or models are
calibrated to the environment.
…successful in reducing risk; for a desirable outcome a value greater than one indicates that the intervention was successful.
• Relative risk (RR) (risk ratio, rate ratio). The ratio of the risk in the intervention group to the risk in the control group. An RR of one indicates no difference between the comparison groups. For undesirable events an RR less than one indicates the intervention was successful; for desirable events an RR greater than one indicates the intervention was successful.
• Absolute risk reduction (ARR) (risk difference, rate difference). The absolute difference in the event rate between the comparison groups. A difference of zero indicates no difference between the groups. For an undesirable outcome an ARR less than zero indicates a successful intervention; for a desirable outcome an ARR greater than zero indicates a successful intervention.
Each of these measures has advantages and disadvantages. For example, odds and odds ratios are criticised for not being well understood by non-statisticians (other than gamblers), whereas risk measures are generally easier to understand. However, statisticians prefer odds ratios because they have some mathematically desirable properties. Another issue is that relative measures are generally more consistent than absolute measures for statistical analysis, but decision makers need absolute values in order to assess the real benefit of an intervention.
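For concreteness, the following Python sketch computes all three measures from a single 2x2 outcome table; the event counts are invented for illustration only.

```python
# Sketch: OR, RR and ARR from a 2x2 outcome table (invented counts).
events_t, n_t = 12, 100   # events and sample size in the intervention group
events_c, n_c = 24, 100   # events and sample size in the control group

risk_t, risk_c = events_t / n_t, events_c / n_c
odds_t = events_t / (n_t - events_t)
odds_c = events_c / (n_c - events_c)

odds_ratio = odds_t / odds_c       # OR < 1: fewer (undesirable) events
relative_risk = risk_t / risk_c    # RR = 1 means no difference
arr = risk_t - risk_c              # absolute risk reduction (risk difference)

print(f"OR={odds_ratio:.2f}, RR={relative_risk:.2f}, ARR={arr:.2f}")
# OR=0.43, RR=0.50, ARR=-0.12 for these counts
```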
Figure 1 represents the ideal result of a quantitative summary: the results of the studies basically agree. There is clearly a genuine treatment effect, and a single overall summary statistic would be a good estimate of that effect. If effects were very different from study to study, the results would suggest heterogeneity, and a single overall summary statistic would probably be of little value. The systematic review should then continue with an investigation of the reasons for heterogeneity.
To avoid the problems of post-hoc analysis (i.e. “fishing” for results), researchers
should identify possible sources of heterogeneity when they construct the review
protocol. For example, studies of different types may have different results, so it is
often useful to synthesise the results of different study types separately and assess
whether the results are consistent across the different study types.
[Figure 1: forest plot of the treatment effects reported by Study 1, Study 2 and Study 3.]
6.5.5 Synthesis of qualitative and quantitative studies
When researchers have a systematic literature review that includes quantitative and
qualitative studies, they should:
• Synthesise the quantitative and qualitative studies separately.
• Then attempt to integrate the qualitative and quantitative results by investigating whether the qualitative results can help explain the quantitative results. For example, qualitative studies can suggest reasons why a treatment does or does not work in specific circumstances.
6.5.6 Sensitivity analysis
When a formal meta-analysis is not undertaken but quantitative results have been tabulated, forest plots can be annotated to identify high-quality primary studies, or the studies can be presented in decreasing order of quality or of study type hierarchy. Primary studies where there are queries about the data extracted can also be explicitly identified on the forest plot, for example by using grey colouring for less reliable studies and black colouring for reliable studies (see the sketch below).
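As an illustrative sketch only, the following Python fragment produces a forest plot with the grey/black annotation described above using matplotlib; the studies, effects and confidence intervals are invented.

```python
# Sketch: an annotated forest plot (invented data); grey markers flag
# studies whose extracted data were queried.
import matplotlib.pyplot as plt

studies = ["Study 1", "Study 2", "Study 3"]
effects = [0.45, 0.60, 0.45]          # treatment effect per study
ci_half = [0.25, 0.30, 0.15]          # half-width of 95% confidence interval
reliable = [True, False, True]        # False = data extraction was queried

fig, ax = plt.subplots()
for i, ok in enumerate(reliable):
    colour = "black" if ok else "grey"
    ax.errorbar(effects[i], i, xerr=ci_half[i], fmt="s",
                color=colour, capsize=3)
ax.axvline(0, linestyle="--", color="grey")   # line of no effect
ax.set_yticks(range(len(studies)))
ax.set_yticklabels(studies)
ax.invert_yaxis()                             # first study at the top
ax.set_xlabel("Treatment effect")
plt.show()
```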
Examples
Jørgensen [17] reported the results of field studies as well as the results of all studies based
on the argument that field studies would have more external validity.
In a study of the Technology Acceptance Model (TAM), Turner et al. [29] investigated the relationship between the TAM variables Perceived Ease of Use (PEU) and Perceived Usefulness (PU) and Actual Use measured subjectively and objectively. As part of their sensitivity analysis they investigated the impact on their results of removing primary studies authored by the researcher who developed the TAM.
[Figure: funnel plot; vertical axis 1/variance, horizontal axis treatment effect.]
7. Reporting the review (Dissemination)
The final phase of a systematic review involves writing up the results of the review
and circulating the results to potentially interested parties.
7.1 Specifying the Dissemination Strategy
Academics usually assume that dissemination is about reporting results in academic
journals and/or conferences. However, if the results of a systematic review are
intended to influence practitioners, other forms of dissemination are necessary. In
particular:
1. Practitioner-oriented journals and magazines
2. Press Releases to the popular and specialist press
3. Short summary leaflets
4. Posters
5. Web pages
6. Direct communication to affected bodies.
A journal or conference paper will normally have a size restriction. In order to ensure
that readers are able to properly evaluate the rigour and validity of a systematic
review, journal papers should reference a technical report or thesis that contains all
the details.
7.2 Formatting the Main Systematic Review Report
The structure and contents of reports suggested in [19] are presented in Table 8. This
structure is appropriate for technical reports and journals. For PhD theses, the entries
marked with an asterisk are not likely to be relevant.
7.3 Evaluating Systematic Review Reports
Journal articles will be peer reviewed as a matter of course. Experts review PhD
theses as part of the examination process. In contrast, technical reports are not usually
subjected to any independent evaluation. However, if systematic reviews are published on the Web so that results reach researchers and practitioners quickly, it is worth organising a peer review. If an expert panel was assembled to review the study protocol, the same panel would be appropriate to undertake peer review of the systematic review report; otherwise, several researchers with expertise in the topic area and/or systematic review methodology should be approached to review the report.
The evaluation process can use the quality checklists for systematic literature reviews
discussed in Section 5.1.
7.4 Lessons Learned about Reporting Systematic Literature Reviews
Brereton et al. [5] identified two issues of importance when reporting systematic literature reviews:
• Review teams need to keep a detailed record of decisions made throughout the
review process.
• The software engineering community needs to establish mechanisms for
publishing systematic literature reviews which may result in papers that are longer
than those traditionally accepted by many software engineering outlets or that
have appendices stored in electronic repositories.
Staples and Niazi [27] also emphasize the need to keep a record of what happens
during the conduct of the review. They point out that you need to report deviations
from the protocol.
Table 8 Structure and Contents of Reports of Systematic Reviews

Title*
  Comments: The title should be short but informative. It should be based on the question being asked. In journal papers, it should indicate that the study is a systematic review.

Authorship*
  Comments: When research is done collaboratively, criteria for determining both who should be credited as an author and the order of authors' names should be defined in advance. The contribution of workers not credited as authors should be noted in the Acknowledgements section.

Executive summary or Structured Abstract*
  Context: The importance of the research questions addressed by the review.
  Objectives: The questions addressed by the systematic review.
  Methods: Data sources, study selection, quality assessment and data extraction.
  Results: Main findings including any meta-analysis results and sensitivity analyses.
  Conclusions: Implications for practice and future research.
  Comments: A structured summary or abstract allows readers to assess quickly the relevance, quality and generality of a systematic review.

Background
  Scope: Justification of the need for the review. Summary of previous reviews.
  Comments: Description of the software engineering technique being investigated and its potential importance.

Review questions
  Scope: Each review question should be specified.
  Comments: Identify primary and secondary review questions. Note this section may be included in the background section.

Review Methods
  Subsections: Data sources and search strategy; Study selection; Study quality assessment; Data extraction; Data synthesis.
  Comments: This should be based on the research protocol. Any changes to the original protocol should be reported.

Included and excluded studies
  Scope: Inclusion and exclusion criteria. List of excluded studies with the rationale for exclusion.
  Comments: Study inclusion and exclusion criteria can sometimes best be represented as a flow diagram, because studies will be excluded at different stages in the review for different reasons.

Results
  Findings: Description of primary studies. Results of any quantitative summaries. Details of any meta-analysis. Sensitivity analysis.
  Comments: Non-quantitative summaries should be provided to summarise each of the studies and presented in tabular form. Quantitative summary results should be presented in tables and graphs.

Discussion
  Principal findings: These must correspond to the findings discussed in the results section.
  Strengths and weaknesses: Strengths and weaknesses of the evidence included in the review. Relation to other reviews, particularly considering any differences in quality and results. A discussion of the validity of the evidence considering bias in the systematic review allows a reader to assess the reliance that may be placed on the collected evidence.
  Meaning of findings: Direction and magnitude of effect observed in summarised studies. Applicability (generalisability) of the findings. Make clear to what extent the results imply causality by discussing the level of evidence. Discuss all benefits, adverse effects and risks. Discuss variations in effects and their reasons (for example, are the treatment effects larger on larger projects?).

Conclusions
  Recommendations: Practical implications for software development. Unanswered questions and implications for future research. What are the implications of the results for practitioners?

Acknowledgements*
  Scope: All persons who contributed to the research but did not fulfil authorship criteria.

Conflict of Interest
  Comments: Any secondary interest on the part of the researchers (e.g. a financial interest in the technology being evaluated) should be declared.

References and Appendices
  Comments: Appendices can be used to list studies included in and excluded from the study, to document search strategy details, and to list raw data from the included studies.
8 Systematic Mapping Studies
Systematic Mapping Studies (also known as Scoping Studies) are designed to provide
a wide overview of a research area, to establish if research evidence exists on a topic
and provide an indication of the quantity of the evidence. The results of a mapping
study can identify areas suitable for conducting Systematic Literature Reviews and
also areas where a primary study is more appropriate. Mapping Studies may be requested by an external body before it commissions a systematic review, to allow more cost-effective targeting of its resources. They are also useful to PhD students
who are required to prepare an overview of the topic area in which they will be
working. As an example of a mapping study see Bailey et al.’s mapping study which
aimed at investigating the extent to which software design methods are supported by
empirical evidence [3].
The main differences between a mapping study and systematic review are:
• Mapping studies generally have broader research questions driving them and often
ask multiple research questions.
• The search terms for mapping studies will be less highly focused than those for systematic reviews and are likely to return a very large number of studies. For a mapping study, however, this is less of a problem than it is during the search phase of a systematic review, because the aim here is broad coverage rather than narrow focus.
• The data extraction process for mapping studies is also much broader than the data extraction process for systematic reviews, and can more accurately be termed a classification or categorisation stage. The purpose of this stage is to classify papers in sufficient detail to answer the broad research questions and to identify papers for later reviews, without it becoming a time-consuming task.
• The analysis stage of a mapping study is about summarising the data to answer the research questions posed. It is unlikely to include in-depth analysis techniques such as meta-analysis or narrative synthesis; instead it relies on totals and summaries. Graphical representations of study distributions by classification type may be an effective reporting mechanism (see the sketch after this list).
• Dissemination of the results of a mapping study may be more limited than for a systematic review: it may be restricted to commissioning bodies and academic publications, with the aim of influencing the future direction of primary research.
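As a minimal sketch of the kind of graphical summary suggested above, the following Python fragment plots a bar chart of study counts by classification type; the category labels and counts are invented for illustration.

```python
# Sketch: a mapping study's distribution of papers by classification type
# (categories and counts invented for illustration).
from collections import Counter
import matplotlib.pyplot as plt

categories = ["case study", "experiment", "survey", "experiment",
              "case study", "experiment", "opinion"]
counts = Counter(categories)

plt.bar(list(counts.keys()), list(counts.values()))
plt.ylabel("Number of papers")
plt.title("Study distribution by classification type")
plt.show()
```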
9 Final remarks
This report has presented a set of guidelines for planning, conducting, and reporting a
systematic review. The previous versions of these guidelines were based on guidelines
used in medical research. However, it is important to recognise that software
engineering research is not the same as medical research. We do not undertake
randomised clinical trials, nor can we use blinding as a means to reduce distortions
due to experimenter and subject expectations. For this reason, this version of the
guidelines has incorporated information from text books authored by researchers from
the social sciences.
These guidelines are intended to assist PhD students as well as larger research groups.
However, many of the steps in a systematic review assume that it will be undertaken
by a large group of researchers. In the case of a single researcher (such as a PhD
student), we suggest the most important steps to undertake are:
• Developing a protocol.
• Defining the research question.
• Specifying what will be done to address the problem of a single researcher
applying inclusion/exclusion criteria and undertaking all the data extraction.
• Defining the search strategy.
• Defining the data to be extracted from each primary study including quality data.
• Maintaining lists of included and excluded studies.
• Using the data synthesis guidelines.
• Using the reporting guidelines.
In our experience this “light” version of a systematic review is manageable for PhD
students. Furthermore, research students often find the well-defined nature of a
systematic review helpful both for initial scoping exercises and for more detailed
studies that are necessary to position their specific research questions.
10 References
[1] Australian National Health and Medical Research Council. How to review the evidence: systematic identification and review of the scientific literature, 2000. ISBN 186-4960329.
[2] Australian National Health and Medical Research Council. How to use the
evidence: assessment and application of scientific evidence. February 2000,
ISBN 0 642 43295 2.
[3] Bailey, J., Budgen, D., Turner, M., Kitchenham, B., Brereton, P. and Linkman,
S. Evidence relating to Object-Oriented software design: A survey. ESEM07.
[4] Berlin, J.A., Miles, C.G., Crigliano, M.D. Does blinding of readers affect the
results of meta-analysis? Online J. Curr. Clin. Trials, 1997: Doc No 205.
[5] Brereton, Pearl , Kitchenham, Barbara A., Budgen, David, Turner, Mark and
Khalil, Mohamed. Lessons from applying the systematic literature review
process within the software engineering domain. JSS 80, 2007, pp 571-583.
[6] Budgen, David, Stuart Charters, Mark Turner, Pearl Brereton, Barbara
Kitchenham and Stephen Linkman Investigating the Applicability of the
Evidence-Based Paradigm to Software Engineering, Proceedings of WISER
Workshop, ICSE 2006, 7-13, May 2006, ACM Press.
[7] Cochrane Collaboration. Cochrane Reviewers’ Handbook. Version 4.2.1.
December 2003
[8] Cochrane Collaboration. The Cochrane Reviewers’ Handbook Glossary,
Version 4.1.5, December 2003.
[9] Cohen, J. Weighted Kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. (70) 1968, pp. 213-220.
[10] Crombie, I.K. The Pocket Guide to Appraisal, BMJ Books, 1996.
[11] Fink, A. Conducting Research Literature Reviews. From the Internet to Paper,
Sage Publication, Inc., 2005.
[12] Greenhalgh, Trisha. How to read a paper: The Basics of Evidence-Based
Medicine. BMJ Books, 2000.
[13] Hart, Chris. Doing a Literature Review. Releasing the Social Science Research
Imagination. Sage Publications Ltd., 1998.
[14] Jasperson, Jon (Sean), Butler, Brian S., Carte, Traci A., Croes, Henry J.P., Saunders, Carol S., and Zheng, Weijun. Review: Power and Information Technology Research: A Metatriangulation Review. MIS Quarterly, 26(4): 397-459, December 2002.
[15] Jørgensen, M., and Shepperd, M. A Systematic Review of Software Development Cost Estimation Studies. IEEE Transactions on SE, 33(1), 2007, pp 33-53.
[16] Jørgensen, M. A review of studies on expert estimation of software development effort, Journal of Systems and Software, 70, 2004, pp 37-60.
[17] Jørgensen, M. Estimation of Software Development Work Effort: Evidence on
Expert Judgment and Formal Models, International Journal of Forecasting,
2007.
[18] Jørgensen, M. Evaluation of guidelines for performing systematic literature
reviews in software engineering, version 2.2, 2007
[19] Khan, Khalid S., ter Riet, Gerben, Glanville, Julia, Sowden, Amanda J. and Kleijnen, Jos (eds) Undertaking Systematic Reviews of Research on Effectiveness. CRD's Guidance for those Carrying Out or Commissioning Reviews. CRD Report Number 4 (2nd Edition), NHS Centre for Reviews and Dissemination, University of York, ISBN 1 900640 20 1, March 2001.
[20] Khan, Khalid, S., Kunz, Regina, Kleijnen, Jos and Antes, Gerd. Systematic
Reviews to Support Evidence-based Medicine, The Royal Society of Medicine
Press Ltd., 2003.
[21] Kitchenham, B., Mendes, E., Travassos, G.H. (2007) A Systematic Review of
Cross- vs. Within-Company Cost Estimation Studies, IEEE Trans on SE, 33 (5),
pp 316-329.
[22] Lawlor, Debbie A., Davey Smith, George, Bruckdorfer, K. Richard, Kundu, Devi and Ebrahim, Shah. Those confounded vitamins: what can we learn from the differences between observational versus randomised trial evidence? The Lancet, vol 363, Issue 9422, 22 May, 2004.
[23] Noblit, G.W. and Hare, R.D. Meta-Ethnography: Synthesizing Qualitative
Studies. Sage Publications, 1988.
[24] Pai, Madhukar., McCulloch, Michael., Gorman, Jennifer D., Pai, Nitika,
Enanoria, Wayne, Kennedy, Gail, Tharyan, Prathap, and Colford, John, M. Jr.
Systematic reviews and meta-analyses: An illustrated, step-by-step guide, The
National Medical Journal of India, 17(2), 2004, pp 84-95.
[25] Petticrew, Mark and Helen Roberts. Systematic Reviews in the Social Sciences:
A Practical Guide, Blackwell Publishing, 2005, ISBN 1405121106
[26] Shadish, W.R., Cook, Thomas, D. and Campbell, Donald, T. Experimental and
Quasi-experimental Designs for Generalized Causal Inference. Houghton
Mifflin Company, 2002.
[27] Staples, M. and Niazi, M. Experiences using systematic review guidelines.
Article available online, JSS.
[28] Sutcliffe, T.J., Harden, K., Oakley, A., Oliver, A., Rees, S., Brunton, R. and Kavanagh, G. Children and Healthy Eating: A systematic review of barriers and facilitators, London, EPPI-Centre, Social Science Research Unit, Institute of Education, University of London, October 2003.
[29] Turner, M., Kitchenham, B., Budgen, D., Charters, S. and Brereton, P. A Systematic Literature Review of the Technology Acceptance Model and its Predictive Capabilities, Keele University and University of Durham Joint Technical Report, 2007.
Appendix 1 Steps in a systematic review
Guidelines for systematic reviews in the medical domain take different views of the process steps needed in a systematic review. The Systematic Reviews Group (UC Berkeley) presents a very detailed process model [24]; other sources present a coarser-grained process. These process steps are summarised in Table 9, which also attempts to collate the different processes.
Table 9 Systematic review process proposed in different sources

Planning:
• Identification of the need for a review.
• Preparation of a proposal for a systematic review.

Finding and selecting studies:
• Systematic Reviews Group [24]: Identify appropriate databases/sources. Researchers (at least 2) screen titles & abstracts; researchers meet & resolve differences; get full texts of all articles; researchers do a second screen; the articles remaining after the second screen are the final set for inclusion.
• Australian National Health and Medical Research Council [1]: Finding studies.
• Cochrane Reviewers' Handbook [7]: Locating and selecting studies for reviews.
• CRD Guidance [19]: Identification of research.
• Petticrew and Roberts [25]: Define inclusion/exclusion criteria.
• Fink [11]: Select bibliographic databases and Web sites; apply practical screening criteria.

Quality assessment:
• Systematic Reviews Group [24]: Researchers extract data, including quality data.
• Australian National Health and Medical Research Council [1]: Appraisal and selection of studies.
• Cochrane Reviewers' Handbook [7]: Assessment of study quality.
• CRD Guidance [19]: Study quality assessment.
• Petticrew and Roberts [25]: Assess study quality.
• Fink [11]: Apply methodological quality screen.

Data extraction:
• Systematic Reviews Group [24]: Researchers meet to resolve disagreements on data; compute inter-rater reliability; enter data into database management software.
• Cochrane Reviewers' Handbook [7]: Collecting data.
• CRD Guidance [19]: Data extraction & monitoring progress.
• Fink [11]: Train reviewers; pilot the reviewing process; do the review.

Data synthesis:
• Systematic Reviews Group [24]: Import data and analyse using meta-analysis software; pool data if appropriate; look for heterogeneity.
• Australian National Health and Medical Research Council [1]: Summary and synthesis of relevant studies.
• Cochrane Reviewers' Handbook [7]: Analysing & presenting results.
• CRD Guidance [19]: Data synthesis.
• Petticrew and Roberts [25]: Synthesize the evidence; explore heterogeneity and publication bias.
• Fink [11]: Synthesize the results; produce a descriptive review or perform meta-analysis.

Interpretation, reporting and dissemination:
• Systematic Reviews Group [24]: Interpret & present data; discuss generalizability of conclusions and limitations of the review; make recommendations for practice or policy, & research.
• Australian National Health and Medical Research Council [1]: Determining the applicability of results; reviewing and appraising the economics literature.
• Cochrane Reviewers' Handbook [7]: Interpreting the results.
• CRD Guidance [19]: The report and recommendations; getting evidence into practice.
• Petticrew and Roberts [25]: Disseminate the results.
Appendix 2 Software engineering SLRs published between 2004 and June 2007 that scored 2 or more on the University of York CRD DARE scale, as assessed by staff working on the Keele University and Durham University EBSE project.

• Barcelos, R.F. and Travassos, G.H. (2006) Evaluation approaches for Software Architectural Documents: A systematic Review. Ibero-American Workshop on Requirements Engineering and Software Environments (IDEAS), La Plata, Argentina. Topic type: Technology evaluation. Topic area: Software Architecture Evaluation Methods. Quality score: 2.5.
• Dyba, T., Kampenes, V.B. and Sjoberg, D.I.K. (2006) A systematic review of statistical power in software engineering experiments. Information and Software Technology, 48(8), pp 745-755. Topic type: Research trends. Topic area: Power in SE experiments. Quality score: 2.5.
• Glass, R.L., Ramesh, V. and Vessey, I. (2004) An Analysis of Research in Computing Disciplines. CACM, 47(6), pp 89-94. Topic type: Research trends. Topic area: Comparative trends in CS, IS and SE. Quality score: 2.
• Grimstad, S., Jorgensen, M. and Molokken-Ostvold, K. (2006) Software effort estimation terminology: The tower of Babel. Information and Software Technology, 48(4), pp 302-310. Topic type: Technology. Topic area: Cost Estimation. Quality score: 3.
• Hannay, J.E., Sjøberg, D.I.K. and Dybå, T. (2007) A Systematic Review of Theory Use in Software Engineering Experiments. IEEE Trans on SE, 33(2), pp 87-107. Topic type: Research trends. Topic area: Theory in SE experiments. Quality score: 2.5.
• Jørgensen, M. (2004) A review of studies on expert estimation of software development effort. Journal of Systems and Software, 70(1-2), pp 37-60. Topic type: Technology. Topic area: Cost Estimation. Quality score: 3.
• Jørgensen, M. and Shepperd, M. (2007) A Systematic Review of Software Development Cost Estimation Studies. IEEE Transactions on SE, 33(1), pp 33-53. Topic type: Research trends. Topic area: Cost Estimation. Quality score: 3.
• Kampenes, V.B., Dybå, T., Hannay, J.E. and Sjøberg, D.I.K. (2007) A systematic review of effect size in software engineering experiments. Information and Software Technology, in press. Topic type: Research trends. Topic area: Effect size in SE experiments. Quality score: 2.5.
• Mair, C. and Shepperd, M. (2005) The consistency of empirical comparisons of regression and analogy-based software project cost prediction. International Symposium on Empirical Software Engineering. Topic type: Technology evaluation. Topic area: Cost Estimation. Quality score: 2.
• Mendes, E. (2005) A systematic review of Web engineering research. International Symposium on Empirical Software Engineering. Topic type: Research trends. Topic area: Web Research. Quality score: 2.
• Moløkken-Østvold, K.J., Jørgensen, M., Tanilkan, S.S., Gallis, H., Lien, A.C. and Hove, S.E. (2004) Survey on Software Estimation in the Norwegian Industry. Proceedings Software Metrics Symposium. Topic type: Technology evaluation. Topic area: Cost Estimation. Quality score: 2.
• Petersson, H., Thelin, T., Runeson, P. and Wohlin, C. (2004) Capture-recapture in software inspections after 10 years research – theory, evaluation and application. Journal of Systems and Software, 72, pp 249-264. Topic type: Technology evaluation. Topic area: Capture-recapture in Inspections. Quality score: 2.5.
• Runeson, P., Andersson, C., Thelin, T., Andrews, A. and Berling, T. (2006) What do we know about Defect Detection Methods? IEEE Software, 23(3), pp 82-86. Topic type: Technology evaluation. Topic area: Testing methods. Quality score: 2.
• Sjoeberg, D.I.K., Hannay, J.E., Hansen, O., Kampenes, V.B., Karahasanovic, A., Liborg, N.K. and Rekdal, A.C. (2005) A survey of controlled experiments in software engineering. IEEE Transactions on SE, 31(9), pp 733-753. Topic type: Research trends. Topic area: SE experiments. Quality score: 2.
• Zannier, C., Melnick, G. and Maurer, F. (2006) On the Success of Empirical Studies in the International Conference on Software Engineering. ICSE06, pp 341-350. Topic type: Research trends. Topic area: Empirical studies in ICSE. Quality score: 3.5.
Appendix 3 Protocol for a Tertiary study of Systematic
Literature Reviews and Evidence-based Guidelines in IT and
Software Engineering
Barbara Kitchenham, Pearl Brereton, David Budgen, Mark Turner, John Bailey and
Stephen Linkman
Background
Following these papers, staff at the Keele University School of Computing and
Mathematics proposed a research project to investigate the feasibility of EBSE. This
proposal was funded by the UK Economics and Physical Science Research Council
(EPSRC). The proposal was amended to include the Department of Computer
Science, University of Durham when Professor David Budgen moved to Durham. The
EPSRC have now funded a joint Keele and Durham follow-on project (EPIC).
The purpose of the study described in this protocol is to review the current status of EBSE since 2004, using a tertiary study to review articles related to EBSE, in particular articles describing Systematic Literature Reviews (SLRs).
Research Questions
This study addresses the following research questions:
• How much EBSE activity has there been since 2004?
• What research topics are being addressed?
• Who is leading EBSE research?
• What are the limitations of current research?
Search Process
The search process is a manual search of specific conference proceedings and journal papers published since 2004. The nominated journals and conferences are shown in the following table.

Sources to be Searched

Source: Information and Software Technology (IST). Responsible: Kitchenham.
Inclusion criteria
Articles on the following topics, published between Jan 1st 2004 and June 30th 2007, will be included:
• Systematic Literature Reviews (SLRs), i.e. literature surveys with defined research questions, a search process, data extraction and data presentation.
• Meta-analyses (MA).
Exclusion Criteria
The following types of papers will be excluded:
• Informal literature surveys (no defined research questions, no search process, no defined data extraction or data analysis process).
• Papers discussing the process of EBSE.
• Papers not subject to peer review.
When an SLR has been published in more than one journal/conference, the most complete version of the survey will be used.
The relevant candidate and selected studies will be selected by a single researcher. The rejected studies will be checked by another researcher. We will maintain a list of candidate papers that were rejected, with reasons for the rejection.
Quality Assessment
Each SLR will be evaluated using the University of York Centre for Reviews and Dissemination (CRD) Database of Abstracts of Reviews of Effects (DARE) criteria (http://www.york.ac.uk/inst/crd/crddatabase.htm#DARE). The criteria are based on four questions:
• Are the review’s inclusion and exclusion criteria described and appropriate?
• Is the literature search likely to have covered all relevant studies?
• Did the reviewers assess the quality/validity of the included studies?
• Were the basic data/studies adequately described?
The four questions are scored as follows:
• Question 1: Y (yes), the inclusion criteria are explicitly defined in the paper, P
(Partly), the inclusion criteria are implicit; N (no), the inclusion criteria are not
defined and cannot be readily inferred.
• Question 2: Y, the authors have either searched 4 or more digital libraries and included additional search strategies, or identified and referenced all journals addressing the topic of interest; P, the authors have searched 3 or 4 digital libraries with no extra search strategies, or searched a defined but restricted set of journals and conference proceedings; N, the authors have searched up to 2 digital libraries or an extremely restricted set of journals.
• Question 3: Y, the authors have explicitly defined quality criteria and extracted
them from each primary study; P, the research question involves quality issues
that are addressed by the study; N no explicit quality assessment of individual
papers has been attempted.
• Question 4: Y, information is presented about each paper; P, only summary information is presented about individual papers; N, the results of the individual studies are not specified.
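The four DARE answers can be combined into a numeric quality score. The Y=1, P=0.5, N=0 mapping in the Python sketch below is an assumption on my part: it is consistent with the half-point totals shown in Appendix 2, but it is not stated explicitly in this protocol.

```python
# Sketch: turning the four DARE answers into a quality score, ASSUMING the
# scoring Y=1, P=0.5, N=0 (consistent with the half-point totals in
# Appendix 2, but not stated explicitly in the protocol).
SCORES = {"Y": 1.0, "P": 0.5, "N": 0.0}

def quality_score(answers):
    """answers: Y/P/N for inclusion criteria, search coverage,
    quality assessment and study description, in that order."""
    return sum(SCORES[a] for a in answers)

print(quality_score(["Y", "P", "Y", "P"]))  # 3.0
```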
Data Collection
Data Analysis
The data will be tabulated (ordered alphabetically by the first author name) to show
the basic information about each study. The number of studies in each major category
will be counted.
The tables will be reviewed to answer the research questions and identify any
interesting trends or limitations in current EBSE-related research as follows:
• Question 1 How much EBSE activity has there been since 2004? This will be
addressed by simple counts of the number of EBSE related papers per year.
• Question 2 What research topics are being addressed? This will be addressed by
counting the number of papers in each topic area. We will identify whether there are any specific topic areas that have a relatively large number of SLRs.
• Question 3 Who is leading EBSE research? We will investigate whether any specific organisations or researchers have undertaken a relatively large number of SLRs.
• Question 4 What are the limitations of current research? We will review the range of SE topics, the scope of SLRs and the quality of SLRs to determine whether there are any observable limitations. We will also investigate whether the quality of studies is increasing over time by plotting the quality score against the first publication date, and whether the quality of studies has been influenced by the SLR guidelines (by comparing the average quality score of SLRs that referenced the guidelines with the average score of SLRs that did not).
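A minimal Python sketch of this analysis follows. It assumes the extracted data have been collected into a CSV file; the file name and column names (year, topic, quality_score, cites_guidelines) are hypothetical, introduced here only for illustration.

```python
# Sketch of the planned analysis, assuming SLR data collected into a CSV
# with hypothetical columns: year, topic, quality_score, cites_guidelines.
import pandas as pd

slrs = pd.read_csv("slrs.csv")

print(slrs["year"].value_counts().sort_index())   # Q1: EBSE activity per year
print(slrs["topic"].value_counts())               # Q2: papers per topic area

# Q4: quality over time, and effect of referencing the guidelines
print(slrs.groupby("year")["quality_score"].mean())
print(slrs.groupby("cites_guidelines")["quality_score"].mean())
```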
Dissemination
The results of the study should be of interest to the software engineering community
as well as researchers interested in EBSE. For that reason we plan to report the results
on a Web page. We will also document the full results of the study in a joint Keele
University and University of Durham technical report. A short version of the study
will be submitted to IEEE Software.
References