
Original Article

Received 11 September 2012; revised 17 June 2013; accepted 2 July 2013. Published online 18 October 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/jrsm.1092

A tool to assess the quality of a meta-analysis
Julian P.T. Higgins,a* Peter W. Lane,b Betsy Anagnostelis,c
Judith Anzures-Cabrera,d Nigel F. Baker,e Joseph C. Cappelleri,f
Scott Haughie,g Sally Hollis,h Steff C. Lewis,i
Patrick Moneusej and Anne Whiteheadk

a School of Social and Community Medicine, University of Bristol, Bristol, UK
b Statistical Consultancy Group, GlaxoSmithKline R&D, Stevenage, UK
c Royal Free Hospital Medical Library, University College London, London, UK
d Biostatistics, Roche Products Ltd, Welwyn Garden City, UK
e Amgen Ltd, Cambridge, UK
f Biostatistics, Pfizer Inc, Groton, Connecticut, USA
g Primary Care Business Unit, Pfizer Global R&D, Sandwich, UK
h Global Medicines Development, AstraZeneca, Macclesfield, UK
i Centre for Population Health Sciences, University of Edinburgh Medical School, Teviot Place, Edinburgh EH8 9AG, UK
j Biometrics, Vifor Pharma, Glattbrugg, Switzerland
k Medical and Pharmaceutical Statistics Research Unit, Lancaster University, Lancaster, UK

*Correspondence to: Julian P.T. Higgins, School of Social and Community Medicine, University of Bristol, Canynge Hall, 39 Whatley Road, Bristol BS8 2PS, UK.
E-mail: [email protected]
Background: Because meta-analyses are increasingly prevalent and cited in the medical literature, it is
important that tools are available to assess their methodological quality. When performing an empirical
study of the quality of published meta-analyses, we found that existing tools did not place a strong
emphasis on statistical and interpretational issues.
Methods: We developed a quality-assessment tool using existing materials and expert judgment as a
starting point, followed by multiple iterations of input from our working group, piloting, and discussion.
After we had used the tool in our empirical study, agreement on four key items in the tool was measured using weighted kappa coefficients.
Results: Our tool contained 43 items divided into four key areas (data sources, analysis of individual
studies, meta-analysis methods, and interpretation), and each area ended with a summary question. We
also produced guidance for completing the tool. Agreement between raters was fair to moderate.
Conclusions: The tool should usefully inform subsequent initiatives to develop quality-assessment tools for
meta-analysis. We advocate use of consensus between independent raters when assessing statistical
appropriateness and adequacy of interpretation in meta-analyses. Copyright © 2013 John Wiley & Sons, Ltd.

Keywords: meta-analysis; systematic reviews; quality; bias

1. Introduction
Meta-analyses are increasingly prevalent in the medical literature and are highly cited (Patsopoulos et al., 2005). However, empirical studies have raised concerns about their quality, including cross-speciality investigations by Jadad et al. (1998), Olsen et al. (2001) and Shea et al. (2002). The UK professional association PSI (Statisticians in the Pharmaceutical Industry) set up an expert group in 2009 to investigate the quality of
published meta-analyses, with particular reference to industry sponsorship and to any recent changes in
quality. The study is reported in detail in a companion paper (Lane et al., 2013). Because the available tools
for assessing the quality of meta-analyses (such as those by Oxman and Guyatt (1988) and Shea et al.
(2007)) did not make a detailed evaluation of the statistical methods, the Expert Group developed a new
assessment tool, which provides a qualitative assessment of statistical appropriateness and adequacy of
interpretation. In this paper, we describe the development of the tool and present the tool as it was used
in our empirical study.

2. Methods
2.1. Development
The tool sought to assess the quality of a meta-analysis addressing the efficacy and/or safety of a
pharmaceutical product. It was based on questions from the AMSTAR tool (Shea et al., 2007), the Cochrane
Handbook for Systematic Reviews of Interventions (Higgins and Green, 2008), and contributions from each
member of the Expert Group. Suggestions for items to include were solicited from all members of the
expert group (the co-authors of this report) and collated alongside the items in the AMSTAR tool by two
members of the group (JPTH and SCL). The list was reviewed by all the members in a series of iterations
that led to the development of our first draft of the assessment tool. All proposed changes were discussed
and agreed during teleconferences of the full expert group. The resulting tool was piloted by pairs of
assessors using two industry-supported meta-analyses and two non-industry-supported meta-analyses from
2005 or 2006. Further amendments were made in the light of any difficulties encountered, all agreed by
the full expert group during teleconferences.
An integral part of the assessment tool was the development of a document containing guidance to aid
members of the group while assessing a meta-analysis. The aim of the guidance document was to facilitate
consensus between assessors. For each question in the assessment tool, the guidance contains different points
to be taken into consideration when answering a question. Initial preparation of the guidance document was
divided among members of the expert group, and these were collated by the first author. The document went
through several iterations before and after the piloting process.

2.2. Application
In the main study, a pair of expert group members used the tool independently to assess each paper (Lane et al.,
2013). We calculated measures of agreement for key questions in the tool for a subset of 26 papers, which had
been evaluated by both an academic statistician and an industry statistician. Weighted kappa coefficients (which
correct for chance agreement) were computed using kap in STATA software (StataCorp, 2009). We used weights 1,
0.75, 0.5, 0.25, and 0 for discrepancies of 0, 1, 2, 3, and 4 ordered categories, respectively. The corresponding raw
proportions of agreement, which do not correct for chance agreement, were also calculated, both with and
without these weights.
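To make the weighting scheme concrete, the following sketch computes raw agreement, weighted agreement and a weighted kappa using exactly these weights. It is an illustration only (not the Stata kap command used in the study), and the two rating vectors are hypothetical.

```python
# A minimal sketch of the agreement statistics described above: raw and
# weighted proportions of agreement, and a weighted kappa using weights
# 1, 0.75, 0.5, 0.25 and 0 for discrepancies of 0-4 ordered categories.
# The two rating vectors below are hypothetical, not the study data.
from collections import Counter

CATEGORIES = ["No", "Probably No", "Unsure", "Probably Yes", "Yes"]

def weight(i, j):
    """Linear agreement weight for a discrepancy of |i - j| categories."""
    return 1.0 - abs(i - j) / (len(CATEGORIES) - 1)

def weighted_kappa(rater1, rater2):
    n = len(rater1)
    idx = {c: k for k, c in enumerate(CATEGORIES)}
    a = [idx[c] for c in rater1]
    b = [idx[c] for c in rater2]

    # Observed raw and weighted agreement.
    raw = sum(x == y for x, y in zip(a, b)) / n
    p_obs = sum(weight(x, y) for x, y in zip(a, b)) / n

    # Expected weighted agreement under independence of the two raters,
    # computed from the marginal distribution of each rater's answers.
    m1, m2 = Counter(a), Counter(b)
    p_exp = sum(
        (m1[i] / n) * (m2[j] / n) * weight(i, j)
        for i in range(len(CATEGORIES))
        for j in range(len(CATEGORIES))
    )
    kappa = (p_obs - p_exp) / (1.0 - p_exp)
    return raw, p_obs, kappa

# Hypothetical summary-question ratings from an academic and an industry rater.
academic = ["Yes", "Probably Yes", "Unsure", "Probably No", "Yes", "Probably Yes"]
industry = ["Probably Yes", "Probably Yes", "Probably No", "Probably No", "Unsure", "Yes"]
print(weighted_kappa(academic, industry))
```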

3. Results
The tool is presented in Box 1 and the guidance for completing it in Box 2. The tool was designed for the
assessment of a single published report of a meta-analysis. Opening questions were included to collect
information about the report and the clinical question being addressed, including the drug, the comparator
intervention, and the disease or condition. The assessment tool contained 43 items in four categories, with
an overall view for each category being captured in a summary question. The categories (and summary
questions) were (A) data sources (Were the review methods adequate such that biases in location and
assessment of studies were minimized or able to be identified?); (B) analysis of individual studies by the
meta-analyst (Were the individual studies analyzed appropriately and without avoidable bias?); (C) general
meta-analysis (Were the basic meta-analysis methods appropriate?); and (D) reporting and interpretation
(Are the conclusions justified and the interpretation sound?). Each summary assessment had the five
response categories ‘Yes’, ‘Probably Yes’, ‘Unsure’, ‘Probably No’, and ‘No’. Some questions gathered factual
information beyond that directly relevant to assessment of quality (for example, the type of statistical
methods used and whether they were frequentist or Bayesian).


Results of the reliability assessment are presented in Table 1. Unweighted raw agreement ranged from 35% (summary question C) to 42% (summary question A), and weighted raw agreement from 71% (summary question A) to 79% (summary question D). Weighted kappa measures ranged from 0.30 (summary question B) to 0.45 (summary question D). According to the classifications of Landis and Koch (1977), these correspond to 'fair' or 'moderate' agreement.

Table 1. Measures of agreement on items in the tool when applied to 26 papers, each assessed by one rater from industry and one rater from academia.

Summary question | Raw agreement | Weighted agreement | Weighted kappa
(A) Data sources (Were the review methods adequate such that biases in location and assessment of studies were minimized or able to be identified?) | 42% | 71% | 0.31
(B) Analysis of individual studies by the meta-analyst (Were the individual studies analysed appropriately and without avoidable bias?) | 36% | 78% | 0.30
(C) General meta-analysis (Were the basic meta-analysis methods appropriate?) | 35% | 72% | 0.34
(D) Reporting and interpretation (Are the conclusions justified and the interpretation sound?) | 38% | 79% | 0.45

4. Discussion
We have described a tool we developed to assess the methodological quality of a meta-analysis, with an
emphasis on the evaluation of statistical and interpretational issues. The tool was developed for a specific
purpose, namely to evaluate the quality of a set of published meta-analyses of pharmaceutical drugs, and
in particular, to enable the comparison of industry-sponsored with academic-sponsored meta-analyses
(Lane et al., 2013). The tool is more sophisticated (in length and content) than the tools previously
available (Oxman and Guyatt, 1988; Shea et al., 2007). More recently, a checklist has been developed to
provide a series of pertinent questions when evaluating an evidence synthesis (Ades et al., 2012).
Our quantitative evaluation of agreement, as well as our experiences in applying the tool, demonstrates
fairly substantial variation in assessments within the expert group of raters. Our expert group was composed entirely of medical statisticians, all with some knowledge of meta-analysis, although the extent of this knowledge and experience varied. It is critical that a meta-analysis is driven by a clear
scientific question, and our tool did not incorporate detailed consideration of this, largely because of the
restricted scope of review questions included, and the nature of the assessors in our study. In general,
we consider it important that assessors of the quality of a meta-analysis are knowledgeable about the
methodology and also about the content area, suggesting the use of two or more assessors with different
perspectives. In our main study, the two independent assessors met by teleconference to resolve
discrepancies, and we believe the consensus assessments that resulted from this process were reasonable
summaries of the quality of the meta-analyses we reviewed.
Our aim in this paper is to make the tool available in the hope that it can usefully inform subsequent
initiatives to develop quality-assessment tools for meta-analyses. The tool will not be suitable in its current
form for other purposes. For instance, question 4 is specific to our project; a more general assessment of conflict of interest, rather than our focus on pharmaceutical sponsorship, would typically be needed. We
realized during the process of applying the tool that some of the guidance can be further improved. We
also recognize that some questions may be unnecessary and that some issues are missing. For example,
we did not ask about changes in effects over time, which may affect the validity of a meta-analysis, and
we did not seek a critical evaluation of the scientific question being asked. We advocate use of consensus
between independent raters when assessing statistical appropriateness and adequacy of interpretation in
meta-analyses.

Financial disclosures
The authors are employees of the pharmaceutical industry, academia, or public sector research institutions, as
detailed in their affiliations. No funds were received by any author explicitly for this project. GlaxoSmithKline
teleconference facilities were used for meetings.


Box 1: Quality assessment tool for meta-analyses



Box 2: Guidance for completing assessment form


Section A. Data sources

1. Eligibility criteria were stated and suitably specific for (check all that apply) …

• Eligibility criteria for trials should be


○ unambiguous;
○ operationalizable;
○ appropriate to the objectives of the review or meta-analysis.
• Check boxes if you think you could use the stated criteria to classify any study as included or excluded,
and end up with studies that would reasonably address the objectives of the review/meta-analysis.
• It is often not necessary to include outcomes as criteria for inclusion: Check the Outcomes box if this is
the case and outcomes are not listed as criteria for inclusion.
• Only meta-analyses of (or including) randomized trials are eligible for the project. Do not proceed further
if these are not part of the eligibility criteria.

2. Were any further restrictions placed on eligibility of studies or reports?


[Yes / No / Unclear]

• Studies (and not reports) are the units of interest in a meta-analysis.


• This item mainly addresses restrictions on reports (most criteria related to studies are covered in item 1).
• Common restrictions on reports include unpublished reports, reports in languages other than English
and duplicate or secondary publications.
• It is unwise to discard secondary reports about a study. If this is done, insert Comment.

3. Data for meta-analysis were sought from (check all that apply) …

• Published literature includes books and journal articles.


• Online repositories include data banks and summaries found on the Internet.
• Correspondence with trialists includes data summaries from trial investigators who have not made
these summaries public or, if made public, who have supplemented data found elsewhere (e.g., in the
published literature).
• In-house IPD refers to individual patient data housed within the sponsoring institution whose
intervention is the subject of the meta-analysis.
• Others’ IPD refers to individual patient data obtained from outside of the institution whose intervention
is the subject of the meta-analysis.

4. Were data disclosed by industry sought specifically?


[Yes / No / Unclear / Not relevant]

• From any company, not including publications or correspondence.


• Answer Not relevant if data were in-house.

5. The search for trials used (check all that apply) …

• Grey literature refers to documents in print and electronic formats that are unpublished, have limited
distribution, and/or are not included in a bibliographical retrieval system.
• Reference lists are bibliographies of documents previously retrieved.
• Correspondence with industry refers to the industry sponsor of studies in the meta-analysis.
• Other correspondence refers to individuals, institutions or organizations that are not the sponsor of
studies in the meta-analysis.

6. Which bibliographic databases are mentioned?

7. The search strategy for bibliographic databases was…


[Not presented / Partially presented / Presented and comprehensive / Presented and not comprehensive]

• Partially presented means either


○ some or all search terms were presented but the full strategy was not; or
○ the overall strategy was described (with boolean operators) but all search terms were not.


• Presented means the full search strategy (complete with any boolean operators) is presented for at least one database.
• Comprehensive means appropriate use of
○ full list of likely synonyms;
○ text words and subject headings (e.g. MeSH);
○ sound use of boolean logic.

8. Was the search for evidence reasonably comprehensive?


[Yes / No / Unclear]

• Yes: at least three categories, one of which must be electronic, and any two others (e.g. handsearching,
register) are reported. Key words and subject headings must be stated. The information sources should
be specified in detail (e.g. databases, registers, personal files, expert informants, agencies, handsearching),
along with any restrictions (years considered, publication status, language of publication).
• No: clearly less than required for a ‘Yes’.

9. Study selection was done …


[By one person / By one person, checked by another / By two or more people independently / Unstated or
unclear / Not relevant (e.g. in-house data)]

• This refers to the process of going from the complete list of ‘hits’ from the search to the decisions on
which studies are included in the review or meta-analysis.

10. Data extraction from published reports was done…


[By one person / By one person, checked by another / By two or more people independently / Unstated or
unclear / Not relevant (e.g. in-house data)]

• There should be at least two independent data extractors, and a consensus procedure for disagreements
should take place. If they did this but you have concerns about what they did, insert Comment.

11. Was risk of bias (or quality) assessed for each included study?
[Yes / No / Unclear]

• This means an explicit attempt to understand potential limitations (internal validity) of the studies.
• This does not include using bias-related criteria for eligibility in the meta-analysis (e.g. studies had to be
double blind or randomized).

12. Risk of bias (or quality) was assessed using (check all that apply)…

• A scale has a series of questions or criteria that yield a single numeric score (e.g. Jadad scale awards points
out of 5).
• A checklist has a series of factual questions producing a series of short answers, usually ‘Yes’, ‘No’ or
‘Unclear’, with no overall score.
• An item-by-item assessment has a series of questions or probes requiring longer factual answers (e.g. descriptions of what happened) or a judgement by the meta-analyst.
• Informal quality assessment uses none of these (e.g. just a narrative summary of strengths and
limitations of included trials).

13. Risk of bias (quality assessment) or eligibility criteria included (check all that apply) …

• Record which domains were assessed, not the findings. (See Question 47 for the interpretation of outcome).
• This includes criteria for including studies (e.g. studies had to be blinded, or randomized)
• Generation of allocation sequence just covers generation of allocations (use of randomization).
• Concealment of allocation sequence covers the ability for trialists to manipulate the randomized sequence or recruitment of participants on the basis of the next assignment (e.g. use of consecutively numbered, sealed, opaque envelopes; telephone randomization; central computer etc.).
• Note: The commonly-used Jadad scale does not include allocation concealment.
• Blinding includes any sort of blinding after enrolment (e.g. outcome assessors, participants, trial
personnel, or statement of ‘double blind’).
• Other category may include many domains (e.g. baseline imbalance, funding source, early stopping,
power calculation). If more than three, state how many.
• If a published quality assessment tool is used, but details of its components are not provided, state the
tool under Comment (with reference if not well known).


Summary judgement (Section A): nature of the studies identified

14. Were the review methods adequate such that biases in location and assessment of studies were
minimized or able to be identified?
[Yes / Probably yes / Unclear / Probably no / No / Not applicable]

• Taking into account the choice of eligible trials, the ways in which the trials were sought and selected, the
collection of data and the assessment of whether the results of the trials should be believed, would the
reviewers end up with
○ the right sorts of trials to address the objectives;
○ all reasonably locatable trials;
○ an understanding of the key limitations of included studies?
• Answer Yes if the eligibility criteria were sound and the review methods adequate.
• Answer No if
○ there were serious problems with the eligibility criteria (e.g. they appear to have been derived
after conducting the full search);
○ important studies are likely to have been missed through an inadequate search;
○ study selection and/or data collection methods were seriously inadequate; or
○ no attempt was made to understand limitations of the included trials.

Section B. Analysis of individual studies by the meta-analysts

15. Are adequate methods used to address missing outcome data?


[Yes / No / Unclear / Not relevant]

Check whether some trials have excluded some patients from their analyses. Have the authors added back in information on missing patients where they can?

• Answer Yes if missing data were reinstated by the review authors, or imputed and accompanied by
sensitivity analyses.
• Answer Unclear if it is not clear how missing outcomes have been dealt with, or if data were imputed without further evaluation (e.g. imputed as successes or as failures).
• Answer No if there are missing outcomes but they have not been dealt with.
• Answer Not relevant if it is clear that there were no missing outcome data (very unlikely).

16. Cross-over trials were …


[Not found or not mentioned / Included appropriately / Included inappropriately / Explicitly excluded / Unclear]

• Answer Included appropriately if:


○ the design is appropriate to the clinical context (the condition is chronic and stable over time without treatment, and the intervention temporarily alleviates the symptoms or condition); AND
○ first-period data are available from all cross-over trials, OR
○ a paired analysis was done for each cross-over trial (e.g. based on means of differences). This may require imputation of standard deviations or correlation coefficients (see the sketch after this item).
• Answer Included inappropriately if:
○ the design is inappropriate to the clinical context; OR
○ no paired analysis was done for at least one cross-over trial.
• Answer Unclear if first-period data are sought but not available from all cross-over trials, or otherwise
unclear.
• Answer Not found if no cross-over trials were found, irrespective of any appropriate or inappropriate
plans for dealing with them.
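A minimal sketch of the paired-analysis option referred to above: reconstructing the standard error of the within-patient mean difference for one cross-over trial from period-level summaries, imputing a within-patient correlation. The correlation of 0.5 and the trial summaries are illustrative assumptions, not values from any real trial.

```python
# Reconstruct a paired (within-patient) mean difference and its SE for a
# cross-over trial when only period-level means and SDs are reported.
from math import sqrt

def paired_mean_difference(mean_a, sd_a, mean_b, sd_b, n, corr=0.5):
    """Mean difference and its SE for one cross-over trial with n patients."""
    diff = mean_a - mean_b
    sd_diff = sqrt(sd_a**2 + sd_b**2 - 2.0 * corr * sd_a * sd_b)
    se_diff = sd_diff / sqrt(n)
    return diff, se_diff

# Treating the two periods as independent groups gives a larger (wrong) SE.
diff, se_paired = paired_mean_difference(12.1, 5.0, 10.4, 5.2, n=30, corr=0.5)
se_unpaired = sqrt(5.0**2 / 30 + 5.2**2 / 30)
print(diff, se_paired, se_unpaired)
```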

17. Cluster-randomized trials were …


[Not found or not mentioned / Included appropriately / Included inappropriately / Explicitly excluded / Unclear]

In cluster-randomized trials, groups of individuals rather than individuals are randomized to different
interventions (e.g. schools, villages, medical practices or families). A common error found in the
literature is that these studies are incorrectly analysed as though the unit of allocation had been
the individual participants.


• Answer Included appropriately if:


○ analyses used clusters as the unit of analysis (although this may reduce power);
○ analyses that properly account for the clustering were taken from the individual studies (or performed using IPD) (e.g. multi-level or mixed models, GEE); OR
○ results were adjusted to account for intracluster correlation (typically by multiplying the variance by a 'design effect'; see the sketch after this item). This may require imputation of intraclass correlation coefficients (ICC).
• Answer Included inappropriately if:
○ data were included in the meta-analysis ignoring the clustering, OR
○ you judge imputed ICCs to be inappropriate.
• Answer Not found if no cluster-randomized trials were found, irrespective of any appropriate
or inappropriate plans for dealing with them.
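A minimal sketch of the design-effect adjustment mentioned above: the variance of a naively analysed cluster trial is inflated by 1 + (m - 1) * ICC, where m is the average cluster size. The ICC of 0.02 and the trial summaries are illustrative assumptions only.

```python
# Inflate the SE of a cluster-randomized trial that was analysed as if
# individuals had been randomized, using an (imputed) ICC.
from math import sqrt

def design_effect(avg_cluster_size, icc):
    return 1.0 + (avg_cluster_size - 1.0) * icc

def adjust_for_clustering(effect, se, avg_cluster_size, icc):
    """Multiply the variance by the design effect, i.e. the SE by its square root."""
    de = design_effect(avg_cluster_size, icc)
    return effect, se * sqrt(de)

effect, se_naive = -0.30, 0.10   # hypothetical log odds ratio and naive SE
effect, se_adj = adjust_for_clustering(effect, se_naive, avg_cluster_size=25, icc=0.02)
print(design_effect(25, 0.02), se_naive, se_adj)
```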

18. Other study designs were …

[Not found or not mentioned / Included appropriately / Included inappropriately / Explicitly excluded / Unclear]

• Other designs include non-randomized studies, sequential trials, dose-escalation designs, adaptive
designs. Assess whether the analysis of such studies was appropriate.

Summary judgement (Section B): data from individual studies

19. Were the individual studies analysed appropriately and without avoidable bias?
[Yes / Probably yes / Unclear / Probably no / No / Not applicable]

• Do you have concerns about the data from the included trials in the meta-analyses that are sufficiently
strong for you to be doubtful about the authors’ conclusions?
• Unusual points in forest plots and tables may alert you to errors. A common example is a study having a much
smaller standard deviation than the others (the standard error of the mean may have been used by mistake).
• Numbers (of studies and participants) in results tables should match your expectations based on
methods and other text.

Section C. Meta-analysis

20. Were comparisons sensible within each meta-analysis?


[Yes / No / Unclear]

• This question seeks to examine whether the comparisons made are logical and in keeping with the
objectives of the meta-analysis. Clinical expertise is not expected, but common-sense judgement may
be used (and explained). Sometimes it is possible to tell that inappropriate pooling has taken place, for
example by mixing placebo-controlled studies with studies using active treatments as a control. Unless
explicitly justified, such a thing should be scored No.
• Answer No if
○ there was important variation in the types of control groups (e.g. different types of drugs, or drug
and behavioural interventions);
○ confounded comparisons were included with unconfounded comparisons (e.g. A vs B+C does not
give a sensible result to a question about A vs B. However, A+C vs B+C is OK, since C is balanced
across arms (unconfounded comparison).)

21. Were outcomes and time points sensible within each meta-analysis?
[Yes / No / Unclear]

• This question seeks to identify situations of gross pooling errors, such as combining outcomes measured
on different scales without standardization (e.g. weight loss in pounds with weight loss in grams), or
mixing different statistical effect measures. Such errors are expected to be uncommon. The question
should not require clinical expertise, but some common sense judgements about clinical similarity may
be made (and explained).
• Answer No if
○ there was a wide range of length of follow-up (e.g. 1 day to 1 year), or if you have strong concerns
about the similarity of the outcomes;


○ odds ratios were combined with risk ratios, or similar issues.


22. Do the authors avoid double-counting of individuals?


[Yes / No / Unclear]

A common error is to double-count individuals in a meta-analysis. There are two principal ways in which this
can occur. First, a treatment group (e.g. a control group) may be included more than once. For example, a
trial of low-dose aspirin vs high-dose aspirin vs placebo could contribute two entries: low-dose aspirin vs
placebo; and high-dose aspirin vs placebo. This is inappropriate unless measures are taken to address the
correlation between these outcomes. Second, data from multiple reports of the same study may be included
in the same meta-analysis, even though they relate to the same individuals. It is often difficult to tell whether
the second issue has occurred (do not spend large amounts of time in detective work).

• Check that the authors are clear that studies, rather than reports, are the units of interest in the meta-
analysis. Check for overt signs of including the same study more than once.
• Check that some aspects of studies have not been double-counted. For example, that one treatment arm
has been included more than once or that both short term data/treatment and long term data/treatment
have been included. It is acceptable to include a treatment group more than once if:
○ sample sizes have been split; or
○ correlation is accounted for (e.g. in multivariate analysis)
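As a concrete illustration of the sample-splitting option above, the sketch below divides a shared placebo arm between two dose comparisons from a three-arm trial so that no placebo patient is counted twice. All counts are hypothetical.

```python
# Split a shared control arm approximately evenly across the comparisons
# that use it (e.g. low dose vs placebo and high dose vs placebo).
def split_control(events, total, n_comparisons=2):
    """Return (events, total) pairs, one per comparison, summing to the originals."""
    return [(events // n_comparisons + (1 if i < events % n_comparisons else 0),
             total // n_comparisons + (1 if i < total % n_comparisons else 0))
            for i in range(n_comparisons)]

placebo_events, placebo_total = 15, 101      # hypothetical shared placebo arm
low_dose = (10, 100)                         # (events, patients), hypothetical
high_dose = (8, 100)

placebo_low, placebo_high = split_control(placebo_events, placebo_total)
print("low dose vs placebo:", low_dose, placebo_low)
print("high dose vs placebo:", high_dose, placebo_high)
```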

23. Presence of statistical heterogeneity was assessed by (check all that apply)

• Visualization: Heterogeneity was discussed or reported, but only using phrases such as ‘the forest plot
showed a large degree of heterogeneity’.
• Statistical test. Includes any mention of use of P values or chi-squared tests for heterogeneity.
• Other: Includes estimation and interpretation of among-study variance (tau-squared) – please give details.

24. The synthesis methods used in the paper included (check all that apply)…

• Meta-regression is the use of study-level covariates. IPD analyses may or may not involve meta-regression.

25. Synthesis methods were mainly


[Classical - basic / Classical - advanced / Bayesian]

• Answer Classical - basic for weighted averages, Peto method, Mantel-Haenszel, DerSimonian and Laird,
weighted linear regression, use of RevMan or metan in Stata.
• Answer Classical - advanced for mixed models, random-effects meta-regression, or extensions to basic
methods (e.g. Hartung-Knapp, Biggerstaff-Tweedie, Hardy-Thompson)

26. Was a sensible strategy used to address statistical heterogeneity in meta-analyses?


[Yes / Unclear / No / No heterogeneity observed]

• Judge whether any heterogeneity that was observed was appropriately accounted for in meta-analyses.
• A fixed-effect meta-analysis might be considered appropriate for combining results from a suitable set of
studies when there is no substantial heterogeneity. This is often assessed by calculating a Q-statistic and
testing for statistical significance, but such a two-stage strategy is inappropriate. A test based on Q has
little power in a meta-analysis of few studies even when substantial clinical heterogeneity is present,
whereas it is likely to have high power in a meta-analysis of many studies even when the magnitude
of heterogeneity is not of clinical interest. If there is substantial heterogeneity, a fixed-effect meta-
analysis will give an overly precise combined estimate.
• A random-effects meta-analysis is generally regarded as preferable when there is heterogeneity.
• Meta-regression is used to investigate and provide plausible explanation of heterogeneity. A fixed-effect
meta-regression might be appropriate, for example, if there is a categorization of the studies in terms of
a known and clinically plausible covariate that accounts for most of the heterogeneity. A random-effects
meta-regression is usually more appropriate as it allows for heterogeneity not explained by the covariate(s).
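To make the basic machinery of items 25 and 26 concrete, the following sketch computes an inverse-variance fixed-effect pool, Cochran's Q, the DerSimonian-Laird estimate of the among-study variance (tau-squared) and the corresponding random-effects pool. The study effects and standard errors are hypothetical.

```python
# Basic fixed-effect and DerSimonian-Laird random-effects meta-analysis.
def fixed_effect(effects, ses):
    w = [1.0 / se**2 for se in ses]
    est = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return est, (1.0 / sum(w)) ** 0.5

def dersimonian_laird(effects, ses):
    w = [1.0 / se**2 for se in ses]
    est_fixed, _ = fixed_effect(effects, ses)
    q = sum(wi * (yi - est_fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                                      # DL tau-squared
    # Random-effects weights add tau2 to each study's variance.
    w_re = [1.0 / (se**2 + tau2) for se in ses]
    est_re = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_re = (1.0 / sum(w_re)) ** 0.5
    return q, tau2, est_re, se_re

effects = [-0.35, -0.10, -0.42, 0.05, -0.25]   # hypothetical log odds ratios
ses     = [0.12, 0.20, 0.15, 0.25, 0.18]
print(fixed_effect(effects, ses))
print(dersimonian_laird(effects, ses))
```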

27. Were subgroups compared appropriately?


[Yes / Unclear / No / Not applicable]

• The appropriate way to do a subgroup analysis is to compare the magnitude of the treatment effect
between subgroups (or look at a treatment by subgroup interaction), not to examine the statistical
significance of the treatment effect within each subgroup.


• Answer Not applicable if there is no subgroup analysis.
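As an illustration of the comparison described in item 27, the sketch below tests the difference in treatment effect between two independent subgroups (a treatment-by-subgroup interaction) rather than the significance of the effect within each subgroup. The subgroup estimates are hypothetical and a normal approximation is used.

```python
# Z test for a treatment-by-subgroup interaction with two independent subgroups.
from math import sqrt, erf

def interaction_test(est1, se1, est2, se2):
    diff = est1 - est2
    se_diff = sqrt(se1**2 + se2**2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution.
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return diff, se_diff, z, p

# e.g. hypothetical log odds ratios in two subgroups (men and women).
print(interaction_test(-0.40, 0.15, -0.10, 0.20))
```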


28. Were any subgroup analyses apparently over-interpreted?


[Yes / Unclear / No / Not applicable]

• Subgroup analyses should be pre-specified, in a protocol or a statistical analysis plan written before
examining any results by randomized treatment, and any subgroups that are not pre-specified should
be clearly labelled as post hoc analyses, and interpreted very cautiously.
• Subgroup analyses should be of factors recorded prior to randomization.
• If there are many subgroup analyses, they should be interpreted very cautiously, as the more subgroup
analyses that are done, the more likely it is that a spurious treatment by subgroup interaction will be
found. It is hard to provide a definite estimate of many, but the more subgroups there are, the more
cautious the interpretation should be.
• Answer Yes if any of the above three points does not hold.
• Answer Unclear if any of the above three points is unclear.
• Answer No if all of the above three points hold.
• Answer Not applicable if there is no subgroup analysis.

29. Potential for reporting bias or small study effects was assessed using (check all that apply) …

• If Kendall’s tau has been used, then check Begg-Mazumdar.


• Other funnel plot asymmetry tests include Harbord 2006, Peters 2006, Schwarzer 2007 and Rücker 2008.
• Key citations for the main methods are as follows:
○ Egger test: Egger et al. (1997) Bias in meta-analysis detected by a simple, graphical test. BMJ, 315:629–634
○ Begg-Mazumdar: Begg CB, Mazumdar M. (1994) Operating characteristics of a rank correlation test
for publication bias. Biometrics;50:1088–1101
○ Trim and Fill: Duval S, Tweedie R. (2000) Trim and fill: a simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 56, 455–463
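As an illustration of the regression-based asymmetry tests cited above, the sketch below implements a basic Egger-type test: the standardized effect (estimate divided by its standard error) is regressed on precision, and an intercept well away from zero suggests small-study effects. The data are hypothetical, and the inference uses a normal approximation rather than the t distribution used in practice.

```python
# Basic Egger-type regression test for funnel plot asymmetry.
from math import sqrt

def egger_test(effects, ses):
    y = [e / s for e, s in zip(effects, ses)]   # standardized effects
    x = [1.0 / s for s in ses]                  # precisions
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r**2 for r in resid) / (n - 2)
    se_intercept = sqrt(s2 * (1.0 / n + xbar**2 / sxx))
    return intercept, se_intercept, intercept / se_intercept

effects = [-0.8, -0.5, -0.45, -0.3, -0.28, -0.2]   # hypothetical log odds ratios
ses     = [0.40, 0.30, 0.28, 0.15, 0.12, 0.08]
print(egger_test(effects, ses))
```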

30. Was the choice of effect size appropriate (e.g. MD vs SMD)?


[Yes / Unclear / No / Not applicable]

• For continuous (and normally distributed) data, usual effect sizes are the mean difference (MD) and the
standardized mean difference (SMD) which is the MD divided by the standard deviation (SD). The SD
might be estimated from pooling the SDs of the two treatments (so-called Cohen’s d) or the SD of the
‘control’ group (after Glass). There are also bias-correction multiplication factors (e.g. Hedges) and large
sample approximations.
• Check whether the SMD has been used where the MD would be more appropriate. For example, in areas where the clinical relevance of effect sizes is well known and understood on the original measurement scale (e.g. blood pressure in mmHg), presenting the SMD may hinder rather than help interpretation. On the other hand, the SMD may be more appropriate for questionnaire data, where interpretation of the MD may be poorly understood.
• There should be consistency across studies (and in the combined analysis, if performed) in the choice of
the point estimate for the effect size.
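To make the variants above concrete, the following sketch computes the MD, Cohen's d (pooled SD), Glass's delta (control-group SD) and Hedges' small-sample correction for a single two-arm trial; the summaries are hypothetical.

```python
# Standardized mean difference variants for one trial.
from math import sqrt

def smd(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    md = mean_t - mean_c
    sd_pooled = sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    d = md / sd_pooled                      # Cohen's d
    delta = md / sd_c                       # Glass's delta (control-group SD)
    j = 1.0 - 3.0 / (4.0 * (n_t + n_c - 2) - 1.0)
    g = j * d                               # Hedges' bias-corrected SMD
    return md, d, delta, g

print(smd(mean_t=7.5, sd_t=2.1, n_t=48, mean_c=9.0, sd_c=2.4, n_c=50))
```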

31. Was skew of data a potential problem that was not appropriately addressed?
[Yes / Unclear / No / Not applicable]

• If means are less than 1 standard deviation from the limit of a scale, then there is probably skew.
• Skew is not a problem if the studies are large.
• If data are not normally distributed (e.g. percentage change from baseline) then median differences and
confidence intervals may be presented (Hodges-Lehmann).
• Transformation of the original outcome data may substantially reduce skew. Sometimes the analysis is
conducted using log-transformed values. Ratios of geometric means plus confidence intervals may then
be presented (e.g. for pharmacokinetic data). Log-transformed and untransformed data cannot be mixed
in a meta-analysis.

32. Were methods appropriate to rare events / sparse data?


[Yes / Unclear / No / Not applicable]
When analysing a binary response, meta-analysis techniques can run into numerical and inferential problems
if response-rates are either very low (say <5%) or very high (>95%). A major issue is the handling of studies
where the response-rate is observed at an extreme (0% or 100%) in one or both treatment arms.


• It is generally inappropriate to exclude all studies with an extreme observation, as this is likely to bias
the results.
• Methods such as inverse-variance weighting for risk-ratios or odds-ratios should not be used.
• For risk-differences, inverse-variance weighting can also lead to inappropriately high weights from some
studies with accidentally low (but not zero) response-rates.
• It is also generally inappropriate to adjust the observations with so-called ‘continuity corrections’, as the results
are likely to depend on the precise size of the chosen correction. It is better to use methods that can handle
extreme observations in one or other of the arms, such as the Peto method for odds ratios and the Mantel-
Haenszel method for risk-differences. However, the Peto method is also known to give biased results if the treatment effect is substantial (odds-ratio >2) or the sizes of the study arms are very unbalanced (>8:1).
• Some methods (such as logistic regression) may give estimates of variability based on asymptotic
behaviour, which may not be justified if the number of observations is not sufficiently large.
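As an illustration of the Peto one-step method mentioned above, the sketch below pools observed-minus-expected events (O - E) and hypergeometric variances across trials with rare events, which copes with zero cells in one arm without continuity corrections. The trial counts are hypothetical.

```python
# Peto one-step pooled odds ratio for rare events.
from math import exp

def peto_pooled_or(trials):
    """trials: list of (events_trt, n_trt, events_ctl, n_ctl) tuples."""
    sum_o_minus_e, sum_v = 0.0, 0.0
    for a, n_t, c, n_c in trials:
        n = n_t + n_c
        m = a + c                                    # total events in the trial
        e = n_t * m / n                              # expected events, treatment arm
        v = n_t * n_c * m * (n - m) / (n**2 * (n - 1))   # hypergeometric variance
        sum_o_minus_e += a - e
        sum_v += v
    log_or = sum_o_minus_e / sum_v
    se = (1.0 / sum_v) ** 0.5
    return exp(log_or), log_or, se

trials = [(0, 120, 3, 118), (1, 200, 4, 205), (2, 90, 5, 88)]   # hypothetical
print(peto_pooled_or(trials))
```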

33. Were cut-points used to dichotomize continuous or ordinal outcomes justified?


[Yes / Unclear / No / Not applicable]

• Was there evidence that authors did not try a number of different cut-points (selecting the one which
gave the most significant P value for treatment effect)?

34. Were time-to-event data appropriately dealt with?


[Yes / Unclear / No / Not applicable]

• One would usually expect to see analyses of log-hazard ratios, but other methods may be used if justified.
• When based on summary statistics, were methods to extract and calculate log-hazard ratios and
variances appropriate?
• Was the ‘event’ defined in the same way in each study?
• Were there differences in the amount of and reasons for censoring across studies, and if so was this addressed?
• If there was major variation in the follow-up time of subjects between studies, was this addressed?
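As one illustration of working from summary statistics, the following sketch back-calculates a log hazard ratio and its variance from a reported hazard ratio and 95% confidence interval; the reported values are hypothetical, and this is only one of several possible extraction approaches.

```python
# Log hazard ratio and its variance from a reported HR and 95% CI.
from math import log

def log_hr_from_ci(hr, lower, upper, level_z=1.96):
    log_hr = log(hr)
    se = (log(upper) - log(lower)) / (2.0 * level_z)
    return log_hr, se, se**2

print(log_hr_from_ci(hr=0.78, lower=0.62, upper=0.98))
```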

35. Were ordinal data appropriately dealt with?


[Yes / Unclear / No / Not applicable]

• If ordinal data are analysed as normally distributed data, were there sufficient categories, was the
distribution approximately bell-shaped and was it reasonable to assume an interval scale?
• If data were dichotomized, was a rationale given for the choice of cut-point?
• If there are a small number of categories and the numbers in each category are presented, then a proportional-odds model could be considered. Data from different rating scales could be combined if the proportional-odds assumption is appropriate for all scales. Did the proportional-odds assumption seem reasonable?

36. Were indirect comparisons performed appropriately?


[Yes / Unclear / No / Not applicable]

Informal indirect comparisons are unreliable. For example, a naïve comparison – comparing A with B in terms
of the summary event rates in A from one set of studies and in B from another set – is generally wrong. This
naïve method ignores the randomized nature of the data, and it is subject to confounding that will bias the
estimate by an unpredictable amount. The adjusted indirect comparison method is based on comparing A
with a third treatment C from one set of studies, for example, and combining this with a comparison of B
with C from another set. There are extensions to this where networks of comparisons are used, but adhering
to the principle of working with treatment comparisons taken from within trials. There are still potential
problems because of the observational nature of the technique: effectively a type of meta-regression. One
crucial aspect is whether the ‘treatments’ compared in different trials were actually equivalent, or whether
medical practice may have changed over time, for example, or different populations have been used.
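A minimal sketch of the adjusted (Bucher-type) indirect comparison described above, assuming A-vs-C and B-vs-C estimates taken from within randomized comparisons; the variances of the two direct estimates simply add. The inputs are hypothetical log odds ratios.

```python
# Adjusted indirect comparison of A vs B via a common comparator C.
from math import sqrt

def adjusted_indirect(d_ac, se_ac, d_bc, se_bc):
    d_ab = d_ac - d_bc
    se_ab = sqrt(se_ac**2 + se_bc**2)
    ci = (d_ab - 1.96 * se_ab, d_ab + 1.96 * se_ab)
    return d_ab, se_ab, ci

print(adjusted_indirect(d_ac=-0.45, se_ac=0.12, d_bc=-0.20, se_bc=0.15))
```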

Summary judgement (Section C): quality of meta-analysis

37. Were the basic meta-analysis methods appropriate?


[Yes / Probably yes / Unclear / Probably no / No / Not applicable]
In formulating this judgement, review the following:
• Concerns over clinical heterogeneity of studies in meta-analyses


○ Will the meta-analytic results be nonsensical?


• Whether statistical heterogeneity was reliably identified.


• Methods used to combine studies in meta-analysis.
○ If the synthesis is based on a simple pooling of information, there is a risk of confounding study
effects with treatment effects (leading to phenomena such as Simpson’s Paradox). If simple pooling
has been used it should be classed as inappropriate.
○ Fixed-effect meta-analyses in the presence of heterogeneity may be very misleading.
○ Double counting can introduce over-precision.
• Methods used to investigate heterogeneity.
○ Spurious findings?
• Presence of reporting bias or small-study effects.
○ These call into question the appropriateness of a random-effects model (since these up-weight
smaller studies compared with fixed-effect analyses).
• Special issues for the type of data.

Section D. Reporting and interpretation

38. Were results of risk of bias (methodological quality) assessments reported?


[Yes in a table / Yes in the text / Unclear / No]

This may be at the study level (typically in a table) or a summary across studies (typically in a table).

39. Were results appropriately interpreted in the light of risk of bias in included studies?
[Yes / Unclear / No]

• Was there discussion of the likely reasons for bias?


• Was the magnitude of any likely bias discussed?
• Were sensitivity analyses undertaken?

40. Were results appropriately interpreted in the light of risk of reporting bias?
[Yes / Unclear / No]

There is a demonstrated tendency for peer-reviewed publications to favour studies with positive
results. This may also lead to ‘small-study effects’, where the smaller studies in a meta-analysis show
larger treatment effects. Therefore, a meta-analysis based just on a search of published literature runs
the risk that combination of results from the selected studies will lead to a biased estimate. This can be
mitigated by assessment of bias using the techniques listed in Item 34. On the other hand, searches
that include clinical trial registries that can be expected to contain nearly all trials relevant to a chosen
treatment comparison are much less likely to suffer from reporting bias.

• Was the likelihood for reporting bias discussed?


• Was the magnitude of any likely bias discussed?
• Were sensitivity analyses undertaken?

41. Were results appropriately interpreted in the light of any multiplicity?

[Yes / Unclear / No]

• Similar considerations that apply to individual studies also apply to meta-analyses in relation to
multiplicity. Multiple endpoints, multiple time-points and multiple treatments (giving rise to, potentially,
several pair-wise assessments between treatments) all arise in meta-analyses.
○ To handle multiple endpoints, one (or occasionally two or more) endpoints should have been declared
as primary for the meta-analysis. Practically speaking, it is likely that you will not have access to a
prospectively written meta-analysis protocol and will need to trust the meta-analysis report if / when it
states that the primary and secondary endpoints were defined a priori (and what these endpoints were).
○ For multiple time-points, a repeated measures ANOVA or regression may have been used, but when these are performed it is still possible to see several individual time-points presented separately. This may be acceptable for meta-analyses that draw from exploratory studies, but it would still be
preferable, for the meta-analysis, if a summary of the effect over all time-points (e.g. area under curve,
maximum, minimum) is presented. Alternatively the effect size at a particular time-point may have
been declared as primary, which is acceptable.


○ For multiple treatments, the meta-analysis should generally present results from multiple-treatment meta-analysis methods, or regression- or ANOVA-based assessments of effect size, rather than effect sizes derived from several separate pair-wise comparisons. Alternatively the effect size at a particular dose (e.g.
highest dose or lowest dose versus placebo) may have been declared as primary, which is acceptable.

Summary judgement (Section D): conclusions

42. Are the conclusions sound?


[Yes / Probably yes / Unclear / Probably no / No / Not applicable]

• Give your general view on whether the authors have let their biases or prejudices influence their
conclusions and interpretations, or whether these are ‘spun’ in favour of one particular intervention.
• The conclusions should match the results of the meta-analysis.
• The conclusions should express the results in correct language to avoid giving a mistaken impression to
readers who may only look at the bottom line. For example, lack of statistical significance of the
combined estimate should not be reported as absence of effect, and statistical significance should not
be reported as a clinically relevant effect.

43. Source of funding

• Note any support from a pharmaceutical company.


• If a report mentions that it was written by a medical writer but does not mention any industry funding or
influence, then make a note of this.

Acknowledgements
Nicola Hewson (MSc, Syne qua non, Diss, UK) was an original member of the expert group and contributed to
development of the quality assessment tool. JPTH was supported in part by the UK Medical Research Council (Unit
Programme number U105285807). We are grateful to two peer reviewers for helpful comments on an earlier draft
of the paper.

References
Ades AE, Caldwell DM, Reken S, Welton NJ, Sutton AJ, Dias S. NICE DSU technical support document 7:
Evidence synthesis of treatment efficacy in decision making: a reviewer’s checklist. 2012; available from
http://www.nicedsu.org.uk.
Higgins JPT, Green S (eds). 2008. Cochrane Handbook for Systematic Reviews of Interventions. John Wiley & Sons,
Chichester, UK.
Jadad AR, Cook DJ, Jones A, Klassen TP, Tugwell P, Moher M, Moher D. 1998. Methodology and reports of
systematic reviews and meta-analyses: A comparison of Cochrane reviews with articles published in paper-
based journals. J Am Med Assoc 280(3): 278–80.
Landis JR, Koch GG. 1977. The measurement of observer agreement for categorical data. Biometrics 33(1):
159–174.
Lane PW, Higgins JPT, Anagnostelis B, Anzures-Cabrera J, Baker NF, Cappelleri JC, Haughie S, Hollis S, Lewis SC,
Moneuse P, Whitehead A. 2013. Methodological quality of meta-analyses: comparisons over time and between
industry-sponsored and academic-sponsored reports. Research Synthesis Methods 4(4): 342–350.
Olsen O, Middleton P, Ezzo J, Gotzsche P, Hadhazy V, Herxheimer A, Kleijnen J, McIntosh H. 2001. Quality of
Cochrane reviews: Assessment of sample from 1998. BMJ 323(7317): 829–832.
Oxman AD, Guyatt GH. 1988. Guidelines for reading literature reviews. Can Med Assoc J 138(8): 697–703.
Patsopoulos NA, Analatos AA, Ioannidis JPA. 2005. Relative citation impact of various study designs in the health
sciences. JAMA 293(19): 2362–6.
Shea BJ, Grimshaw JM, Wells GA, Boers M, Andersson N, Hamel C, Porter AC, Tugwell P, Moher D, Bouter LM. 2007.
Development of AMSTAR: A measurement tool to assess the methodological quality of systematic reviews. BMC
Med Res Methodol 7: 10.
Shea B, Moher D, Graham I, Pham B, Tugwell P. 2002. A comparison of the quality of Cochrane reviews and
systematic reviews published in paper-based journals. Eval Health Prof 25(1): 116–129.
StataCorp. 2009. Stata Statistical Software: Release 11. StataCorp LP, College Station, Texas, USA

