Reliability: What is it, and how is it measured?

by Anne Bruton, Joy H Conway and Stephen T Holgate

Key Words
Reliability, measurement, quantitative measures, statistical method.
minutiae associated with reliability measures, for which readers are referred to standard books on medical statistics.

Table 1: Common reasons why therapists perform measurements [contents not recovered in this extract]

Table 2: Examples of quantitative measures performed by physiotherapists [contents not recovered in this extract]

Measurement Error
It is very rare to find any clinical measurement that is perfectly reliable, as all instruments and observers or measurers (raters) are fallible to some extent and all humans respond with some inconsistency. Thus any observed score (X) can be thought of as a function of two components, ie a true score (T) and an error component (E):

X = T ± E

The difference between the true value and the observed value is measurement error. In statistical terms, 'error' refers to all sources of variability that cannot be explained by the independent (also known as the predictor, or explanatory) variable. Since the error components are generally unknown, it is only possible to estimate the amount of any measurement that is attributable to error and the amount that represents an accurate reading. This estimate is our measure of reliability.

Table 3: Repeated maximum inspiratory pressure measures data demonstrating good relative reliability

                  MIP                    Rank
Subject   Day 1   Day 2   Difference   Day 1   Day 2
   1       110     120       +10         2       2
   2        94     105       +11         4       4
   3        86      70       –16         5       5
   4       120     142       +22         1       1
   5       107     107         0         3       3

Table 4: Repeated maximum inspiratory pressure measures data demonstrating poor relative reliability

                  MIP                    Rank
Subject   Day 1   Day 2   Difference   Day 1   Day 2
   1       110      95       –15         2       5
   2        94     107       +13         4       3
   3        86      97       +11         5       4
   4       120     120         0         1       2
   5       107     129       +22         3       1
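The decomposition X = T ± E, and the idea developed later in this article that reliability is the proportion of total variance attributable to true differences, can be sketched with a small simulation. All the numbers below (the true-score and error standard deviations, the sample size) are illustrative assumptions, not values from the article:

```python
import random

random.seed(1)

# Illustrative simulation: each observed score X is a true score T plus
# a random error component E, ie X = T + E.  The chosen spreads (15 for
# true scores, 5 for error) are arbitrary assumptions for this sketch.
true_scores = [random.gauss(100, 15) for _ in range(10_000)]   # T
observed = [t + random.gauss(0, 5) for t in true_scores]       # X = T + E

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

# Reliability estimated as true-score variance over total observed variance.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))
```

With an error standard deviation of 5 against a true-score standard deviation of 15, the ratio comes out close to 15² / (15² + 5²) = 0.9: most of the observed spread reflects real differences between subjects rather than measurement error.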
Measurement errors may be systematic or random. Systematic errors are predictable errors, occurring in one direction only: constant and biased. For example, when using a measurement that is susceptible to a learning effect (eg strength testing), a retest may be consistently higher than a prior test (perhaps due to improved motor unit co-ordination). Such a systematic error would therefore not affect reliability, but would affect validity, as test values are not true representations of the quantity being measured. Random errors are due to chance and are unpredictable; they are thus the basic concern of reliability.

Types of Reliability
Baumgartner (1989) has identified two types of reliability, ie relative reliability and absolute reliability.

Relative reliability is the degree to which individuals maintain their position in a sample over repeated measurements. Tables 3 and 4 give some maximum inspiratory pressure (MIP) measures taken on two occasions, 48 hours apart. In table 3, although the differences between the two measures vary from –16 to +22 centimetres of water, the ranking remains unchanged. That is, on both day 1 and day 2 subject 4 had the highest MIP, subject 1 the second highest, subject 5 the third highest, and so on. This form of reliability is often assessed by some type of correlation coefficient, eg Pearson's correlation coefficient, usually written as r. For table 3 the data give a Pearson's correlation coefficient of r = 0.94, generally accepted to indicate a high degree of correlation. In table 4, however, although the differences between the two measures look similar to those in table 3 (ie –15 to +22 cm of water), on this occasion the ranking has changed. Subject 4 has the highest MIP on day 1, but is second highest on day 2; subject 1 had the second highest MIP on day 1, but the lowest MIP on day 2; and so on. For the table 4 data r = 0.51, which would be interpreted as a low degree of correlation. Correlation coefficients thus give information about association between two variables, and not necessarily about their proximity.

Absolute reliability is the degree to which repeated measurements vary for individuals, ie the less they vary, the higher the reliability. This type of reliability is expressed either in the actual units of measurement, or as a proportion of the measured values. The standard error of measurement (SEM), the coefficient of variation (CV) and Bland and Altman's 95% limits of agreement (1986) are all examples of measures of absolute reliability. These will be described later.
Authors
Anne Bruton MA MCSP is currently involved in postgraduate research, Joy H Conway PhD MSc MCSP is a lecturer in physiotherapy, and Stephen T Holgate MD DSc FRCP is MRC professor of immunopharmacology, all at the University of Southampton.

This article was received on November 16, 1998, and accepted on September 7, 1999.

Address for Correspondence
Ms Anne Bruton, Health Research Unit, School of Health Professions and Rehabilitation Sciences, University of Southampton, Highfield, Southampton SO17 1BJ.

Funding
Anne Bruton is currently sponsored by a South and West Health Region R&D studentship.

Why Estimate Reliability?
Reliability testing is usually performed to assess one of the following:

■ Instrumental reliability, ie the reliability of the measurement device.

■ Rater reliability, ie the reliability of the researcher/observer/clinician administering the measurement device.

■ Response reliability, ie the reliability/stability of the variable being measured.

How is Reliability Measured?
As described earlier, observed scores consist of the true value ± the error component. Since it is not possible to know the true value, the true reliability of any test is not calculable. It can, however, be estimated, based on the statistical concept of variance, ie a measure of the variability of differences among scores within a sample. The greater the dispersion of scores, the larger the variance; the more homogeneous the scores, the smaller the variance.

If a single measurer (rater) were to record the oxygen saturation of an individual 10 times, the resulting scores would not all be identical, but would exhibit some variance. Some of this total variance is due to true differences between scores (since oxygen saturation fluctuates), but some can be attributable to measurement error (E). Reliability (R) is the measure of the amount of the total variance attributable to true differences and can be expressed as the ratio of true score variance (T) to total variance, or:

R = T / (T + E)

This ratio gives a value known as a reliability coefficient. As the observed score approaches the true score, reliability increases, so that with zero error there is perfect reliability and a coefficient of 1, because the observed score is the same as the true score. Conversely, as error increases reliability diminishes, so that with maximal error there is no reliability and the coefficient approaches 0. There is, however, no such thing as a minimum acceptable level of reliability that can be applied to all measures, as this will vary depending on the use of the test.

Indices of Reliability
In common with medical literature, physiotherapy literature shows no consistency in authors' choice of reliability estimate calculated for their data. Table 5 summarises the more common reliability indices found in the literature, which are described below.

Table 5: Reliability indices in common use
■ Hypothesis tests for bias, eg paired t-test, analysis of variance.
■ Correlation coefficients, eg Pearson's, ICC.
■ Standard error of measurement (SEM).
■ Coefficient of variation (CV).
■ Repeatability coefficient.
■ Bland and Altman 95% limits of agreement.

Indices Based on Hypothesis Testing for Bias
The paired t-test and analysis of variance are statistical methods for detecting systematic bias between groups of data. These estimates, based upon hypothesis testing, are often used in reliability studies. However, they give information only about systematic differences between the means of two sets of data, not about individual differences. Such tests should therefore not be used in isolation, but be complemented by other methods, eg Bland and Altman's agreement tests (1986).

Correlation Coefficients (r)
As stated earlier, correlation coefficients give information about the degree of association between two sets of data, or the consistency of position within the two distributions. Provided the relative positions of each subject remain the same from test to test, high measures of correlation will be obtained. However, a correlation coefficient will not detect any systematic errors. It is therefore possible to have two sets of scores that are highly correlated but not highly repeatable, as in table 6, where the hypothetical data give a Pearson's correlation coefficient of r = 1, ie perfect correlation despite a systematic difference of 40 cm of water for each subject.

Thus correlation only tells how two sets of scores vary together, not the extent of agreement between them. Often researchers need to know that the actual values obtained by two measurements are the same, not just proportional to one another. Although published studies abound with correlation used as the sole indicator of reliability, their results can be misleading, and it is now recommended that correlation coefficients no longer be used in isolation (Keating and Matyas, 1998; Chinn, 1990).
Table 6: Hypothetical repeated MIP measures showing perfect correlation despite a systematic difference (columns: Subject; MIP on Day 1 and Day 2; Difference; Rank on Day 1 and Day 2 — data not recovered in this extract)
Intra-class Correlation Coefficient (ICC)
The intra-class correlation coefficient (ICC) is an attempt to overcome some of the limitations of the classic correlation coefficients. It is a single index calculated using variance estimates obtained through the partitioning of total variance into between- and within-subject variance (known as analysis of variance, or ANOVA). It thus reflects both the degree of consistency and the agreement among ratings.

There are numerous versions of the ICC (Shrout and Fleiss, 1979), with each form being appropriate to specific situations. Readers interested in using the ICC can find worked examples relevant to rehabilitation in various published articles (Rankin and Stokes, 1998; Keating and Matyas, 1998; Stratford et al, 1984; Eliasziw et al, 1994). The use of the ICC implies that each component of variance has been estimated appropriately from sufficient data (at least 25 degrees of freedom), and from a sample representing the population to which the results will be applied (Chinn, 1991). In this instance, degrees of freedom can be thought of as the number of subjects multiplied by the number of measurements.

As with other reliability coefficients, there is no standard acceptable level of reliability using the ICC. It will range from 0 to 1, with values closer to one representing the higher reliability. Chinn (1991) recommends that any measure should have an intra-class correlation coefficient of at least 0.6 to be useful. The ICC is useful when comparing the repeatability of measures using different units, as it is a dimensionless statistic. It is most useful when three or more sets of observations are taken, either from a single sample or from independent samples. It does, however, have some disadvantages, as described by Rankin and Stokes (1998), that make it unsuitable for use in isolation. As described earlier, any reliability coefficient is determined as the ratio of variance between subjects to the sum of error variance and subject variance. If the variance between subjects is sufficiently high (that is, the data come from a heterogeneous sample) then reliability will inevitably appear to be high. Thus if the ICC is applied to data from a group of individuals demonstrating a wide range of the measured characteristic, reliability will appear to be higher than when applied to a group demonstrating a narrow range of the same characteristic.

Standard Error of Measurement (SEM)
As mentioned earlier, if any measurement test were to be applied to a single subject an infinite number of times, it would be expected to generate responses that vary a little from trial to trial, as a result of measurement error. Theoretically these responses could be plotted, and their distribution would follow a normal curve, with the mean equal to the true score and errors occurring above and below the mean.

The more reliable the measurement response, the less error variability there would be around the mean. The standard deviation of measurement errors is therefore a reflection of the reliability of the test response, and is known as the standard error of measurement (SEM). The value for the SEM will vary from subject to subject, but there are equations for calculating a group estimate, eg SEM = sx√(1 − rxx), where sx is the standard deviation of the set of observed test scores and rxx is the reliability coefficient for those data (often the ICC is used here).

The SEM is a measure of absolute reliability and is expressed in the actual units of measurement, making it easy to interpret, ie the smaller the SEM, the greater the reliability. It is only appropriate, however, for use with interval data (Atkinson and Nevill, 1998), since with ratio data the amount of random error may increase as the measured values increase.
considered acceptable for practical use. There are no firm rules for making this decision, which will inevitably be context based. An error of ±5° in goniometry measures may be clinically acceptable in some circumstances, but may be less acceptable if definitive clinical decisions (eg surgical intervention) are dependent on the measure. Because of this dependence on the context in which they are produced, it is therefore very difficult to make comparisons of reliability across different studies, except in very general terms.

Conclusion
This paper has attempted to explain the concept of reliability and describe some of the estimates commonly used to quantify it. Key points to note about reliability are summarised in the panel below. Reliability should not necessarily be conceived as a property that a particular instrument or measurer does or does not possess. Any instrument will have a certain degree of reliability when applied to certain populations under certain conditions. The issue to be addressed is what level of reliability is considered to be clinically acceptable. In some circumstances there may be a choice only between a measure with lower reliability or no measure at all, in which case the less than perfect measure may still add useful information.

In recent years several authors have recommended that no single reliability estimate should be used for reliability studies. Opinion is divided over exactly which estimates are suitable for which circumstances. Rankin and Stokes (1998) have recently suggested that a consensus needs to be reached to establish which tests should be adopted universally. In general, however, it is suggested that no single estimate is universally appropriate, and that a combination of approaches is more likely to give a true picture of reliability.
Key Messages
Reliability is:
■ Population specific.
■ Not an all-or-none phenomenon.
■ Open to interpretation.
■ Not the same as clinical acceptability.
■ Related to the variability in the group studied.
■ Best estimated by more than one index.