Has My Patient Responded?
To correctly interpret clinical measurements it is necessary to understand the standard deviation and the standard error; the former reflects the range or variability of individuals within a sample and the latter reflects the precision with which the group parameters have been estimated. When evaluating an individual patient, test measurement properties such as repeatability will assist in concluding whether a repeated test, measured to monitor the response to an intervention, has changed beyond its natural variability. Using the best test has an inherent bias and ignores the natural test variation, whereas the average of repeated tests is more representative of the true value, making it more discriminative to change. Serial measurements to follow progress will increase a clinician's confidence in the observed effects of treatment.
Keywords: statistics; repeatability; reproducibility of results
(Received in original form March 18, 2011; accepted in final form May 20, 2011)
Supported by the National Sanitarium Association-University of Toronto Chair in Respiratory Rehabilitation Research (R.S.G.).
Author Contributions: T.E.D. contributed to conception, analysis, and interpretation of data and drafted the article. K.H. contributed to interpretation of data and revising the article critically for important intellectual content. R.A.E. contributed to interpretation of data and revising the article critically for important intellectual content. R.S.G. contributed to interpretation of data and revising the article critically for important intellectual content.
Correspondence and requests for reprints should be addressed to R. S. Goldstein, 82 Buttonwood Avenue, Toronto, ON, M6M 2J5 Canada. E-mail: roger.goldstein@westpark.org
Am J Respir Crit Care Med Vol 184. pp 642–646, 2011
Originally Published in Press as DOI: 10.1164/rccm.201103-0497CC on June 16, 2011
Internet address: www.atsjournals.org
Clinical measurements are used in all respiratory disciplines for assessing a patient's response to an intervention and for monitoring status. Certain properties of a measurement need to be known and correctly applied to interpret whether there has been a true change in an individual patient and so enable appropriate decision-making. For example, in a patient who appears to have improved, it is not uncommon to observe an increase in 6-minute-walk distance (6MWD) after pulmonary rehabilitation that is less than the typical 30 to 50 m change reported in the literature (1). However, it is also not uncommon to observe a substantial improvement in 6MWD, after an initial practice test, between two consecutive days at the beginning of the program in the absence of any intervention; such day-to-day changes may exceed 70 m. Both observations are reasonable, as they reflect a natural, relatively large, test-to-test variability in an individual's test response. Clinical trials commonly report only group test-to-test responses for reliability, but this information says little about an individual patient's test-to-test responses. This can be confusing, especially for those who use results of group clinical outcome measures, with favorable measurement properties for groups of patients, to assess change in a single patient. The confusion can be alleviated by appreciating the natural variation of any test using common statistical parameters.
Population sample characteristics such as the standard deviation (SD) and standard error (SE) of the mean are important in the interpretation of clinical trials. However, they need to be considered in the context of test repeatability when the results are used to evaluate change in an individual patient. The aim of this article is to (1) describe how an individual's variability can be recognized by a test's coefficient of repeatability, an important property when making clinical decisions in the management of an individual patient; (2) present some of the consequences in clinical practice when data are interpreted without an awareness of the repeatability; (3) describe the effects of reporting best effort rather than the average of measurements; and (4) suggest ways to address the need to use and interpret clinical outcomes by accounting for individual variability.
The 6-minute-walk Test (6MWT) is used to illustrate the above points because of its frequent use in clinical trials for respiratory disease (2). It is acknowledged as being inexpensive, safe, and reliable, and a large multicenter study in which the 6MWT was performed according to American Thoracic Society (ATS) standards has just been published (3). The points made in this article are, however, equally applicable to any clinical outcome measure.
Eighty-nine patients with chronic obstructive pulmonary disease (COPD) entered a pulmonary rehabilitation program (4). After completing a practice 6MWT, they then completed two 6MWTs before commencing training. The group results from these two 6MWTs are presented in Table 1.
Standard Deviation, Standard Error, and Confidence Intervals
Scientific and statistical vocabulary should be used with the same precision required of the measurements. The standard deviation (SD) informs about the variability (dispersion) within the original sample population. In the example, most (95% by definition) of the individual distances are within the sample mean ± 2 standard deviations, assuming that the values are normally distributed. In other words, 95% of the distances of the first test should lie within 361 ± 194 m (i.e., 167 to 555 m). This information is useful to describe a sample, but we are also interested in knowing how precisely the description represents the total population of patients with COPD.
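As a minimal sketch of the arithmetic above, the following Python snippet computes the range expected to contain about 95% of individual distances, using the Table 1 values for the first test. The function name is illustrative only, and the calculation assumes normally distributed distances, as the text does.

```python
def individual_range(mean, sd, k=2.0):
    """Return (low, high) = mean -/+ k * SD.

    With k = 2 and normally distributed data, this range is expected
    to hold about 95% of the individual values in the sample.
    """
    return mean - k * sd, mean + k * sd


# First 6MWT before rehabilitation (Table 1): mean 361 m, SD 97 m.
low, high = individual_range(361, 97)
print(f"{low:.0f} to {high:.0f} m")  # 167 to 555 m
```

The width of this range (almost 400 m) describes how much individuals differ from one another; it says nothing yet about how precisely the group mean has been estimated.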
TABLE 1. OUTCOME OF 89 PATIENTS WITH CHRONIC OBSTRUCTIVE PULMONARY DISEASE ENTERING PULMONARY REHABILITATION AFTER COMPLETION OF A PRACTICE TEST

                          First 6MWT before   Second 6MWT before   Second 6MWT − First 6MWT
                          Rehabilitation      Rehabilitation       (Difference)*
Mean, m                   361                 366                  5
Standard deviation, m     97                  99                   41
Standard error, m         10                  11                   4
Confidence interval, m    341 to 381          344 to 388           −3 to 14
Minimum, m                93                  95                   −165
Maximum, m                570                 605                  110

Definition of abbreviation: 6MWT = 6-minute-walk Test.
* Column cells are parameters (mean, SD, SE, etc.) of the difference.
The mean distance in the first walk was 361 m. However, different samples drawn from the total COPD population would likely have different mean values. To quantify how precisely the sample mean estimates the true mean of the total population, the standard error of the mean (SE) can be calculated. This is the standard deviation of all the possible means of samples of a given size and is calculated by dividing the standard deviation of the original sample by the square root of the sample size. The precision of our estimate of the population mean is expressed by the 95% confidence interval (the sample mean ± approximately twice the SE). As the sample size increases, so does the precision of this estimate of the population mean. In the example, the relatively precise estimate of the mean (a small SE of 10 m) is influenced by the reasonable sample size (n = 89). This has resulted in a moderately narrow confidence interval for the estimate of the population mean (341 to 381 m). So, although we are 95% confident that the real population mean for the first 6MWD lies between 341 and 381 m, the SE tells us only about group data and not about individual distances.
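The standard error and confidence interval described above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and the multiplier of 1.96 is the usual normal-distribution value that the text rounds to "twice the SE".

```python
import math


def standard_error(sd, n):
    """SE of the mean: sample SD divided by the square root of sample size."""
    return sd / math.sqrt(n)


def confidence_interval(mean, se, z=1.96):
    """Approximate 95% confidence interval for the population mean."""
    return mean - z * se, mean + z * se


# First 6MWT before rehabilitation (Table 1): mean 361 m, SD 97 m, n = 89.
se = standard_error(97, 89)
ci_low, ci_high = confidence_interval(361, se)
print(f"SE = {se:.1f} m; 95% CI = {ci_low:.0f} to {ci_high:.0f} m")
# SE = 10.3 m; 95% CI = 341 to 381 m
```

Because the sample size sits under a square root in the denominator, quadrupling the number of patients only halves the SE; the SD of the individuals is unaffected.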
Coefficient of Repeatability: The Individual's Test Variability
In the example (Table 1), there was no significant difference in the 6MWD between the first (postpractice) and second tests. The 95% confidence interval for the mean change was narrow (−3 to 14 m), which tells us that the error in estimating the true difference in distance between the two tests is likely small. This is commonly interpreted as the test being reliable. Although it is tempting to assume that there is little difference between the two tests for every patient, it is a common misinterpretation to use group data to comment on individual results. Each individual's difference between tests is plotted against their own average walk in Figure 1, commonly referred to as a Bland-Altman plot (5). To understand individual variability we need to calculate the coefficient of repeatability, using the standard deviation
of all the individual differences between tests. From these data we can also calculate the precision of our estimate of repeatability. The precision of our estimate, but not the coefficient itself, will improve with increasing sample size. The coefficient of repeatability is defined as twice the standard deviation of the difference (SDdiff). If the test is repeated more than twice, the coefficient of repeatability can be calculated from an analysis of variance. In our example, the standard deviation of the difference is 41 m and therefore the coefficient of repeatability is 82 m (Figure 1). This tells us that usually 95% of the differences between the two tests for individual patients will lie between 82 m above and 82 m below the mean difference of 5 m. This coefficient implies that an individual patient would need to change their 6MWD by more than 82 m for a clinician to be 95% confident that this change was not due to natural variation. The probability of a change greater than 82 m, without any alteration in clinical circumstances, is about 5%. No further standardization, reasonable control, or practice will affect the inherent natural variability reflected by the test repeatability. There are strategies, however, that can be used to lessen the restriction this natural variation places on our decision-making (see Sections 3 and 4). The term repeatability is often used interchangeably with reproducibility in the literature; a summary of measurement terms is provided in Table 2 for clarification. A repeatability of 82 m for the 6MWD is consistent with data from several programs (6). N. Eiser and colleagues administered three walks per day for three consecutive days (a total of nine walks). The individual variation in the third walk of each day provides an estimate of the repeatability coefficient of 77 m (N. Eiser, personal communication). The ATS consensus group on standards for the 6MWT (2) estimated a coefficient of variation (standard deviation of the difference divided by the mean) of 8%.
Therefore, if the average walk distance was 400 m and the coefficient of variation is 8%, the SD of the difference between the two tests is 32 m and the repeatability is 64 m. In summary, knowledge of a test's repeatability is important when making clinical decisions about the management of individual patients. Repeatability quantifies the variance of repeated measures within a group, is derived from the SD of the difference in repeated measures on individuals, and does not change with sample size.

Figure 1. Each patient's difference between first and second walk plotted versus the patient's average distance (n = 89). Repeatability of the 6-minute-walk distance (broken lines about the mean) was 82 m. Note, as expected, five patients (≈5%) lie outside this boundary (95%) of 82 m. The dotted lines represent the precision of the estimate of repeatability (5).

TABLE 2. SUMMARY OF MEASUREMENT TERMS

Term: Accuracy
Definition: Closeness of a measured or calculated quantity to its actual (true) value.

Term: Precision
Definition: Closeness to which further measurements or calculations will show the same or similar results.

Term: Agreement
Definition: Closeness between two methods of measuring the same quantity.
Conditions: Test results are obtained on identical test items in the same laboratory by the same operator using different equipment or methods within short intervals of time.

Term: Repeatability
Definition: Precision under repeatability conditions.
Conditions: Independent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the same equipment.

Term: Reproducibility
Definition: Precision under reproducibility conditions.
Conditions: Test results are obtained with the same method on identical test items in different laboratories with different operators using different equipment.
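The repeatability calculation described in this section can be sketched in Python as follows. The function names and the four-patient data set are hypothetical and serve only to illustrate the steps; the figures of 41 m, 8%, and 400 m are those quoted in the text. The sketch covers the two-test case only, not the analysis of variance needed for more than two tests.

```python
import statistics


def coefficient_of_repeatability(first, second, k=2.0):
    """Return (mean difference, SD of differences, k * SD of differences)
    for paired results of the same test in the same individuals."""
    diffs = [b - a for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return mean_diff, sd_diff, k * sd_diff


def is_beyond_natural_variation(change, repeatability):
    """True if an individual's change exceeds the coefficient of repeatability."""
    return abs(change) > repeatability


# Hypothetical paired distances (m) for four patients, for illustration only.
first_walk = [300, 350, 400, 450]
second_walk = [310, 340, 420, 450]
mean_diff, sd_diff, cr = coefficient_of_repeatability(first_walk, second_walk)
print(f"mean difference {mean_diff} m, SDdiff {sd_diff:.1f} m, repeatability {cr:.1f} m")

# Summary values quoted in the text.
cr_article = 2 * 41            # SDdiff of 41 m gives a repeatability of 82 m
cr_from_cv = 2 * (0.08 * 400)  # CV of 8% at a 400 m mean: SDdiff 32 m, repeatability 64 m

print(is_beyond_natural_variation(50, cr_article))  # False
print(is_beyond_natural_variation(90, cr_article))  # True
```

Note that the coefficient depends only on the spread of the individual differences; adding more patients changes how precisely it is estimated, not its expected size.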
certainty could be lowered. For example, in the 6MWT one could accept 66 m (1.65 SD or 90% confidence interval) rather than 80 m (2 SD or 95% confidence interval) as the threshold to define a responder. The degree of risk accepted is weighed against the consequences of being wrong.
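The trade-off between certainty and threshold can be illustrated with a short Python sketch. The function name is illustrative, and the SD of the difference of 40 m is an assumption inferred from the 80 m (2 SD) threshold quoted above; the multipliers 1.65 and 2 are those given in the text.

```python
def responder_threshold(sd_diff, multiplier):
    """Smallest change accepted as a true response: multiplier * SD of differences."""
    return multiplier * sd_diff


SD_DIFF = 40  # m; assumed, implied by the 80 m (2 SD) threshold in the text

for label, multiplier in (("90%", 1.65), ("95%", 2.0)):
    threshold = responder_threshold(SD_DIFF, multiplier)
    print(f"{label} confidence: {round(threshold)} m")
# 90% confidence: 66 m
# 95% confidence: 80 m
```

Lowering the multiplier labels more patients as responders, at the cost of a larger chance that an observed change is only natural variation.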
Figure 2. A comparison of the repeatability determined as a function of the number of tests completed in each session. Repeatability is calculated from the difference in the average and best effort methods of reporting results.
unbalanced. This is illustrated by the example (Figure 3) of a pulmonary rehabilitation program in which the better of two walks at baseline is used (despite evidence that the difference between the two is no more than natural variation) and only one walk is completed at discharge. When the average of the session is used, the mean stays faithful to the true value of the population, independent of how many walks are done per session. However, the value of the best effort moves further from the true value as the number of walks per session is increased. If the program uses best effort and several attempts at baseline, it will underestimate the treatment effect because baseline measures were inflated compared with the true value. Moreover, a control group without a true treatment effect would appear to have worsened. Using the best effort also makes it difficult to compare results among clinical trials and programs. When comparing two programs that use best effort, subjects will appear to be less severely affected in the program that uses the most walk attempts, whereas there would be no difference between the means of the programs if the average walk result was used to establish the group mean, irrespective of how many 6MWTs were performed at the time of measurement. Patients will appear to have a greater capacity to walk depending on the number of attempts they are given to achieve their best effort. For example, if the same 100 patients were tested at two different centers that report best effort but used 2 and 7 attempts, the centers would report best effort distances of 363 and 380 m, respectively, even though the true mean 6MWD was 351 m. In contrast, when using the average walk distance for each patient, the centers would report mean values of 351 m regardless of how many attempts they made.
In other words, when any intervention is applied to a group of patients you would always report the mean response, never the best response from the sample, to interpret whether the intervention had worked. The best effort issue is commonly encountered when clinicians use effort-dependent tests such as the 6MWT or even the FEV1. Use of the highest value as the true value ignores the natural biological variation inherent in every test and influences the evaluation of treatment effects. In summary, there is an inherent variability in an individual's response that cannot be improved on, but can be quantified by the coefficient of repeatability. Under typical clinical circumstances reporting the average of tests is closer to the true group mean value than reporting the best test results.
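The bias of best-effort reporting can be illustrated with a small simulation. This is a hedged sketch, not the authors' analysis: the function name is hypothetical, the true mean of 351 m comes from the example in the text, the between-patient SD of 97 m is borrowed from Table 1, and the within-patient SD of 29 m is an assumption (roughly the 41 m SD of the difference divided by the square root of 2). Distances are assumed to be normally distributed.

```python
import random
import statistics


def simulate_session_means(n_patients, n_walks, true_mean=351.0,
                           between_sd=97.0, within_sd=29.0, seed=0):
    """Simulate one test session and return three group means:
    (true values, per-patient averages, per-patient best efforts)."""
    rng = random.Random(seed)
    true_values, averages, bests = [], [], []
    for _ in range(n_patients):
        # Each patient has a stable true distance ...
        true_value = rng.gauss(true_mean, between_sd)
        # ... and each walk scatters around it by natural variation.
        walks = [rng.gauss(true_value, within_sd) for _ in range(n_walks)]
        true_values.append(true_value)
        averages.append(statistics.mean(walks))
        bests.append(max(walks))
    return (statistics.mean(true_values),
            statistics.mean(averages),
            statistics.mean(bests))


for n_walks in (1, 2, 7):
    true_m, avg_m, best_m = simulate_session_means(2000, n_walks)
    print(f"{n_walks} walk(s): true {true_m:.0f} m, "
          f"average {avg_m:.0f} m, best effort {best_m:.0f} m")
```

Under these assumptions the group mean of the averages stays close to the true mean however many walks are performed, whereas the group mean of the best efforts drifts upward as the number of attempts increases, which is the pattern the section describes.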
SUMMARY
To correctly interpret clinical measurements it is necessary to understand the standard deviation and the standard error; the former reflects the range or variability of individuals within a sample and the latter reflects the precision with which the group parameters have been estimated. When evaluating an individual patient, test measurement properties such as repeatability will assist in concluding whether a repeated test, measured to monitor the response to an intervention, has changed beyond its natural variability. Using the best test has an inherent bias and ignores the natural test variation, whereas the average of repeated tests is more representative of the true value, making it more discriminative to change. Serial measurements to monitor progress will increase a clinician's confidence in the observed effects of treatment.
Author Disclosure: None of the authors has a financial relationship with a commercial entity that has an interest in the subject of this manuscript.
References
1. Lacasse Y, Goldstein R, Lasserson TJ, Martins S. Pulmonary rehabilitation for chronic obstructive pulmonary disease. Cochrane Database Syst Rev [serial on the Internet]. 2006 [accessed June 2011]; 4:CD003793. Available from: http://www.cochrane.org/reviews/clibintro.htm
2. American Thoracic Society. ATS statement: guidelines for the Six-Minute Walk Test. Am J Respir Crit Care Med 2002;166:111–117.
3. Casanova C, Celli BR, Barria P, Casas A, Cote C, de Torres JP, Jardim J, Lopez MV, Marin JM, Montes de Oca M, et al. The 6-min walk distance in healthy subjects: reference standards from seven countries. Eur Respir J 2011;37:150–156.
4. Goldstein RS, Gort EH, Stubbing D, Avendano MA, Guyatt GH. Randomised controlled trial of respiratory rehabilitation. Lancet 1994;344:1394–1397.
5. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–310.
6. Wise R, Brown C. Minimal clinically important differences in the Six-Minute Walk Test and the incremental shuttle walking test. COPD 2005;2:125–129.
7. Redelmeier DA, Bayoumi AM, Goldstein RS, Guyatt GH. Interpreting small differences in functional status: the Six Minute Walk Test in chronic lung disease patients. Am J Respir Crit Care Med 1997;155:1278–1282.
8. Puhan MA, Mador MJ, Held U, Goldstein R, Guyatt GH, Schunemann HJ. Interpretation of treatment changes in 6-minute walk distance in patients with COPD. Eur Respir J 2008;32:637–643.
Figure 3. A comparison of the mean walk distance as a function of the number of tests completed in each session. The grand mean is calculated from the difference in the average and best effort methods of reporting results for each patient.
9. Holland AE, Hill CJ, Rasekaba T, Lee A, Naughton MT, McDonald CF. Updating the minimal important difference for six-minute walk distance in patients with chronic obstructive pulmonary disease. Arch Phys Med Rehabil 2010;91:221–225.
10. Glantz S. What do the data really show? In: Primer of biostatistics, 3rd ed. New York: McGraw Hill; 1993. p. 440.
11. International Organization for Standardization. Accuracy (trueness and precision) of measurement methods and results. 2. Basic method for the determination of repeatability and reproducibility of a standard measurement method. ISO 5725-2:1994. Available at http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=11834
12. International Organization for Standardization. Accuracy (trueness and precision) of measurement methods and results. 1. General principles and definitions. ISO 5725-1:1994. Available at http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=11833