Measurement in Physical Education and Exercise Science
Publication details, including instructions for
authors and subscription information:
http://www.tandfonline.com/loi/hmpe20
To cite this article: Randall D. Penfield & Peter R. Giacobbi, Jr. (2004) Applying a
Score Confidence Interval to Aiken's Item Content-Relevance Index, Measurement
in Physical Education and Exercise Science, 8:4, 213-225, DOI: 10.1207/s15327841mpee0804_3
Downloaded by [Middle Tennessee State University] at 13:53 16 August 2013
MEASUREMENT IN PHYSICAL EDUCATION AND EXERCISE SCIENCE, 8(4), 213–225
Copyright © 2004, Lawrence Erlbaum Associates, Inc.
Randall D. Penfield
Department of Educational and Psychological Studies
University of Miami
Requests for reprints should be sent to Randall D. Penfield, School of Education, P.O. Box 248065,
University of Miami, Coral Gables, FL 33124-2040, E-mail: [email protected]
214 PENFIELD AND GIACOBBI
chology researchers creating scales for use in applied research settings (Dunn et
al., 1999).
Item content-relevance is commonly assessed by obtaining ratings from a panel
of expert judges on the extent to which the item in question matches the intended
content domain. Although the precise form of the rating method may vary across
applications, typically the item content-relevance ratings are obtained using either
a 5- or 7-point Likert-type rating scale, where the lowest possible rating corre-
sponds to very poor content-relevance and the highest possible rating corresponds
to very good content-relevance. The obtained ratings for each item are then sum-
marized using an appropriate descriptive statistic, such as the mean or some trans-
formation of the mean (Crocker, Llabre, & Miller, 1988; Crocker et al., 1989;
Sireci & Geisinger, 1995).
Dunn et al. (1999) reviewed published articles over a two-decade period in The
Sport Psychologist, the Journal of Sport and Exercise Psychology, the Journal of
Applied Sport Psychology, and Research Quarterly for Exercise and Sport. Their
intent was to assess item content-relevance procedures reported by authors of stud-
ies whose main focus included the development of new psychological inventories.
Of the articles reviewed, several trends were noted. First, the number, characteris-
tics, and qualifications of expert judges used to assess item content-relevance var-
ied considerably from study to study. In many of the studies reviewed, Dunn et al.
(1999) noted that little to no information was presented regarding the judges’ char-
acteristics or why specific judges were chosen to serve as expert raters. Dunn et al.
(1999) recommended that “authors provide some information regarding experts’
familiarity not only with the construct domains under investigation, but also with
the population for whom the test is intended” (p. 18).
A second trend noted by Dunn et al. (1999) was that little emphasis was placed
on using statistical procedures to appropriately summarize the obtained judges’
ratings. To provide guidance concerning available procedures that can be used to
summarize the obtained ratings, Dunn et al. (1999) recommended the use of
Aiken’s V statistic (Aiken, 1980, 1985) because it can not only be used to summa-
rize the magnitude of the obtained expert ratings, but also to test specific hypothe-
ses concerning the values of the ratings for the population. The V statistic is com-
puted using the formula
\[ V = \frac{\bar{X} - l}{k} \qquad (1) \]
where \bar{X} represents the sample mean of the judges’ ratings, l represents the lowest
possible rating, and k represents the range of possible values of the rating scale
used (e.g., a scale having possible values extending from 1 to 5 has l = 1 and k = 5 –
1 = 4). The statistic V provides an index of rater endorsement that ranges from 0 to
CONTENT-RELEVANCE CONFIDENCE LEVEL 215
1. A value of V = 0 is obtained when all judges select the lowest possible rating, and
a value of V = 1 is obtained when all judges select the highest possible rating. Hy-
pothesis tests concerning the unknown population value of V, denoted Vp, can also
be conducted. For example, a scale developer may wish to test the null hypothesis
that Vp = 0.50 against the directional alternative hypothesis that Vp > 0.5; any item
for which the null hypothesis is rejected may be deemed to have a sufficient level
of item content-relevance. The hypothesis test is based on an exact binomial test
(see Aiken, 1985 for details), and Aiken (1985) provides a table containing the crit-
ical values of V required to reject the specific null hypothesis Vp = 0.5 in favor of
the alternative hypothesis Vp > 0.5. Applications of this approach are presented by
Aiken (1985) and Dunn et al. (1999).
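As an illustration of Equation 1, the following Python sketch computes V from a set of judges' ratings. The function name `aiken_v` and the example ratings are hypothetical and are not taken from the article; the sketch covers only the descriptive statistic, not the exact binomial test.

```python
from statistics import mean


def aiken_v(ratings, lowest, highest):
    """Aiken's V per Equation 1: (mean rating - lowest rating) / range."""
    k = highest - lowest  # range of the rating scale, e.g. 5 - 1 = 4
    if k <= 0:
        raise ValueError("highest must exceed lowest")
    if any(r < lowest or r > highest for r in ratings):
        raise ValueError("rating outside the scale")
    return (mean(ratings) - lowest) / k


# Hypothetical ratings from seven judges on a 1-to-5 scale.
example_ratings = [4, 5, 4, 4, 5, 4, 4]
print(round(aiken_v(example_ratings, lowest=1, highest=5), 2))
```

Passing the scale endpoints explicitly, rather than inferring them from the observed ratings, keeps V on the intended 0-to-1 metric even when no judge uses the extreme categories.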
Although Aiken’s V provides a useful framework for making descriptive state-
ments about the level of content-relevance of an item, the inferential procedure for
testing hypotheses concerning Vp has several drawbacks. First, the critical values
of V listed in the table provided by Aiken (1985) are only applicable for the null hy-
pothesis that Vp = 0.5, a somewhat arbitrary null hypothesis. Because hypothesis
tests of Vp greater than 0.5 (e.g., 0.6, 0.7, or 0.8) may be of great interest to researchers
wishing to place a more conservative criterion on the value of Vp for inclusion
of the item on the scale, the table of critical values of V provided by Aiken
(1985) may be of limited use to some researchers assessing item content-rele-
vance. Second, the computation of the binomial probabilities required for the hy-
pothesis test can be intensive, and thus unless the rating specifications being used
are within the criteria of Aiken’s (1985) table of critical values, the researcher must
be able to compute the binomial tail probabilities either by hand or using statistical
software. Third, the discrete nature of the data inherent in the exact binomial test
leads to difficulties in making inferential statements, particularly when the number
of raters is small, because the critical values of V do not correspond precisely to the
intended Type I error rate. As a result, the specific critical values of V listed in the ta-
ble can be somewhat misleading for a researcher intending to assume a Type I error
rate of 0.05 or 0.01, a commonly encountered problem in conducting exact hypothe-
sis tests of discrete variables (see Agresti, 1990). Fourth, the outcome of a hypothesis
test alone provides little information about the actual value of Vp. That is, the hypoth-
esis test leads only to a decision of whether or not Vp equals a particular value, but
does not provide information concerning what the value of Vp might actually be.
Fifth, the hypothesis test alone provides no information concerning the expected er-
ror of V as an estimate of Vp, and thus provides no information concerning how close
the sample value of V is expected to be to the unknown value of Vp.
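The third drawback, that critical values of a discrete test statistic rarely match the nominal Type I error rate, can be illustrated with a generic binomial count. The sketch below is not Aiken's (1985) procedure; it assumes a simple upper-tailed binomial test with a hypothetical n = 10 trials and null probability 0.5, and the function names are illustrative only.

```python
from math import comb


def upper_tail(n, c, p=0.5):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(c, n + 1))


def critical_count(n, alpha, p=0.5):
    """Smallest count c whose upper-tail probability does not exceed alpha.

    Returns (c, actual_size); c is None if no count can reach significance.
    """
    for c in range(n + 1):
        size = upper_tail(n, c, p)
        if size <= alpha:
            return c, size
    return None, 0.0


c, size = critical_count(n=10, alpha=0.05)
print(c, round(size, 4))  # actual size falls well below the nominal 0.05
```

Because the attainable tail probabilities jump in discrete steps, the test run at a nominal 0.05 level in this sketch actually operates at a much smaller error rate, which is the sense in which tabled critical values can mislead.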
The five drawbacks of the binomial-based hypothesis test of Vp discussed earlier
can be overcome through the use of a confidence interval for Vp. The advantages of
a confidence interval for Vp include (a) the existence of rich information concern-
ing the actual value of Vp, in contrast to the reject–accept nature of a hypothesis
test; (b) the existence of information concerning the amount of error expected in
with the growing emphasis placed on the use of confidence intervals in reporting
all quantitative psychological research (Fidler, 2002).
The difficulty in constructing a confidence interval for Vp is the bounded nature
of V, making confidence intervals based on asymptotic normal distribution as-
sumptions inappropriate. That is, because V is not normally distributed, traditional
confidence intervals for a population mean, such as the Wald interval (Wald, 1943)
given by most introductory statistics texts as
\[ \bar{X} \pm t_{df}\left(\frac{s}{\sqrt{n}}\right), \]
Consider the case of a group of n judges rating an item using ratings that have a
possible range of k. Note that k can be computed as the highest possible rating mi-
nus the lowest possible rating, or as the number of points on the rating scale minus
one. Based on the ratings of the n judges, suppose that the statistic V is computed
using Equation 1. Then, the lower (L) and upper (U) limits to a C% Score confi-
dence interval for Vp can be obtained using the following form originally devel-
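\[ L = \frac{A - B}{C} \qquad (4) \]
\[ U = \frac{A + B}{C} \qquad (5) \]
where
\[ A = 2nkV + z^2 \qquad (6) \]
\[ B = z\sqrt{4nkV(1 - V) + z^2} \qquad (7) \]
\[ C = 2(nk + z^2) \qquad (8) \]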
The Score confidence interval has the desirable property of being asymmetric
about V. If V is greater than 0.5, then the Score confidence interval will extend fur-
ther below V than above V, and if V is less than 0.5, then the Score confidence inter-
val will extend further above V than below V. In addition, the bounds of the Score
confidence interval cannot extend below 0 or above 1.0, thus overcoming a prob-
lem of impossible confidence interval limits commonly encountered in the ap-
plication of the traditional Wald interval to bounded variables. The results of em-
pirical investigations of the Score confidence interval indicate that the Score
confidence interval is typically substantially shorter in length, and has a higher
probability of containing the population parameter of interest than the traditional
Wald confidence interval (Ghosh, 1979; Newcombe, 1998).
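The boundedness and asymmetry properties described above can be checked with a short Python sketch. It assumes B = z√(4nkV(1 − V) + z²), the square-root term of Equation A4 with x = nk and p = V. The ratings are hypothetical, the t critical value of 2.447 (df = 6, two-sided 95%) is hard-coded as an assumption rather than computed, and the function names are illustrative only.

```python
import math
import statistics


def score_interval(v, n, k, z=1.96):
    """Score limits for Vp from V, n raters, and rating range k."""
    a = 2 * n * k * v + z**2
    b = z * math.sqrt(4 * n * k * v * (1 - v) + z**2)  # assumed form of B
    c = 2 * (n * k + z**2)
    return (a - b) / c, (a + b) / c


def wald_interval_v(ratings, lowest, k, t_crit):
    """Wald interval for the mean rating, rescaled to the V metric."""
    n = len(ratings)
    half = t_crit * statistics.stdev(ratings) / math.sqrt(n)
    centre = statistics.mean(ratings)
    return (centre - half - lowest) / k, (centre + half - lowest) / k


ratings = [5, 5, 5, 5, 4, 5, 5]  # hypothetical ratings on a 1-to-5 scale
lowest, k = 1, 4
v = (statistics.mean(ratings) - lowest) / k

wald_lo, wald_hi = wald_interval_v(ratings, lowest, k, t_crit=2.447)
score_lo, score_hi = score_interval(v, len(ratings), k)
print("Wald :", round(wald_lo, 3), round(wald_hi, 3))
print("Score:", round(score_lo, 3), round(score_hi, 3))
```

With ratings clustered near the top of the scale, the Wald upper limit in this sketch exceeds 1.0, whereas the Score limits stay inside the unit interval and extend further below V than above it.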
A NUMERIC EXAMPLE
\[ V = \frac{4.29 - 1.00}{4.00} = 0.82. \]
The obtained value of V tells us that the sampled raters tended to provide relatively
high ratings for this item. The value of V may deviate substantially, however, from
the population value it estimates (Vp), and thus it is useful to construct a confidence
interval for Vp. Let us construct a 95% confidence interval for Vp using the Score
confidence interval. Note that a 95% confidence interval uses z = 1.96. Using this
information, the terms A, B, and C of Equations 6, 7, and 8 are given by
Substituting the values of A, B, and C into Equations 4 and 5 yields the lower and
upper limits of
\[ L = \frac{49.76 - 8.85}{63.68} = 0.64 \]
\[ U = \frac{49.76 + 8.85}{63.68} = 0.92. \]
Thus, we can be 95% confident that the value of Vp lies between 0.64 and 0.92.
Note that the lower bound of 0.64 lies 0.18 units below V, and the upper bound of
0.92 lies 0.10 units above V. The Score confidence interval provides more room for
error below V than above V because the value of V was closer to 1.0 than to 0.
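The intermediate substitution behind these limits can be sketched as follows. The sketch assumes n = 7 raters, a value inferred from the reported A = 49.76 with k = 4 and V = 0.82 (which implies nk = 28) rather than stated in the text above. It also takes B = z√(4nkV(1 − V) + z²), the square-root term of Equation A4 with x = nk and p = V.

```latex
\begin{align*}
A &= 2nkV + z^2 = 2(7)(4)(0.82) + 1.96^2 = 45.92 + 3.84 = 49.76 \\
B &= z\sqrt{4nkV(1 - V) + z^2}
   = 1.96\sqrt{4(7)(4)(0.82)(0.18) + 1.96^2}
   = 1.96\sqrt{20.37} \approx 8.85 \\
C &= 2(nk + z^2) = 2(28 + 3.84) = 63.68
\end{align*}
```

These values agree with the numerators and denominator used in the limits of 0.64 and 0.92 reported above.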
TABLE 1
Outcomes of Ratings, Values of Aiken’s V, and 90% and 95% Score
Confidence Interval for 20 Items of the Life Skills Questionnaire
Note. The critical value of V for testing the null hypothesis that Vp = 0.5 according to Aiken’s
(1985) table of critical values is 0.75 under a Type I error rate of 0.05. The items for which the null hy-
pothesis is rejected according to Aiken’s critical value are noted with *. CI = confidence interval.
and 1.20 on a range of 1 to 5). As a result, we viewed the cited criteria to be mean-
ingful for applied settings, but acknowledge that the criteria adopted by a particu-
lar researcher may vary depending on the content area, and intended use of the ob-
tained scale scores. We are not aware of any research providing guidelines
concerning criteria for acceptable lengths of confidence intervals for item con-
tent-relevance studies.
The Score confidence interval can also be used to assess hypotheses concerning
the value of Vp. For a directional test of the null hypothesis that Vp equals some
value, V0, using a Type I error rate of α, the acceptance of the null hypothesis is as-
sociated with a (1 – 2α) × 100% confidence interval about V. The null hypothesis is
accepted if the confidence interval for Vp contains the null value of V0, and the null
hypothesis is rejected in favor of the directional alternative hypothesis if the lower
limit of the confidence interval exceeds the null value, V0. As an example, consider
a researcher interested in testing the null hypothesis that Vp = 0.5 against the alternative
hypothesis that Vp > 0.5 (note that the value of Vp = 0.5 corresponds to a mean
rating equal to the midpoint of the rating scale). Items for which the null hypothesis
is rejected are retained, and items for which the null hypothesis is accepted
are flagged for revision or removal.
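The decision rule just described can be sketched in Python. The sketch builds the (1 − 2α) × 100% Score interval and rejects the null hypothesis only when the lower limit exceeds V0. It assumes B = z√(4nkV(1 − V) + z²), as in Equation A4 with x = nk and p = V. The example uses V = 0.82 and k = 4 from the numeric example; n = 7 is an inferred value, not one stated in the text, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def score_interval(v, n, k, z):
    """Score limits for Vp from V, n raters, and rating range k."""
    a = 2 * n * k * v + z**2
    b = z * math.sqrt(4 * n * k * v * (1 - v) + z**2)  # assumed form of B
    c = 2 * (n * k + z**2)
    return (a - b) / c, (a + b) / c


def rejects_null(v, n, k, v0, alpha=0.05):
    """Directional test of Vp = v0 against Vp > v0 at Type I error rate alpha.

    A (1 - 2*alpha) two-sided interval leaves alpha in each tail, so the
    normal quantile is taken at 1 - alpha (about 1.645 for alpha = 0.05).
    """
    z = NormalDist().inv_cdf(1 - alpha)
    lower, _upper = score_interval(v, n, k, z)
    return lower > v0


print(rejects_null(0.82, n=7, k=4, v0=0.50))  # lower limit compared with 0.50
print(rejects_null(0.82, n=7, k=4, v0=0.75))  # lower limit compared with 0.75
```

Because the null value enters only through the final comparison, the same interval can be reused for any V0, which is the flexibility the confidence interval offers over a table of critical values tied to a single null hypothesis.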
Applying the hypothesis test to the items presented in Table 1, we see that Items
2, 8, 9, 10, 14, and 18 have a 90% confidence interval that contains 0.5, and thus for
each of these items we accept the null hypothesis that Vp = 0.5. These items should
be examined for their content, and either revised or removed from the scale. The
remaining items are retained because there is sufficient evidence to support the hy-
pothesis that Vp exceeds 0.5, and thus that the mean rating in the population of rat-
ers reflects a positive endorsement of the item (e.g., the mean rating in the popula-
tion exceeds 3.0 on a 5-point scale). Note that using the significance test proposed
by Aiken (1980, 1985), the critical value of V for testing the null hypothesis that Vp
= 0.5 at α = 0.05 is equal to 0.75 (Aiken, 1985, p. 134). The items for which the null
hypothesis is rejected using this critical value are denoted by an asterisk next to the
value of V in Table 1. Using Aiken’s critical value, the null hypothesis is accepted
for Items 2, 3, 4, 7, 8, 9, 10, 13, 14, and 18. Based on these results, Aiken’s critical
value appears to be more conservative than the Score confidence interval. The
conservative nature of Aiken’s hypothesis test, relative to that of the results pro-
vided by the Score confidence interval, is most likely due to the fact that the critical
values provided by Aiken’s (1985) table do not correspond precisely to the in-
tended Type I error rate because of the highly discrete nature of the variable under
investigation. This problem, as noted earlier, is commonly encountered with exact
tests of discrete variables (Agresti, 1996).
Unlike the hypothesis test proposed by Aiken (1980, 1985), the use of confi-
dence intervals permits us to assess any arbitrary null hypothesis. For example, we
may wish to make the criterion for item revision more stringent by testing the
null hypothesis that Vp = 0.75. Note that a value of 0.75 is associated with an average
rating of 4 on a 5-point scale with response options ranging from 1 to 5, or good
fit to the intended construct. In this case, determining the items for which the null
hypothesis of Vp = 0.75 is accepted using a Type I error rate of 0.05 can be con-
ducted by determining the items for which the 90% confidence interval contains
0.75 (all items but 1, 12, 15, and 16). Although the criterion value of 0.75 may be
too stringent in practical applications, we present it here strictly for didactic pur-
poses to illustrate the flexibility of the Score confidence interval over the hypothe-
sis test proposed by Aiken (1980, 1985). Researchers in the beginning stages of
scale development may choose to select a more liberal criterion value (e.g., V0 =
0.4) or use a higher Type I error rate (e.g., α = 0.10), particularly if the number of
expert raters used is small.
As a final note on the application of the Score confidence interval to Aiken’s V,
an often useful method for assessing item content-relevance is to ask each expert
judge to rate the content-relevance of each item regarding each subconstruct in-
tended to be measured by the scale. Computing Aiken’s V for each combination of
item and subconstruct permits the scale developer to obtain an index of con-
tent-relevance for each item in relation to each subconstruct (see Dunn et al.,
1999). Because an item should yield higher values of V for subconstructs intended
to be measured by the item than subconstructs not intended to be measured by the
item, this approach can yield useful convergent and divergent validity information,
and as a result lead to more accurate conclusions concerning item content-rele-
vance. The Score confidence interval can be applied to this situation in a similar
fashion as described earlier. In this case, a Score confidence interval would be con-
structed for each subconstruct in relation to each item. Although the content-rele-
vance ratings collected for the life skills scale described earlier do not accommo-
date this particular analysis (because each item was not rated in relation to each
subconstruct), we considered it important to bring this potentially useful application of
the Score confidence interval to the reader’s attention.
DISCUSSION
dence intervals displayed in Table 1 for the 20 items of the life skills scale were
computed in just a few minutes.
In conclusion, the application of the Score confidence interval to Aiken’s V can
enhance the analysis of item content-relevance by providing valuable information
concerning the expected precision of Aiken’s V as an estimator of the unknown
population value, Vp. The primary obstacle to the implementation of the Score con-
fidence interval is its computational complexity; however, as described earlier, the
Score confidence interval can be computed with little difficulty using any data
management software. One unresolved issue in the application of the Score confidence
interval to item content-relevance studies concerns criteria for the acceptable
length of the interval. Because meaningful guidelines concerning acceptable interval
lengths have not yet been established, this is an important topic for future research
in the area of scale validation.
REFERENCES
Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research, 35,
382–385.
Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion: Comparison of
seven methods. Statistics in Medicine, 17, 857–872.
Penfield, R. D. (2003). A score method of constructing asymmetric confidence intervals for the mean of
a rating scale item. Psychological Methods, 8, 149–163.
Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83–117.
Sireci, S. G., & Geisinger, K. F. (1995). Using subject-matter experts to assess content representation:
An MDS analysis. Applied Psychological Measurement, 19, 241–255.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of ob-
servations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the
American Statistical Association, 22, 209–212.
Yalow, E. S., & Popham, W. J. (1983). Content validity at the crossroads. Educational Researcher,
12(8), 10–14, 21.
APPENDIX
\[ z = \frac{p - \pi_0}{\sqrt{\dfrac{\pi_0(1 - \pi_0)}{x}}}. \qquad (A1) \]
When both sides of Equation A1 are squared, the terms can be rearranged to give
Next, the quadratic form of Equation A3 can be solved for π0 using
\[ \frac{2px + z^2 \pm z\sqrt{4px(1 - p) + z^2}}{2(x + z^2)}. \qquad (A4) \]
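The algebra linking Equation A1 to Equation A4 can be sketched as follows. The steps below are a reconstruction from the two end points and may not match the original intermediate equations line for line.

```latex
\begin{align*}
z^2 \,\frac{\pi_0(1 - \pi_0)}{x} &= (p - \pi_0)^2
  && \text{(square both sides of A1)} \\
z^2\pi_0 - z^2\pi_0^2 &= xp^2 - 2xp\,\pi_0 + x\pi_0^2
  && \text{(multiply by } x \text{ and expand)} \\
(x + z^2)\,\pi_0^2 - (2px + z^2)\,\pi_0 + xp^2 &= 0
  && \text{(collect terms in } \pi_0\text{)} \\
(2px + z^2)^2 - 4(x + z^2)\,xp^2 &= z^2\left[4px(1 - p) + z^2\right]
  && \text{(discriminant)}
\end{align*}
```

Applying the quadratic formula to the third line, with the discriminant simplified as in the fourth, gives the two roots in Equation A4; the smaller root is the lower limit and the larger root is the upper limit.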