Rasch Model
Purya Baghaei
Islamic Azad University, Iran
Takuya Yanagida
University of Applied Sciences, Austria
Moritz Heene
Ludwig-Maximilians University, Germany
________________________________
Author info: Correspondence should be sent to: Dr. Purya Baghaei, English
Department, Mashhad Branch, Islamic Azad University, Mashhad, Iran.
[email protected]
North American Journal of Psychology, 2017, Vol. 19, No. 1, 155-168.
Fischer-Scheiblechner's S test
In Fischer-Scheiblechner's (Fischer & Scheiblechner, 1970) approach, the sample is divided into two subsamples and the item parameters are estimated in each of the subsamples. The difference between the item parameters across the subsamples is tested with the usual z-test of difference:

z_i = \frac{\hat{\beta}_i^{(1)} - \hat{\beta}_i^{(2)}}{\sqrt{SE(\hat{\beta}_i^{(1)})^2 + SE(\hat{\beta}_i^{(2)})^2}}

where \hat{\beta}_i^{(1)} and \hat{\beta}_i^{(2)} are the parameter estimates of item i in the two subsamples and SE(\cdot) denotes their standard errors.
Present study
The goal of the present study is to develop global fit statistics for
checking the dichotomous Rasch model. In general, for descriptive fit
statistics to be useful they should meet several conditions. First, in the
absence of differential item functioning (DIF), that is, when items do not have different parameter estimates in different subsamples of examinees at the same location on the latent trait, the fit statistic should be near a constant value. Second, in the presence of DIF, the fit statistic should quantify the extent of DIF; in particular, it should become larger as the number of DIF items and the magnitude of DIF increase. Lastly, when
quantifying DIF, the fit statistic should not be affected by sample size.
More specifically, in the absence of DIF, the fit measure should be near a
constant value independent of the sample size while, in the presence of
DIF, the value should only quantify DIF without being affected by
sample size. In order to develop fit measures for testing the Rasch model,
the study investigated properties of various measures to evaluate if those
requirements are fulfilled.
METHOD
Proposed fit measures for testing the Rasch model
We propose several descriptive fit measures based on the principle of stability of item parameters across subsamples, which are then examined in a simulation study.
Root-mean-square deviation (RMSD). RMSD is the square root of
the mean square difference between item parameters estimated in two
subgroups after bringing them onto a common scale:

\mathrm{RMSD} = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(\hat{\beta}_i^{(1)} - \hat{\beta}_i^{(2)}\right)^2}

where \hat{\beta}_i^{(1)} is the parameter estimate of item i in the first subgroup (e.g., examinees with low scores), \hat{\beta}_i^{(2)} is the estimate in the second subgroup (e.g., examinees with high scores), and k is the number of items. Following the rationale of Andersen's LR test, if the Rasch
model holds in the population, equivalent item parameter estimates
should be obtained, apart from sampling error, which means the RMSD
should be close to zero.
Standardized root-mean-square deviation (SRMSD). SRMSD is the
RMSD divided by the pooled standard deviation (SD pooled) of item
parameters for both subgroups:

\mathrm{SRMSD} = \frac{\mathrm{RMSD}}{SD_{pooled}}

Likewise, if the Rasch model holds, the SRMSD should be near zero.
Normalized root-mean-square deviation (NRMSD). The NRMSD is
the RMSD divided by the range of estimated item parameters in both
subgroups:

\mathrm{NRMSD} = \frac{\mathrm{RMSD}}{\hat{\beta}_{max} - \hat{\beta}_{min}}

where \hat{\beta}_{max} and \hat{\beta}_{min} are the largest and smallest item parameter estimates across both subgroups. Again, if the Rasch model holds, the NRMSD should be near zero.
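To make these three measures concrete, the following is a minimal R sketch (not the authors' code) that computes RMSD, SRMSD, and NRMSD from two vectors of item parameter estimates assumed to be already on a common scale. The pooled SD is taken as the root mean of the two subgroup variances, which is an assumption, since the pooling formula is not reproduced above.

# RMSD, SRMSD, and NRMSD from item parameters estimated in two subgroups
rmsd_measures <- function(beta1, beta2) {
  k <- length(beta1)                                          # number of items
  rmsd <- sqrt(sum((beta1 - beta2)^2) / k)                    # root-mean-square deviation
  sd_pooled <- sqrt((var(beta1) + var(beta2)) / 2)            # assumed pooling of the two SDs
  range_both <- max(c(beta1, beta2)) - min(c(beta1, beta2))   # range across both subgroups
  c(RMSD = rmsd, SRMSD = rmsd / sd_pooled, NRMSD = rmsd / range_both)
}

# Example with two small, hypothetical sets of item parameter estimates
rmsd_measures(c(-1.2, -0.4, 0.3, 1.1), c(-1.0, -0.5, 0.4, 1.3))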
Chi-square to degrees of freedom ratio (χ²/df). The chi-square to degrees of freedom ratio is commonly applied in the framework of structural equation modeling (SEM) to assess model fit (see West, 2012). The rationale is that the expected value of the χ² statistic for a correct model equals its degrees of freedom. Thus, if the Rasch model holds, χ²/df should be close to one. The current study investigated χ²/df for both Andersen's LR test and Fischer and Scheiblechner's S statistic.
Root mean square error of approximation (RMSEA). The RMSEA, another index adopted from the SEM framework, is computed from the chi-square statistic and its degrees of freedom. When the chi-square is less than the degrees of freedom, the RMSEA
is set to zero. In the current study, the RMSEA based on both the
Andersen's LR test and Fischer and Scheiblechner's S statistic is
investigated. If the Rasch model holds, the RMSEA should be near zero.
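For reference, a common formulation of the RMSEA in the SEM literature, which is consistent with setting the index to zero whenever the chi-square falls below its degrees of freedom, is the following (conventions differ on whether N or N - 1 appears in the denominator; N - 1 is used here as an assumption rather than as the formula reproduced from the original):

\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2 - df,\ 0)}{df\,(N - 1)}}

where N is the number of examinees.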
Simulation study
In order to investigate the properties of the proposed measures,
simulations based on two general conditions were carried out: (1) without differential item functioning (the null hypothesis conditions) and (2) with differential item functioning (the alternative hypothesis conditions). In both
conditions, data were simulated with n = 100, 200, 300, 400, 500, 600,
700, 800, 900, and 1,000 examinees in combination with k = 10, 20, 30,
40, and 50 items. In the alternative hypothesis conditions, data were
simulated with eight DIF items. The magnitude of DIF was 0.6, that is, 1/10 of the range of the simulated item parameters.
The item parameters were set as equally spaced within the interval [-3, 3], which corresponds to the whole spectrum of item difficulties that arise in practice. The person parameters of examinees were randomly drawn from N(0, 1.5), again corresponding to the values of person parameters that are likely to occur in practice. Simulations were conducted in R (R Core Team, 2015) using the eRm package (Mair & Hatzinger, 2015).
In order to compute the proposed fit statistics, data sets were divided
into high scorers and low scorers, based on the mean of the raw scores.
Next, the item parameters were estimated separately in the two
subsamples. Lastly, the item parameters were brought onto a common scale.
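As an illustration of one replication of this design, the following R sketch (not the authors' code) simulates Rasch data with the eRm package, runs Andersen's LR test with a mean raw-score split, and forms the χ²/df ratio and an RMSEA. The person SD of 1.5 and the SEM-style RMSEA formula are assumptions; the N(0, 1.5) notation above leaves open whether 1.5 is the SD or the variance.

library(eRm)

set.seed(1)
n <- 500                                    # sample size
k <- 20                                     # test length
item_pars   <- seq(-3, 3, length.out = k)   # equally spaced item parameters in [-3, 3]
person_pars <- rnorm(n, mean = 0, sd = 1.5) # person parameters (SD of 1.5 is assumed)

X   <- sim.rasch(person_pars, item_pars)    # simulate dichotomous responses
fit <- RM(X)                                # fit the Rasch model (CML estimation)
lr  <- LRtest(fit, splitcr = "mean")        # Andersen LR test, mean raw-score split

chisq_df <- lr$LR / lr$df                                   # chi-square / df ratio
rmsea    <- sqrt(max(lr$LR - lr$df, 0) / (lr$df * (n - 1))) # assumed SEM-style RMSEA
c(chisq_df = chisq_df, RMSEA = rmsea)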
In each condition, the fit statistic in question was computed for 10,000 replications. In addition, for each fit statistic, we computed the mean, standard deviation, minimum, and maximum over all replications.
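A sketch of this replication-and-summary step, wrapping the single-replication code above into a function (the wrapper name and the reduced number of replications are illustrative choices, not the authors' setup):

library(eRm)

one_replication <- function(n, k) {
  X  <- sim.rasch(rnorm(n, 0, 1.5), seq(-3, 3, length.out = k))
  lr <- LRtest(RM(X), splitcr = "mean")
  lr$LR / lr$df                             # return the chi-square / df ratio
}

# 10,000 replications were used in the study; 100 are used here to keep the run short
vals <- replicate(100, one_replication(n = 500, k = 20))
c(mean = mean(vals), sd = sd(vals), min = min(vals), max = max(vals))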
RESULTS
Null hypothesis condition
First, the null hypothesis conditions were investigated, that is, those without differential item functioning. As for the RMSD, the results revealed
that this fit statistic is highly dependent on sample size; the larger the
sample size, the lower the fit statistic. For example, while the mean
FIGURE 1 Mean LR test χ²/df statistic for N = 100, 200, 300, 400, 500, 600, 700, 800, 900, and 1,000 and k = 10, 20, 30, 40, and 50 items under the null hypothesis condition (left panel) and the alternative hypothesis condition (right panel) with 8 DIF items
¹ A table which depicts the mean of all proposed statistics in the null hypothesis condition (when there is no DIF) for different sample sizes and test lengths can be obtained from the authors.
when sample size is lower than N = 400 in the case of k = 10 and lower
than 300 in the case of k > 10. For instance, the Andersen RMSEA for
N = 100 and k = 10 is 0.03, while this value drops to 0.01 for N = 400.
These properties of the investigated fit statistics seem to hold for k > 10.
The results for the null hypothesis condition and the alternative hypothesis condition with eight DIF items are shown in Figure 1.
In sum, the results suggest that RMSD, SRMSD, and NRMSD are not
suitable as fit statistics because they are highly dependent on sample size
in the absence of DIF. For this reason, only χ²/df and RMSEA will be
discussed in the alternative hypothesis condition.
With eight DIF items in a 50-item test, i.e., 16% of the items, the mean χ²/df value is 1.15 for a sample size of 100. Therefore, the interpretation of χ²/df depends on the amount of DIF we
are ready to accept in the data. If we consider 100 as the smallest acceptable sample size for conducting Rasch model analysis and 0% as the smallest tolerable magnitude of DIF in the data, we need χ²/df values a lot smaller than 1.15. Therefore, a maximum value of 1.03 (the largest value for χ²/df in the null hypothesis condition, where there was no DIF) should indicate perfect fit to the Rasch model. However, note that this value has different standard errors for different test lengths in the null hypothesis condition, which allows for more generous cut-off values.²
FIGURE 2 Mean LR test RMSEA statistic for N = 100, 200, 300, 400, 500, 600, 700, 800, 900, and 1,000 and k = 10, 20, 30, 40, and 50 items under the null hypothesis condition (left panel) and the alternative hypothesis condition (right panel) with 8 DIF items
² A table which depicts the mean of χ²/df and RMSEA for the Andersen LR test and the S statistic in the alternative hypothesis condition (with eight DIF items) for different sample sizes and test lengths can be obtained from the authors.
DISCUSSION
In this study, an attempt was made to develop descriptive measures of
fit for the dichotomous Rasch model. Accordingly, a number of fit
statistics based on the property of parameter invariance of the Rasch
model were evaluated in a simulation study. Furthermore, the simulation
studies were carried out under the specific conditions of test length and
sample size.
Most of the available global model fit measures are based on
statistical hypothesis testing. Such fit assessment procedures are sensitive to sample size because statistical power increases with larger samples. Furthermore, such methods test for perfect fit of the data to the Rasch model. In this study, a descriptive measure, namely Andersen's χ²/df, is suggested to evaluate the overall fit of data to the Rasch model. The proposed method in this
study is not based on statistical null hypothesis testing and is independent
of sample size. Based on simulation studies, cut-off values for the
statistic for different test lengths are suggested. The statistic is a
complement to the available fit statistics based on null hypothesis testing
and not a replacement.
Results showed that while all the fit statistics are more or less
independent of test length in the null hypothesis condition, three of them (RMSD, SRMSD, and NRMSD) are dependent on sample size.
The means of these statistics vary substantially across sample sizes, and
therefore do not meet the requirements we specified above for efficient
fit values. Meanwhile, the other four measures (Andersen χ²/df, S statistic χ²/df, Andersen RMSEA, and S statistic RMSEA) are independent of sample size in the null hypothesis condition. In this condition, the mean values for Andersen χ²/df and S statistic χ²/df are near one, and for Andersen RMSEA and S statistic RMSEA, they are near zero across all sample sizes. However, the S statistic χ²/df seems to
be dependent on the test length to some degree, as the value for a test
length of 10 is around 1.10 but the value approaches one as test length
increases. However, the problem with Andersen RMSEA and S statistic
RMSEA values is that these measures, although robust against sample size and test length, are insensitive to model violations. In the H1 condition, where the Rasch model does not hold, these values are around .10 (k = 10) and .06 and .04 when k = 30 and k = 40, respectively. This
indicates that there is not much difference in these values in the H0 and
H1 conditions, which limits their utility as indicators of model violation.
Hence, the practical measure seems to be Andersen's χ²/df, as it is
near one in the H0 condition across all sample sizes and test lengths and
noticeably deviates from one in the H1 condition. The standard deviation
of this measure, however, varies across different test lengths, which
restricts building a single confidence interval for use in applied settings.
Therefore, we need to devise different cut-off values depending on the
test length. Using the mean standard errors across all sample sizes, the one-sided 68% confidence intervals in Table 1 can be built as cut-off values for Andersen's χ²/df for different test lengths.
The reason for building 68% confidence intervals instead of 95% was
to lower the chances of false acceptance of the Rasch model based on the
suggested fit measure. If Andersen's χ²/df exceeds these values for different test lengths, the Rasch model should be rejected, whereas if it falls below these values, the Rasch model is retained. No special software is needed to use this fit statistic: if a Rasch model package computes Andersen's LR test, the statistic can easily be obtained by dividing the chi-square value by its associated degrees of freedom.
TABLE 1 One-Sided 68% Confidence Interval Cut-Off Values for Andersen's χ²/df for Different Test Lengths

k     Cut-off value
10    1.45
20    1.32
30    1.26
40    1.23
50    1.20

Note. k = number of items
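As a brief illustration of how such a cut-off could be applied in practice with the eRm package, the sketch below fits the Rasch model to a dichotomous response matrix resp (a hypothetical object name), computes Andersen's χ²/df with a mean raw-score split, and compares it with the Table 1 cut-off for the corresponding test length; this is not the authors' code.

library(eRm)

cutoffs <- c(`10` = 1.45, `20` = 1.32, `30` = 1.26, `40` = 1.23, `50` = 1.20)

fit <- RM(resp)                        # resp: hypothetical persons x items matrix of 0/1 responses
lr  <- LRtest(fit, splitcr = "mean")   # Andersen LR test, mean raw-score split
chisq_df <- lr$LR / lr$df

k <- ncol(resp)                        # works directly for k = 10, 20, 30, 40, or 50
chisq_df <= cutoffs[as.character(k)]   # TRUE suggests acceptable fit by this criterion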
REFERENCES
Andersen, E. B. (1973). A goodness of fit test for the Rasch model.
Psychometrika, 38, 123-140.
Baghaei, P. (2009). Understanding the Rasch model. Mashhad: Mashhad Islamic
Azad University Press.
Draxler, C. (2010). Sample size determination for Rasch model tests.
Psychometrika, 75, 708-724.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum.
Fischer, G. H., & Scheiblechner, H. (1970). Algorithmen und Programme fuer
das probabilistische Testmodell von Rasch [Algorithms and programs for
Rasch's probabilistic test model]. Psychologische Beitraege, 12, 23-51.
Fischer, G. H. (2006). Rasch models. In C. Rao & S. Sinharay (Eds.), Handbook of
statistics, Volume 26: Psychometrics (pp. 979-1027). Amsterdam, The
Netherlands: Elsevier.
Glas, C. A. W. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika, 53, 525-546.
Gustafsson, J. E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205-233.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of
item response theory. Newbury Park, CA: Sage.
Karabatsos, G. (2000). A critique of Rasch residual fit statistics. Journal of
Applied Measurement, 1, 152-176.
Kubinger, K. D. (2005). Psychological test calibration using the Rasch model:
some critical suggestions on traditional approaches. International Journal of
Testing, 5, 377-394.
Kubinger, K. D., Rasch, D., & Yanagida, T. (2009). On designing data-sampling
for Rasch model calibrating an achievement test. Psychology Science
Quarterly, 51, 370-384.
Kubinger, K. D., Rasch, D., & Yanagida, T. (2011). A new approach for testing
the Rasch model. Educational Research and Evaluation, 17, 321-333.
Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type
works best? Journal of Outcome Measurement, 2, 266-283.
Linacre, J. M. (2009). A user's guide to WINSTEPS. Chicago, IL: Winsteps.
Mair, P., & Hatzinger, R. (2015). eRm: Extended Rasch modeling. R package
version 0.15-5. http://erm.r-forge.r-project.org/
Martin-Löf, P. (1973). Statistiska modeller [Statistical models]. Anteckningar från seminarier läsåret 1969-1970, utarbetade av Rolf Sundberg. Obetydligt ändrat nytryck, oktober 1973. Stockholm: Institutet för Försäkringsmatematik och Matematisk Statistik vid Stockholms Universitet.
Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory
models. Measurement, 11, 71-101.
Molenaar, I. W., & Hoijtink, H. (1990). The many null distributions of person
fit indices. Psychometrika, 55, 75-106.