Test Bias
Myeongsun Yoon
Chris J. Price
In the English language there are many words and phrases that receive "extra pay" from those who, like Humpty Dumpty, give them many different meanings. One phrase in particular that has received a great deal of extra pay is test bias, which we found to have a multitude of meanings (Reynolds & Lowe, 2009). The purpose of this article is to explore the nature of test bias from the perspective of psychometrics (i.e., the science of mental testing). We intend to technically define the phrase, expound upon five different ways that test bias is often used, discuss the nature of item content as it relates to bias, and describe the benefits of standardized testing for diverse examinees, especially in the realm of education. We believe that this exploration is important because test bias is a hotly debated topic in education and psychology, but some of these debates have not been productive because those on opposing sides are often using the phrase differently (e.g., the exchange between Mercer, 1979, and Clarizio, 1979).
Although an article about semantics and terminology would
itself be useful, it would probably be of limited interest to the
According to Richert (2003), not only are tests biased, but the evidence is overwhelming that bias is an inherent characteristic of standardized tests. Similarly, Salend and his colleagues stated, "Research indicates that norm-referenced standardized tests are culturally and socially biased . . ." (Salend, Garrick Duhaney, & Montgomery, 2002, p. 290, emphasis added). Beliefs about the inherently biased nature of tests are also found among other authors (e.g., Mensch & Mensch, 1991), including some who are highly respected in their fields (e.g., Ford, 2003; Gould, 1981). Such claims are common in the journalistic media, too (Cronbach, 1975; Gottfredson, 1994; Phelps, 2003; Reynolds, 2000).
Others have a much more sinister view of standardized testing. Moss described her experience teaching at a high school where, "Most of my students were poor and African American . . ." (p. 217). She stated,

By the end of 13 years of experience, I became convinced that it did not matter how successful students of color became, the test would be revised to insure we start over in the cyclical process of teaching students how to demonstrate their ability to take culturally biased standardized tests. (Moss, 2008, p. 217)

For Moss, standardized tests are not just biased as some accident of their creation. Rather, the writers of the tests her students took were nefarious in their work, and the test creators intended to use the tests to discriminate against her students (see Carter & Goodwin, 1994; Mercer, 1979; and Smith, 2003, for a similar viewpoint of standardized tests).
Figure 1. Score distributions for two groups. The mean difference between the two groups' scores is .5 SD.
of the test results. The lesson from this thought experiment is that mean score gaps are not evidence of test bias because there may be other explanations of score gaps. In fact, score gaps may indicate that the test is operating exactly as it should and measures real differences among groups, as is the case with this hypothetical test of job satisfaction (Clarizio, 1979; Linn & Drasgow, 1987).
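As a minimal illustration of this thought experiment, the following sketch (all values hypothetical) simulates a test that applies the identical measurement process to everyone yet still reproduces a genuine 0.5-SD group difference in the underlying construct:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical latent construct (e.g., job satisfaction) on which the
# groups truly differ by 0.5 SD, as in Figure 1.
n = 10_000
true_a = rng.normal(0.0, 1.0, size=n)  # Group A
true_b = rng.normal(0.5, 1.0, size=n)  # Group B

# An unbiased test adds the same measurement error process for everyone.
obs_a = true_a + rng.normal(0.0, 0.5, size=n)
obs_b = true_b + rng.normal(0.0, 0.5, size=n)

# The observed gap mirrors the true gap (slightly attenuated by error),
# even though the test treats both groups identically.
pooled_sd = np.sqrt((obs_a.var() + obs_b.var()) / 2)
print(f"Observed gap: {(obs_b.mean() - obs_a.mean()) / pooled_sd:.2f} SD")
```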
Figure 2. (a) Scatterplots for the scores of two groups that share the same regression line. Notice that Group A on average obtains lower scores than members of Group B. (b) Scatterplots for the scores of two groups with different, but parallel, regression lines. The middle dashed line represents the regression line for both groups combined. Notice that Group A on average obtains lower scores, but the combined regression line predicts more favorable outcomes than would be expected from a regression line based solely on Group A's data. (c) Scatterplots for the scores of two groups with different, nonparallel regression lines. The middle dashed line represents the regression line for both groups combined.
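The regression definition of test bias depicted in Figure 2 (cf. Cleary, 1968; Reschly & Sabers, 1979) can be checked empirically by fitting a regression with group and interaction terms. The sketch below is illustrative only, using simulated data and hypothetical variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated, unbiased case: test scores predict a criterion (e.g., GPA)
# through one common regression line, as in Figure 2a.
n = 1_000
group = rng.integers(0, 2, size=n)              # 0 = Group A, 1 = Group B
test = rng.normal(loc=0.5 * group, scale=1.0)   # Group B averages 0.5 SD higher
criterion = 0.6 * test + rng.normal(scale=0.8, size=n)

df = pd.DataFrame({"criterion": criterion, "test": test, "group": group})

# Regression check of predictive bias: a significant group term signals
# intercept bias; a significant test:group interaction signals slope bias.
# With data generated from one common line, both should be near zero.
fit = smf.ols("criterion ~ test * group", data=df).fit()
print(fit.params)
```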
screened for DIF. Therefore, one must accept that the total test score is unbiased as a prerequisite to testing individual items for bias; then, once DIF is not found in any of the items, a researcher can assert that the total test score is not biased. This circular reasoning (i.e., where the test score must be assumed to be unbiased in order to determine that items are unbiased) has been rightly criticized by researchers (e.g., Navas-Ara & Gómez-Benito, 2002).

Many DIF procedures also create statistical problems: if one item with DIF is found but the total test score is used to match examinees across groups, then the advantage that a group receives from the DIF item must be balanced out by other items that favor the other group(s). This creates statistical artifacts in which items are incorrectly labeled as having DIF (Andrich & Hagquist, 2012).

These are issues that the testing field is still grappling with, and fully satisfactory solutions have not yet been found. DIF procedures based on latent variable methods such as confirmatory factor analysis and item response theory are promising and have reduced the severity of these statistical and logical problems. Nevertheless, DIF research has shed light on important questions about test construction and has undoubtedly made today's tests fairer than those of previous generations.
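One widely used DIF screen is the logistic regression procedure of Swaminathan and Rogers (1990), which conditions on the total score as the matching variable and thus inherits the circularity discussed above. A minimal sketch with simulated data (all names and values hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated responses to one item, matched on total test score.
n = 2_000
group = rng.integers(0, 2, size=n)
total = rng.normal(size=n)          # matching variable (total test score)

# Inject uniform DIF: at every ability level, group 1 enjoys a
# 0.8-logit advantage on this item.
logit = -0.5 + 1.2 * total + 0.8 * group
p_correct = 1.0 / (1.0 + np.exp(-logit))
correct = (rng.random(n) < p_correct).astype(int)

df = pd.DataFrame({"correct": correct, "total": total, "group": group})

# Logistic regression DIF screen: the group coefficient flags uniform
# DIF; the total:group interaction flags nonuniform DIF.
fit = smf.logit("correct ~ total * group", data=df).fit(disp=False)
print(fit.summary().tables[1])
```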
Figure 3. (a) A 7-item test with two factors: four items load onto the first factor and three items load onto the
second factor. (b) A 7-item test with all items loading on a single factor.
necessarily behave the same way for both groups of subjects, and test bias, as defined by AERA et al. (1999), may be present because the scores from the two groups would have different meanings. Differing factor structures may indicate a number of possibilities, including the following:

The test items may be interpreted differently by the two different groups.

The psychological construct (e.g., depression, personality, intelligence, language arts achievement) may have different structures for the two groups. The nature of the construct may vary across groups because of cultural, developmental, or other differences.

The test may measure completely different constructs for the two groups.
the same across groups (e.g., Beaujean, McGlaughlin, & Margulies, 2009; Benson, Hulac, & Kranzler, 2010; Dolan, 2000). Evaluations of factor structure are especially complex and require large datasets (Meredith, 1993). However, they are generally agreed to be the best methods of evaluating test bias (Borsboom, 2006). Because tests of invariance are somewhat new, several aspects of them remain unresolved. First, the results of a test of invariance are frequently not as clear and unambiguous as the hypothetical example shown in Figure 3, a and b. Rather, differences among factor structures are often a matter of degree, not of kind. It is often difficult to know what to do with a test when some parts of it operate the same across demographic groups and other parts do not (Millsap & Kwok, 2004). Also, just as with DIF, it is not always clear why factor structures vary across groups or why parts of a test function differently for different groups (Schmitt & Kuljanin, 2008).
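Formal invariance testing constrains parameters across groups within a single model (Meredith, 1993; Reise, Widaman, & Pugh, 1993). As an informal illustration only, and not a substitute for such constrained tests, one can fit the same one-factor model separately in each group and compare the estimated loadings. All data and loading values below are hypothetical:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)

def simulate(loadings, n=2_000):
    """One-factor data: each item = loading * common factor + unique error."""
    factor = rng.normal(size=(n, 1))
    errors = rng.normal(size=(n, len(loadings)))
    return factor @ np.atleast_2d(loadings) + errors

# Hypothetical 4-item scale: item 4 loads weakly for Group B only,
# a loading-level failure of invariance.
items_a = simulate([0.8, 0.7, 0.6, 0.7])
items_b = simulate([0.8, 0.7, 0.6, 0.2])

for label, items in (("Group A", items_a), ("Group B", items_b)):
    loadings = FactorAnalysis(n_components=1).fit(items).components_.ravel()
    loadings *= np.sign(loadings.sum())  # resolve the arbitrary sign flip
    print(label, np.round(loadings, 2))
```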
Figure 4. Trend in White-Black National Assessment of Educational Progress (NAEP) reading average scores and score gaps for 9-year-old students. Source: National Center for Education Statistics (2009, p. 14).
Conclusion
We now end this article the same way we began it, with an excerpt from Through the Looking-Glass:

"When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean, neither more nor less."

"The question is," said Alice, "whether you can make words mean so many different things." (Carroll, 1871/1917, p. 99)
References
ACT. (2007). The ACT technical manual. Retrieved from http://www.act.org/aap/pdf/ACT_Technical_Manual.pdf
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Long Grove, IL: Waveland Press.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Andrich, D., & Hagquist, C. (2012). Real and artificial differential item functioning. Journal of Educational and Behavioral Statistics, 37, 387-416. doi:10.3102/1076998611411913
Beaujean, A. A., McGlaughlin, S. M., & Margulies, A. S. (2009). Factorial validity of the Reynolds Intellectual Assessment Scales for referred students. Psychology in the Schools, 46, 932-950. doi:10.1002/pits.20435
Benson, N., Hulac, D. M., & Kranzler, J. H. (2010). Independent examination of the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV): What does the WAIS-IV measure? Psychological Assessment, 22, 121-130. doi:10.1037/a0017767
Bersoff, D. N. (1981). Testing and the law. American Psychologist, 36, 1047-1056. doi:10.1037/0003-066X.36.10.1047
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425-440. doi:10.1007/s11336-006-1447-6
Buckendahl, C. W., & Hunt, R. (2005). Whose rules? The relation between the rules and law of testing. In R. P. Phelps (Ed.), Defending standardized testing (pp. 147-158). Mahwah, NJ: Erlbaum.
Camara, W. J. (2009). College admission testing: Myths and realities in an age of admissions hype. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 147-180). Washington, DC: American Psychological Association. doi:10.1037/11861-004
Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 221-256). Westport, CT: Praeger.
Carlson, J. F., & Geisinger, K. F. (2009). Psychological diagnostic testing: Addressing challenges in clinical applications of testing. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 67-88). Washington, DC: American Psychological Association. doi:10.1037/11861-002
Carroll, L. (1871/1917). Through the looking-glass and what Alice found there. New York, NY: Rand McNally.
Carter, R. T., & Goodwin, A. L. (1994). Racial identity and education. Review of Research in Education, 20, 291-336. doi:10.2307/1167387
Cizek, G. J., Fitzgerald, S. M., & Rachor, R. A. (1995). Teachers' assessment practices: Preparation, isolation, and the kitchen sink. Educational Assessment, 3, 159-179. doi:10.1207/s15326977ea0302_3
Clarizio, H. F. (1979). In defense of the IQ test. School Psychology Review, 8, 79-88.
Cleary, T. A. (1968). Test bias: Prediction of Negro and White students in integrated colleges. Journal of Educational Measurement, 5, 115-124. doi:10.1111/j.1745-3984.1968.tb00613.x
Cleary, T. A., & Hilton, T. L. (1968). An investigation of item bias. Educational and Psychological Measurement, 28, 61-75. doi:10.1177/001316446802800106
Cleary, T. A., Humphreys, L. G., Kendrick, S. A., & Wesman, A. (1975). Educational uses of tests with disadvantaged students. American Psychologist, 30, 15-41. doi:10.1037/0003-066X.30.1.15
Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Mason, OH: Cengage Learning.
Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30, 1-14. doi:10.1037/0003-066X.30.1.1
Cronbach, L. J. (1980). Selection theory for a political world. Public Personnel Management, 9, 37-50.
Dolan, C. V. (2000). Investigating Spearman's hypothesis by means of multi-group confirmatory factor analysis. Multivariate Behavioral Research, 35, 21-50. doi:10.1207/S15327906MBR3501_2
Popham, W. J. (1997). Consequential validity: Right concern, wrong concept. Educational Measurement: Issues and Practice, 16, 9-13. doi:10.1111/j.1745-3992.1997.tb00586.x
Posselt, J. R., Jaquette, O., Bielby, R., & Bastedo, M. N. (2012). Access without equity: Longitudinal analyses of institutional stratification by race and ethnicity, 1972-2004. American Educational Research Journal, 49, 1074-1111. doi:10.3102/0002831212439456
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566. doi:10.1037/0033-2909.114.3.552
Reschly, D. J. (1980). Psychological evidence in the Larry P. opinion: A case of right problem, wrong solution? School Psychology Review, 9, 123-135.
Reschly, D. J., & Sabers, D. L. (1979). Analysis of test bias in four groups with the regression definition. Journal of Educational Measurement, 16, 1-9. doi:10.1111/j.1745-3984.1979.tb00080.x
Reynolds, C. R. (1980). An examination of bias in a preschool battery across race and sex. Journal of Educational Measurement, 17, 137-146. doi:10.1111/j.1745-3984.1980.tb00822.x
Reynolds, C. R. (2000). Why is psychometric research on bias in mental testing so often ignored? Psychology, Public Policy, and Law, 6, 144-150. doi:10.1037/1076-8971.6.1.144
Reynolds, C. R., & Lowe, P. A. (2009). The problem of bias in psychological assessment. In C. R. Reynolds & T. B. Gutkin (Eds.), The handbook of school psychology (pp. 332-374). New York, NY: Wiley.
Richert, E. S. (2003). Excellence with justice in identification and programming. In N. Colangelo & G. A. Davis (Eds.), Handbook of gifted education (3rd ed., pp. 146-158). Boston, MA: Allyn & Bacon.
Roid, G. H. (2003). Stanford-Binet Intelligence Scales, fifth edition, technical manual. Itasca, IL: Riverside Publishing.
Rushton, J. P., & Jensen, A. R. (2005). Thirty years of research on race differences in cognitive ability. Psychology, Public Policy, and Law, 11, 235-294. doi:10.1037/1076-8971.11.2.235
Salend, S. J., Garrick Duhaney, L. M., & Montgomery, W. (2002). A comprehensive approach to identifying and addressing issues of disproportionate representation. Remedial and Special Education, 23, 289-299. doi:10.1177/07419325020230050401
Santor, D. A., Ramsay, J. O., & Zuroff, D. C. (1994). Nonparametric item analyses of the Beck Depression Inventory: Evaluating gender item bias and response option weights. Psychological Assessment, 6, 255-270. doi:10.1037/1040-3590.6.3.255
Schafer, W. D. (2000). GI Forum v. Texas Education Agency: Observations for states. Applied Measurement in Education, 13, 411-418. doi:10.1207/S15324818AME1304_07
Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27, 67-81. doi:10.1111/j.1745-3984.1990.tb00735.x
Schmitt, N., & Kuljanin, G. (2008). Measurement invariance: Review of practice and implications. Human Resource Management Review, 18, 210-222. doi:10.1016/j.hrmr.2008.03.003
Sireci, S. G. (2005). The most frequently unasked questions about testing. In R. P. Phelps (Ed.), Defending standardized testing (pp. 111-121). Mahwah, NJ: Erlbaum.
Smith, R. A. (2003). Race, poverty, & special education: Apprenticeships for prison work. Poverty & Race, 12, 1-4.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370. doi:10.1111/j.1745-3984.1990.tb00754.x
Terman, L. M. (1928). The influence of nature and nurture upon intelligence scores: An evaluation of the evidence in Part I of the 1928 Yearbook of the National Society for the Study of Education. Journal of Educational Psychology, 19, 362-373. doi:10.1037/h0071466