Replacing Paper-Based Testing With Computer-Based Testing in Assessment: Are We Doing Wrong?
Abstract
The standards for developing computerized assessments require that equivalent test scores be established for the paper-based test (PBT) and computer-based test (CBT) modes. In many studies, however, significant differences in test scores were observed even though the two modes were nearly identical, and the validity of replacing PBT with CBT in educational assessment has therefore been questioned. This study employed an achievement test, a psychological test and a motivation questionnaire in a Solomon four-group design to examine the validity of CBT and its effects on test performance and testing motivation. The findings of this study provide evidence on the issue of CBT validity in educational and psychological assessment.
© 2012 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of The Association Science Education and Technology. Open access under CC BY-NC-ND license.
Keywords: Assessment; computer-based testing; testing effect; performance; motivation
1. Introduction
Interest in developing and using computer-based testing (CBT) in educational assessment in schools and educational institutions has heightened in recent years. Delivering assessments via computer is becoming increasingly prevalent in the educational assessment domain as changes are made in assessment methodologies to reflect practical changes in pedagogical methods (Kate Tzu, 2012; Genc, 2012; Hsiao, Tu & Chung, 2012; OECD, 2010). CBT is seen as a catalyst for change, bringing transformation of learning, pedagogy and curricula in educational institutions (Scheuermann & Pereira, 2008).
To establish a valid and reliable CBT, the International Guidelines on Computer-Based Testing (International Test Commission, 2004) state that equivalent test scores should be established for the conventional paper-based testing (PBT) mode and its computer-based counterpart. This set of testing standards is supported by classical true-score test theory, the basis of both computer-based and paper-based testing (Allen & Yen, 1979). Under this theory, a test taker who takes the same test in the two modes is expected to obtain nearly identical test scores. The standards are also supported by empirical studies (OECD, 2010;
Wilson, Genco, & Yager, 1985). For example, the OECD (2010) reported that there was no difference in test performance between CBT and PBT among student participants (n = 5,878) from Denmark, Iceland and Korea.
Interestingly, however, in a review of educational and psychological measurement approaches, Bunderson, Inouye & Olsen (1989) reported that 48% of previous studies showed no difference between the two testing modes in test performance, 13% showed the superiority of CBT, and 39% showed that PBT was superior. The concept of equivalence was thus supported by only about half of the studies. Differences were found in achievement tests, such as science, language and mathematics tests, and even more markedly in psychological tests, such as personality and neuropsychological assessments (e.g. Friedrich & Bjornsson, 2008; Choi, Kim & Boo, 2003; DeAngelis, 2000).
A possible explanation for this phenomenon is that either CBT has low validity as an assessment tool for educational and psychological measurement, or another effect confounded the effect of testing mode on test performance in these repeated-measures studies. As observed by Yu & Ohlund (2010), a possible confounding variable is the testing effect: the effect of taking a pretest on taking a posttest, which systematically confounds the treatment effect of CBT on test performance.
3. Effects of testing motivation on the relationship between testing modes and test performance
Another issue that needs to be clarified in a PBT and CBT comparability study, as raised by Wise and DeMars (2003), is that motivational factors might also have an impact on test performance. Wise and DeMars pointed out that regardless of how much psychometric care is applied to test development, or how equal the testing modes are, to the extent that test takers are not motivated to respond to the test (e.g. due to low efficacy or boredom), test score validity will be compromised. The test taker motivation model (Pintrich, 1989) specifies that the effort test takers direct towards a test is a function of how well they feel they will do on the test, how they perceive the test, and their affective reactions to the test. This is the theoretical model that underlies the relationship among motivation, testing mode and test performance. In addition, self-determination theory (Wenemark, Persson, Brage, Svensson & Kristenson, 2011) states that increasing test-takers' motivation will increase their willingness to take the test and their response rates, and thus enhance learning. Therefore, testing motivation is an aspect worth investigating in testing mode comparability studies because it can pose a threat to the validity of inferences made regarding assessment results (Shuttleworth, 2009).
One of the barriers to the implementation of CBT in educational and psychological measurements in
education is insufficient study of the equivalence of CBT and PBT (Bugbee, 1996). To overcome the
potential for misinterpreting experimental results caused by testing effects, Yu & Ohlund (2010) strongly
recommended the use of the Solomon four-group design. This design helps researchers to detect the
occurrence of testing effects in an experimental study. Therefore, this study employed a Solomon four-
group experimental design to examine the validity and effectiveness of CBT by comparing it with PBT. It examined whether testing effects occur in CBT and PBT, and investigated the effects of testing motivation on the relationship between testing mode and test performance.
4. Method
Fig. 1. Design of the study (Note: M = measurement)
To analyse the data for this design, two steps are needed: (1) an independent-samples t-test is performed to identify testing effects (M4–M3 or M6–M5), and (2) a split-plot ANOVA is carried out to identify treatment effects. A CBT treatment effect is detected if a significant interaction effect occurs. The split-plot ANOVA is one of the most powerful quantitative research methods for testing causal hypotheses (Yu & Ohlund, 2010; Chua, 2009a).
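As a concrete illustration, the two steps might be run along the following lines in Python; this is a minimal sketch, not the study's actual analysis script, and the data file, column names (subject, group, time, score) and group labels are all hypothetical, with the scipy and pingouin libraries assumed to be available.

import pandas as pd
from scipy import stats
import pingouin as pg

# Hypothetical long-format data: one row per measurement, with columns
# subject, group, time ('pre'/'post') and score.
df = pd.read_csv("solomon_scores.csv")

# Step 1: independent-samples t-test for testing effects, e.g. posttest
# scores of the pretested control group (M4) vs the unpretested control
# group (M3); the same comparison is repeated on the treatment side (M6 vs M5).
m4 = df.query("group == 'control_pretested' and time == 'post'")["score"]
m3 = df.query("group == 'control_unpretested' and time == 'post'")["score"]
t, p = stats.ttest_ind(m4, m3)
print(f"Testing effect (M4 vs M3): t = {t:.2f}, p = {p:.3f}")

# Step 2: split-plot (mixed) ANOVA on the two pretested groups; a CBT
# treatment effect shows up as a significant group x time interaction.
pretested = df[df["group"].isin(["control_pretested", "treatment_pretested"])]
aov = pg.mixed_anova(data=pretested, dv="score", within="time",
                     between="group", subject="subject")
print(aov[["Source", "F", "p-unc"]])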
Fig. 2. An Example Test Item and the Results of the Test in Graphical Form
(c) The Testing Motivation Questionnaire - The third instrument is the adapted version of the
Testing Motivation Questionnaire or TMQ (Wigfield, Guthrie & McGough, 1996) (see Appendix A). It
measured overall testing motivation and four motivation components (self-efficacy, extrinsic, intrinsic
and social motivations) of the participants towards the two testing modes for comparison. The
components consist of eleven dimensions of motivation. Challenge and efficacy are categorised under
self-efficacy motivation. Curiosity, involvement, importance and work avoidance are categorised under
intrinsic motivation. Competition, recognition and grades are listed under extrinsic motivation, and finally
social and compliance are the dimensions of social motivation. Although questions have been raised
about the factor structure of the motivation dimensions (Watkins & Coffey, 2004), several studies
examining its validity and reliability have supported these eleven dimensions (Parault & Williams, 2009;
Unrau & Schlackman, 2006; Wigfield & Guthrie, 1997). Based on the motivation dimensions, Wigfield,
Guthrie & McGough (1996) developed a 54-item motivation questionnaire to examine a group of
students’ reading motivation. Since motivation is a universal human behaviour and is identical across disciplines (Guthrie & Wigfield, 1999, p. 199), the eleven dimensions were adapted for this study as the dimensions of testing motivation. The TMQ uses a five-point Likert scale to assess participants’ motivation towards the two testing modes, with scores ranging from 1 (very different from me) to 5 (a lot like me). The internal consistency reliabilities (Cronbach’s alpha) for the eleven motivation dimensions in the PBT and CBT versions ranged between .72 and .83.
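The Cronbach's alpha statistic behind these reliability figures can be computed directly from an item-score matrix; the following is a minimal sketch of the standard formula, in which the example respondents and scores are hypothetical.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Example: 6 hypothetical respondents answering a 4-item dimension on a
# 1-5 Likert scale (rows = respondents, columns = items).
scores = np.array([[4, 5, 4, 5],
                   [3, 3, 4, 3],
                   [5, 5, 5, 4],
                   [2, 3, 2, 3],
                   [4, 4, 5, 4],
                   [3, 2, 3, 3]])
print(f"alpha = {cronbach_alpha(scores):.2f}")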
4.3. Participants
The participants in this study were 140 Malaysian undergraduate student teachers from a teacher
training institute located in Peninsular Malaysia. Among the participants, there were 61 males (43.57%)
and 79 females (56.43%) with an average age of 21 years. The participants were randomly selected from
a student teacher population (N = 219) based on the sample size determination table of Krejcie and
Morgan (Chua, 2011b, p. 211) at a 95% confidence level (p < .05); the computation behind that table is sketched after this paragraph. They were enrolled in a teacher education programme (mathematics and science) and had the same educational history and background. They had the same level of computer application skills and had received formal computer instruction in their academic curriculum. Based on their performance in a biology monthly test and the recommendations of their lecturers, student teachers with similar abilities were arranged into 35 equivalent groups (each with four equivalent participants). The four participants in each group were then assigned, through a simple random sampling procedure, to four groups, each with a sample size of 35. The four groups were then randomly assigned as two control and two treatment groups for the experimental study.
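As promised above, here is a minimal sketch of the sample-size formula that underlies the Krejcie and Morgan table, using its conventional parameters (chi-square = 3.841 at the 95% confidence level, population proportion P = .5, margin of error d = .05); for this study's population of N = 219 it reproduces the sample of 140.

import math

def krejcie_morgan(N: int, chi2: float = 3.841, P: float = 0.5,
                   d: float = 0.05) -> int:
    """s = chi^2 * N * P(1-P) / (d^2 * (N-1) + chi^2 * P(1-P))."""
    s = (chi2 * N * P * (1 - P)) / (d ** 2 * (N - 1) + chi2 * P * (1 - P))
    return math.ceil(s)

print(krejcie_morgan(219))  # -> 140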
4.4. Procedures
In the first phase, control group 2 answered the PBT modes of the Biology Test and the YBRAINS test, and treatment group 2 answered their CBT modes (pretests for test performance). Immediately after the tests, the two groups answered the TMQ questionnaire to identify their motivation towards the two testing modes (pretests for testing motivation). Two weeks later, in the second phase, all four groups answered the Biology Test and the YBRAINS test. The two control groups answered the PBT modes and the two treatment groups answered the CBT modes (posttests for test performance). Immediately after the tests, the four groups answered the same TMQ questionnaire to identify their motivation towards the two testing modes (posttests for testing motivation).
A key advantage of the control-treatment repeated-measures experimental design is that individual differences between participants are removed as a potential confounding variable during the course of the experiment (PsychoMetrics, 2010). These individual differences include history and maturation effects. History effects refer to external events (e.g. reading books, watching TV programmes or exposure to other sources) that can affect the responses of the research participants, while maturation effects refer to changes in a participant's behaviour during the course of the experiment (Chua, 2009b; Dane, 1990).
5. Results
Table 1. Testing effects for PBT and CBT modes on test performance and testing motivation
The data also indicate that significant treatment effects occurred in five of the eleven testing motivation dimensions, namely challenge, efficacy, curiosity, involvement and social motivation, and the treatment effect sizes were medium to large (d values between .57 and 1.37). This indicates that the CBT mode significantly increased the motivation level of the participants.
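The effect size reported here is Cohen's d (Cohen, 1988). A minimal sketch of the usual pooled-variance computation follows; the two score vectors are hypothetical, not the study's data.

import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Pooled-variance Cohen's d: (mean(x) - mean(y)) / pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) \
                 / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Hypothetical posttest motivation scores for a CBT group and a PBT group.
cbt = np.array([4.2, 4.5, 3.9, 4.8, 4.1])
pbt = np.array([3.6, 3.9, 3.4, 4.0, 3.7])
print(f"d = {cohens_d(cbt, pbt):.2f}")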
Table 2. Split-Plot ANOVA analysis results for the effect of CBT on test performance and testing motivation
To further understand the association between test performance and testing motivation, a Pearson product-moment inter-correlation test was conducted (see Table 3). In addition, since there was a treatment effect of CBT on testing motivation, an analysis of covariance (see Table 4) was performed to identify whether testing motivation is a moderator variable for the association between testing mode and test performance; both analyses are sketched at the end of this section.
Table 3. Pearson product-moment inter-correlations between test performance and testing motivation

                        Test Performance
                        Biology Score    Critical Style    Creative Style
Testing Motivation      -.20             -.17              .13
Table 4. Analysis of Covariance for testing motivation towards the effect of CBT on test performance
Table 3 indicates that there were no significant correlations between the three test performance scores and testing motivation. This means that answering the test with greater testing motivation would not necessarily help a test taker to achieve a higher test performance score. Furthermore, the data in Table 4 show that there were no significant main effects of CBT on the three test performance scores, and testing motivation was not a significant moderator of the effect of CBT on test performance in the achievement test and the psychological test.
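As noted above, here is a minimal sketch of the correlation and covariance analyses behind Tables 3 and 4; the data file and column names (mode, motivation and the three score columns) are hypothetical, with scipy and statsmodels assumed to be available.

import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("posttest_scores.csv")  # hypothetical data file

# Pearson product-moment correlations (Table 3): testing motivation
# against each of the three test performance scores.
for score in ["biology", "critical_style", "creative_style"]:
    r, p = stats.pearsonr(df["motivation"], df[score])
    print(f"{score}: r = {r:.2f}, p = {p:.3f}")

# ANCOVA (Table 4): testing mode as factor, testing motivation as
# covariate; a significant mode x motivation interaction would mark
# motivation as a moderator of the mode effect.
model = smf.ols("biology ~ C(mode) * motivation", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))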
6. Discussion
Results of the analyses indicate that no significant testing or treatment effects were found for test performance in the two testing modes. In other words, the test scores were consistent over time and across the two testing modes, showing that a participant who sits for both the CBT and the PBT would most probably obtain similar pretest and posttest scores. The two CBT tests are therefore valid in terms of test performance and can be used as replacements for their PBT versions.
The results also indicate that the achievement test and the psychological test have fulfilled the requirements of the international guidelines on computer-based testing (International Test Commission, 2004) and are consistent with true-score test theory (Allen & Yen, 1979), which requires parallel tests to show nearly equal mean scores. The results do not support the suggestion of some researchers (e.g. Clariana & Wallace, 2002) that it is unnecessary for equivalent measures to be produced from the CBT and PBT versions; rather, they suggest that it is the responsibility of instructional designers to craft high-quality CBTs that parallel the conventional PBTs, and to pilot test them extensively to ensure equivalence before implementing computer-based testing.
The results of this study also provide an explanation for why some previous studies revealed a significant difference between the two testing modes in test performance although, theoretically, no difference should be observed. Testing effects can occur in testing mode comparability studies, yet none were identified and reported by the researchers of past studies; instead, those studies reported significant treatment effects. For those researchers to conclude that CBT has an effect on the experimental variables (test performance) is misleading, because there is a possibility that the changes in the experimental variables were caused by testing effects rather than by treatment effects. Thus, the findings of these studies might have been jeopardised by testing effects and misinterpreted.
The findings also show that the CBT mode is stable and consistent in terms of internal and external validity, because no testing effects were found in any of the four testing motivation components. As for treatment effects, the results indicate a significant treatment effect on testing motivation: the CBT increased the participants' self-efficacy, intrinsic and social motivation. This reflects the ability of the CBT to stimulate the participants to answer the posttest with higher concentration. However, answering the tests with greater testing motivation did not help test takers achieve higher scores; no significant treatment effects on test performance were found in the two tests. This is another interesting finding: testing motivation is not a catalyst for the effect of testing mode on test performance. The study rejects the prediction of some previous studies that the motivation level of test takers answering the CBT and PBT might have an impact on test performance (e.g. Wise & DeMars, 2003), and provides evidence that testing motivation is not a moderator of the relationship between testing mode and test performance. This is consistent with the finding of the OECD (2010) that the effects of motivational factors on the relationship between testing mode and test performance are insignificant, being either very weak or non-existent.
Since testing is an aid to learning and a practice that is part and parcel of a good educational system, an advantage of using CBT, as shown by this study, is that it produces more valid test results for repeated measures and increases test-takers' motivation, which will, in turn, heighten their willingness to be tested and increase testing participation rates. Based on the results of this study, computer-based testing can be used as a valid replacement for conventional paper-based testing in educational institutions.
References
Al-Amri, S. (2008). Computer-based testing vs. paper-based testing: A comprehensive approach to
examining the comparability of testing modes. Essex Graduate Student Papers in Language and
Linguistics, 10, 22–44. Retrieved January 28, 2012 from http://www.essex.ac.uk/linguistics/publications/egspll/volume_10/pdf/EGSPLL10_22-44SAA_web.pdf
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.
Bugbee, A. C. (1996). The equivalence of paper-and-pencil and computer-based testing. Journal of
Research on Computing in Education, 28(3), 282–299.
Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of computerized educational
measurement. In R. L. Linn (Ed.), Educational Measurement (pp. 367–407). Washington, DC:
American Council on Education.
Choi, I. C., Kim, K. S., & Boo, J. (2003). Comparability of a paper-based language test and a computer-
based language test. Language Testing, 20(3), 295–320.
Chua, Y. P. (2004). Creative and critical thinking styles. Serdang, Malaysia: Universiti Putra Malaysia
Press.
Chua, Y. P. (2008). Research methods and statistics book 3: Data analysis for nominal and ordinal
scales. Shah Alam, Malaysia: McGraw-Hill Education.
Chua, Y. P. (2009a). Writing a series of best-selling research reference books. Journal of Scholarly
Publishing, 40(4), 408–419. doi: 10.3138/jsp.40.4.408.
Chua, Y. P. (2009b). Research methods and statistics book 4: Univariate and multivariate tests. Shah
Alam, Malaysia: McGraw-Hill Education.
Chua, Y. P. (2011a). Establishing a brain styles test: The YBRAINS test. Procedia Social and Behavioral
Sciences, 15, 4019–4027. Retrieved August 16, 2011 from
http://www.sciencedirect.com/science/article/pii/S1877042811009530.
Chua, Y. P. (2011b). Research methods and statistics book 2: Statistics basic (2nd ed.). Shah Alam,
Malaysia: McGraw-Hill Education.
Clariana, R., & Wallace, P. (2002). Paper-based versus computer-based assessment: Key factors
associated with the test mode effect. British Journal of Educational Technology, 33(5), 593–602.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Dane, F. C. (1990). Research methods. California: Brooks/Cole Publishing Company.
DeAngelis, S. (2000). Equivalency of computer-based and paper-and-pencil testing. Journal of Allied
Health, 29(3), 161–164.
Friedrich, S., & Bjornsson, J. (2008). The transition to computer-based testing – New approaches to skills assessment and implications for large-scale testing. Retrieved May 23, 2011 from http://crell.jrc.it/RP/reporttransition.pdf
Genc, H. (2012). An evaluation study of a CALL application: With BELT or without BELT. TOJET: The Turkish
Online Journal of Educational Technology, 11(2). Retrieved July 2, 2011 from
http://www.tojet.net/articles/v11i2/1125.pdf
Guthrie, J. T., & Wigfield, A. (1999). How motivation fits into a science of reading. Scientific Studies of Reading, 3, 199–205.
Hsiao, H. C., Tu, Y. L., & Chung, H. N. (2012). Perceived social supports, computer self-efficacy, and
computer use among high school students. TOJET: The Turkish Online Journal of Educational Technology,
11(2).
International Test Commission. (2004). International Guidelines on Computer-Based and Internet-
Delivered Testing. Retrieved January 21, 2011 from http://www.intestcom.org/itc_projects.htm.
ITEX ’10. (2010). Results of the International Invention, Innovation and Technology Exhibition 2010,
May 14–16, 2010. Retrieved January 2, 2012 from
http://www.ippp.um.edu.my/images/ippp/doc/itex%202010.pdf.
Kate Tzu, C. C. (2012). Elementary EFL teachers’ computer phobia and computer self-efficacy in
Taiwan. TOJET: The Turkish Online Journal of Educational Technology, 11(2). Retrieved June 18, 2012
from http://www.tojet.net/articles/v11i2/11210.pdf
Morgan, C., & O’Reilly, M. (2001). Innovations in online assessment. In F. Lockwood & A. Gooley
(Eds.), Innovation in Open and Distance Learning: Successful Development of Online and Web-
based Learning (pp. 179–188). London: Kogan Page.
OECD. (2010). PISA computer-based assessment of student skills in science. Retrieved December 21, 2011 from http://www.oecd.org/publishing/corrigenda
Parault, J. S., & Williams, H. M. (2009). Reading motivation, reading amount, and text comprehension in deaf and hearing adults. Journal of Deaf Studies and Deaf Education. doi:10.1093/deafed/enp031
Pintrich, P. R. (1989). The dynamic interplay of student motivation and cognition in the college
classroom. In C. Ames and M. Maehr (Eds.), Advances in Achievement and Motivation, 6, 117–160.
PsychoMetrics. (2010). Repeated measures designs. Retrieved January 12, 2012 from
http://www.psychmet.com/id16.html.
Shuttleworth, M. (2009). Repeated measures design. Experiment Resources. Retrieved January 25, 2012 from http://www.experiment-resources.com/repeated-measures-design.html
Scheuermann, F., & Pereira, A. G. (2008). Towards a Research Agenda on Computer-based Assessment -
Challenges and Needs for European Educational Measurement. JRC Scientific and Technical Report,
23306 EN.
Unrau, N., & Schlackman, J. (2006). Motivation and its relationship with reading achievement in an
urban middle school. Journal of Educational Research, 100, 81–101.
Wang, H., & Shin, C. D. (2010). Comparability of computerized adaptive and paper-pencil tests. Test,
Measurement and Research Service Bulletin, 13, 1–7.
Watkins, M. W., & Coffey, D. Y. (2004). Reading motivation: Multidimensional and indeterminate.
Journal of Educational Psychology, 96, 110–118.
Wenemark, M., Persson, A., Brage, H. N., Svensson, T., & Kristenson, M. (2011). Applying motivation
theory to achieve increased response rates, respondent satisfaction and data quality. Journal of
Official Statistics, 27(2), 393–414.
Wigfield, A., & Eccles, J. S. (2000). Expectancy-value theory of achievement motivation. Contemporary
Educational Psychology, 25, 68-81.
Wigfield, A., & Guthrie, J. T. (1997). Relations of children’s motivation for reading to the amount and breadth of their reading. Journal of Educational Psychology, 89, 420–432.
Wilson, F. R., Genco, K. T., & Yager, G. G. (1985). Assessing the equivalence of paper-and-pencil vs.
computerized tests: Demonstration of a promising methodology. Computers in Human Behavior, 1,
265–275.
Wise, S. L., & DeMars, C. E. (2003, June). Examinee motivation in low-stakes assessment: Problems
and potential solutions. Paper presented at the annual meeting of the American Association of
Education Assessment Conference, Seattle.
Yu, C. H., & Ohlund, B. (2010). Threats to validity of research design. Retrieved January 12, 2012 from
http://www.creative-wisdom.com/teaching/WBI/threat.shtml.