2017 7th World Engineering Education Forum (WEEF)

Measuring Cognitive Performance on Programming Knowledge: Classical Test Theory versus Item Response Theory

Marlina Mohamad
Faculty of Technical and Vocational Education
Universiti Tun Hussein Onn Malaysia
Malaysia
[email protected]

Abdul Jalil Omar
Faculty of Technology Management and Business
Universiti Tun Hussein Onn Malaysia
Malaysia
[email protected]

Abstract—This research aims to study the strengths and weaknesses of Classical Test Theory (CTT) and Item Response Theory (IRT) and to suggest the situations in which each theory should be adopted. The test-instruments were developed to measure the cognitive performance of students learning a programming language in a web-mediated environment. This quasi-experimental study also aims to measure the effects of a web-mediated learning environment on the acquisition of abstract knowledge. However, this paper reports only the part of the research findings that involves the comparison between CTT and IRT, mainly on the test-instruments. The test-items show different intervals when located on the same measurement scale. The locations of test-items analysed using raw scores have different intervals between them. However, when using the logit (the measurement unit in IRT), the test-items are located the same distance apart. Therefore, simply summing the raw scores to measure students’ cognitive performance leads to inconsistency in constructing a defensible unit of measurement. The findings suggest that the use of CTT and IRT should be carefully examined depending on the context of the study.

Keywords—classical test theory; item response theory; Rasch model; programming; cognitive performance

978-1-5386-1523-2/17/$31.00 ©2017 IEEE

I. INTRODUCTION
There are currently two popular statistical frameworks for addressing measurement problems such as test-instrument development, test-score equating and the identification of biased test-items: classical test theory (CTT) and item response theory (IRT). Although CTT has served the measurement community for most of this century, IRT has witnessed exponential growth in recent decades. The major advantage of CTT is its relatively weak theoretical assumptions, which make CTT easy to apply in many testing situations [1]. IRT, on the other hand, is more theory grounded and models the probabilistic distribution of examinees’ success at the item level. As its name indicates, IRT primarily focuses on item-level information, in contrast to CTT’s primary focus on test-level information. Despite the theoretical differences between CTT and IRT, there is a lack of empirical knowledge about how, and to what extent, IRT- and CTT-based item and person statistics behave differently. This research aims to analyse the validity and reliability of the test-instruments, to study the strengths and weaknesses of IRT and CTT, and to suggest the situations in which each theory should be adopted. The test-instruments were developed to measure the cognitive performance of students learning a programming language in a web-mediated environment. The intention of this article is to provide a non-technical comparison between CTT and IRT.

II. BACKGROUND
Investigation of the effects of independent variables on human performance in an intervention must rely heavily on the validity of the cognitive measuring instruments. However, research in the social sciences tends to lack rigour in its measuring instruments compared with research in the physical sciences. In physical science studies (for instance, those that measure weight, height or speed), the research participant being measured and the measuring instrument are assumed to fulfil the requirements of fundamental measurement, and the instruments measure on a ratio scale. This practice allows the researcher to employ further analysis of the data involving various parametric statistical techniques. Consider, for example, a study that measures a person’s height. The measure used (in this case, a ruler) is independent of the person being measured. The state of the ruler remains constant while measuring one person’s height against another person’s height. The ruler has equal-interval measurement units that can be positioned on a single dimensional scale. Therefore, direct comparison can be made between persons of different heights. As a result, further analysis using parametric statistical techniques (such as analysis of variance (ANOVA) and the t test) can be easily applied to the data [2].

However, in social science research (for instance, exploring attitudes, behaviour or performance), invariant comparisons and linear scales are not as easily obtained as they are in the physical sciences. Observations are commonly made on ordinal scales; thus, parametric statistical techniques cannot be applied to the data [3][4]. The measurement values on an ordinal scale are only interpretable according to their arrangement in a given order. For instance, partial credit
values from a 0 to 3 scale, in which 0 = incorrect, 1 = partially correct, 2 = almost correct and 3 = completely correct, can only be conceptually interpreted. Clearly, with such scales the difference between two measures cannot be interpreted in a quantitative sense. Regardless of the value labels chosen, the difference between ‘partially correct’ and ‘almost correct’ is not necessarily equal to the difference between ‘almost correct’ and ‘completely correct’. Despite these issues, researchers in many disciplines, including the social sciences, continue to apply parametric statistical methods to their ordinal scale data.

To overcome these shortcomings, statisticians have developed alternative methods for handling ordinal scale data. In the so-called non-parametric statistical tests, ‘medians’ are compared instead of ‘means’. Yet this method has its limitations and is less powerful than parametric tests [5]. Another approach to these problems is to transform ordinal observations into linear measures [6]. This study transformed the data into ‘logit’ units [7].

III. METHODOLOGY
The research methodology required gathering relevant data from a quasi-experiment. It began with the cognitive styles screening test, followed by a pre-test. After that, the participants were required to learn programming knowledge through the web-mediated instructional system. Immediately after the intervention, the participants sat for the post-test (see Fig 1). Even though the experimentation stages mentioned are not compulsory for achieving the research objectives, it is important to ensure that the instruments under test follow the standard procedure of experimentation, as in any other research procedure.

            First Session                          Second Session
Stage 1     CSA Screening Test                     CSA Screening Test
Stage 2     Pre-test                               Pre-test
Stage 3     Web-mediated Instructional System      Web-mediated Instructional System
            (Text-plus-textual (T1))               (Text-plus-textual (T2))
Stage 4     Post-test                              Post-test

Fig 1: Quasi experimentation data collection procedure

The participants for this study were volunteer second-year undergraduate students from a Malaysian university who were enrolled in a programming language course for the first time. After that, the validity and reliability of the test instruments (pre-test and post-test items) were analysed using the Rasch measurement model and raw scores (see Fig 2). The results of the experiment were then calculated using ANOVA and effect size to study the effects.

                                                        IRT            CTT
Testing validity and reliability of the instruments     Rasch model    Raw scores
Results                                                 Effect size    Effect size

Fig 2: Data analysis stages

IV. FINDINGS
This research employed a CTT measurement and the Rasch IRT model not only to test the validity and reliability of the test instruments but also to transform the ordinal raw score data into interval measures, that is the ‘logit’, to calculate the effect sizes. Fig 3 shows the locations of the test-items using ordinal raw scores and Fig 4 shows the locations of the test-items using logit units. The locations are measured on a scale ruler based on the item-difficulty level.

Fig 3: Item location based on raw scores

Fig 4: Item location based on logit units

Pr24 is the most difficult item in both Fig 3 and Fig 4, while Pr06 is the easiest item when analysed using both raw scores and logits. Because logits were used to locate the test-items along the scale, the test-items shown in Fig 4 are positioned the same distance apart. However, as pictured in Fig 3, the test-items located according to the raw scores have different intervals between them. Therefore, simply summing the raw scores to measure students’ cognitive performance leads to an inconsistency in constructing a defensible unit of measurement [8]. As Bond and Fox [9] state on page ix of their book, “....the essential rule in successful measurement is used ubiquitously in money,
length,..... That rule is one more unit means the same amount extra no matter how much there already is”.

The Rasch measurement model belongs to a family of probabilistic models. That is, the Rasch model is classified within item response theory (IRT), or latent trait theory. As the names suggest, the Rasch model focuses on the characteristics of individual test-items (item-level information), rather than concentrating on test-level information as classical test theory (CTT) does [10]. Instead, Rasch measurement allows for a robust approach to measuring an underlying variable or construct such as attitudes, abilities and personality traits [11]. For instance, this study estimates students’ cognitive performance in their acquisition of programming language concepts. The Rasch model therefore predicts the probability of a student giving a correct response in terms of two variables, one relating to the ‘level of test-item difficulty’ and the second to the ‘ability of the participants’ [12]. In other words, the probability of endorsing a test-item is modelled as a ‘logistic function’ of the discrepancy between the person’s proficiency and the test-item difficulty. It was assumed that a student with a higher proficiency had a greater chance of success on a particular test-item than a less proficient student-participant. Moreover, Rasch measurement also has the capability to demonstrate a student’s ability (higher or lower) in a test, compared with other students, on one scale.

The use of raw scores to measure students’ cognitive performance will contribute to a variation in the ‘unit of measurement’ and lead to underestimating the effect sizes. This means that ‘one more unit is no longer equal to the same amount extra no matter how much there already is’. This research demonstrates that the use of a linear measurement allows for statistical evaluation of quantitative data because the data have the mathematical properties of equidistant interval measures, which permit further analysis. Applying different approaches to the same set of quantitative data yields different outcomes. Therefore, using a Rasch measurement model (one of a family of probabilistic IRT models) to analyse the data might limit the mix of findings in similar research.

V. SUMMARY
Even though CTT has served measurement practice for decades, it has shortcomings. Firstly, in CTT, correlation coefficients (or internal consistency) are used as a basis for test-item selection, and correlations are sample dependent [13][14]. Testing instrumentation in an educational setting assumes that the ability of a student will be lower if that student is given a difficult test. However, the ability will be higher if that same student is given a less difficult test. That is, on the one hand, a person’s abilities are dependent on the test-items being administered, while, on the other hand, if the same test-items are given to two different groups, one of above-average students and the other of below-average students, the test-items appear less difficult to the former and more difficult to the latter. This is to say that the calculations of test-item difficulty and test-item discrimination are dependent on the groups of examinees taking the test. In other words, with CTT, the probabilities of participants responding to test-items correctly or partially correctly (test-item difficulty) are calculated from the number or proportion of responses in the sample. This is indicated by the p-value, where p reflects the proportion of responses in a sample that endorse a test-item [15]. As a result, a higher p-value denotes an easier test-item, whereas a low p-value indicates a more difficult test-item [16]. Therefore, the p-value for any given test-item will be higher if the value is drawn from a more able population than if it is calculated from a less able population. This sample dependence is the major drawback of the approach [1]. Since random sampling was not possible in this study, biased parameters were possible [17].

Secondly, to make a fair score comparison within a testing sample, it is necessary that the responses be complete for each person. This condition is critical but hard to achieve because, in educational settings, missing data are unavoidable for a variety of reasons, such as physical or mental fatigue or time constraints. Thus, a fair comparison between a person’s ability and test-item difficulty in CTT cannot be attained if one or more data points are missing (unless missing value imputation is used). Even though some imputation methods are clearly better than others, none of them can really be described as acceptable, as imputation is only deemed appropriate when the proportion of missing data is extremely small. “The only really good solution to the missing data problem is not to have any” [18] (page 2). There is always one simple solution to this problem, the one suggested by the default option of statistical packages: excluding any case (or person) with incomplete responses. Unfortunately, this is not a desirable solution because of the probability of losing, in a practical situation, a significant amount of the data needed to capture the relevant statistic [18].

Thirdly, even though CTT provides estimates of a person’s ability and a test-item’s difficulty, it complicates the prediction of the outcome of the interaction between a particular person and a particular test-item, because persons and test-items are expressed in different metrics.

Consequently, the Rasch IRT model is preferred because of its capacity to overcome the aforementioned shortcomings of CTT. Firstly, the test-item parameters are considered to be independent of the case (or person) characteristics. The total score of a person on a test is sufficient to estimate the person’s ability, and the total score of a test-item is sufficient to estimate the test-item’s difficulty. In the Rasch IRT model, this is referred to as ‘local independence’. In other words, the correct or incorrect answers to a test-item are totally independent of correct or incorrect answers to any other test-item in the test [19]. One way to check local independence is by visually examining the output derived from the Quest estimate; in the Rasch model this is known as the ‘fit statistic’ (for instance, see the item fit map produced by the Quest software), which shows how well each test-item addresses the single underlying construct. In the Rasch model when the data
fit the model, this result can signify that all test-items measure a single construct. This property is also referred to as ‘unidimensionality’ [20]. Outfit or infit mean square values of about one for each test-item are considered acceptable and indicate local independence and unidimensionality [7]. Hence, these requirements allow generalization of a person’s ability and test-item difficulty to other samples of test-items and persons, respectively.

The second reason for choosing the Rasch model is that it is fairly robust in handling incomplete responses. It requires only a sufficient density of data as the starting point for calculations [9]. That is, the Rasch model uses a different probability distribution for incomplete responses than it uses for complete responses. The fundamental measurement of the Rasch model is derived from the principle of ‘specific objectivity’ [21], which was first advocated by Rasch [22]. Specific objectivity means that two different persons can be compared independently of the responses given to specific test-items, and two different test-items can be compared regardless of the ability of the persons attempting those test-items [23]. For instance, in the analogy of measuring human height, any ruler can be used as a measurement tool; the ruler is independent of the person being measured and it does not measure differently from one person’s height to another. Hence, with specific objectivity, missing data would not affect the measures. Despite incomplete responses, the data remain useful and provide plausible information.

Finally, this study considered the ability of the Rasch IRT model approach to visualize the statistical properties of test-items [24]. The performance interaction between persons and test-items can be described more precisely, as the Quest estimate offers a number of visual checks (for instance, see the variable map produced by the Quest software) on the performance of persons in relation to particular test-items, or the performance of test-items in relation to a particular group of people, because both people and test-items are located on the same measurement scale [25], representing the so-called powerful ‘conjoint measurement’ technique [7][26]. A graphical representation, for instance the variable map (one of the Quest software outputs), allows for a meaningful visual observation of person and test-item performance simultaneously. This unique characteristic of the Quest estimate output files offers a richness of diagnosis. Besides these advantages, the Rasch IRT model also allows calibration on small numbers of test-items and small numbers of examinees, in contrast with CTT, which needs a more extensive test of up to 200 or more test-items to attain the test’s accuracy [27].

The primary objective of such assessment (or testing) in the field of education or training is to measure variables such as learning (or training) performance or aptitude [28]. The test-score provides the location of the test-taker on the measurement scale. The test can be given at any time, depending on the purpose. It can be given before instruction, to assist the planning and direction of the instruction; during instruction, to help make instructional adjustments; or after instruction, to provide specific information regarding what learners know and what they do not. Therefore, the Rasch IRT model is appropriate for evaluating the assessment scale and providing evidence and information that suit the purpose.

ACKNOWLEDGMENT
The authors wish to thank the Universiti Tun Hussein Onn Malaysia for funding this study under grant A025.

REFERENCES
[1] Hambleton, R. K., Jones, R. W., & Rogers, H. J. (1993). Influence of item parameter estimation errors in test development. Journal of Educational Measurement, 30(2), 143-155.
[2] Iramaneerat, C., Smith Jr, E. V., & Smith, R. M. (2008). An introduction to Rasch measurement. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 71-85). Thousand Oaks, California: Sage Publications.
[3] Wright, B. D., & Mok, M. M. C. (2004). An overview of the family of Rasch measurement models. In E. V. Smith Jr & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 1-24).
[4] Gravetter, F. J., & Wallnau, L. B. (2009). Statistics for the behavioral sciences (8th ed.). Belmont, California: Wadsworth Cengage Learning.
[5] Siegel, S., & Castellan, N. J. (1988). Non-parametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.
[6] Fischer, G. H. (1995). Derivations of the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 15-38). New York: Springer Verlag.
[7] Adams, R. J., & Khoo, S. T. (1996). QUEST: The interactive test analysis system (Vol. 2.1). Melbourne: Australian Council for Educational Research.
[8] Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: Mesa Press.
[9] Bond, T., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). London: Lawrence Erlbaum.
[10] Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357-381.
[11] Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
[12] Barrett, S. (2001). Differential item functioning: A case study from first year economics. International Education Journal, 2(3), 123-132.
[13] Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill.
[14] Liu, X., & Kalman, C. (2010). Using and developing measurement instruments in science education: A Rasch modeling approach. North Carolina: Information Age Publishing.
[15] Domino, G., & Domino, M. L. (2006). Psychological testing: An introduction. Cambridge: Cambridge University Press.
[16] Kline, T. (2005). Psychological testing: A practical approach to design and evaluation. Thousand Oaks, California: Sage Publications.
[17] Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates.
[18] Allison, P. D. (2001). Missing data. Thousand Oaks: Sage Publications.
[19] Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: Mesa Press.
[20] Bond, T., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Mahwah.
[21] Linacre, J. M. (1989). Rasch models from objectivity: A generalization. Paper presented at the Fifth International Objective Measurement Workshop, Berkeley.
[22] Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19(1), 49-57.
[23] Irtel, H. (1995). An extension of the concept of specific objectivity. Psychometrika, 60(1), 115-118.
[24] Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks: Sage Publications.
[25] Callingham, R., & Bond, T. (2006). Research in mathematics education and Rasch measurement. Mathematics Education Research Journal, 18(2), 1-10.
[26] Adanez, G. P., & Velasco, A. D. (2002). Predicting academic success of engineering students in technical drawing from visualization test scores. Journal for Geometry and Graphics, 6(1), 99-109.
[27] Baker, F. B. (2001). The basics of item response theory (2nd ed.). United States of America: ERIC Clearinghouse on Assessment and Evaluation.
[28] Griffin, P. (2009). Teachers' use of assessment data. In C. W. Smith & J. J. Cumming (Eds.), Educational assessment in the 21st century: Connecting theory and practice. Netherlands: Springer.
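
As a supplementary illustration of the ‘logistic function’ described in Section IV, the following minimal Python sketch computes the Rasch probability of a correct response from the difference between person ability and test-item difficulty, and shows why equal steps in raw proportion-correct scores are not equal steps in logits. The function names and example values are assumptions made for illustration only; they are not taken from the study's data or from the Quest software, and the simple log-odds transform stands in for the full Rasch estimation procedure.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def proportion_to_logit(proportion: float) -> float:
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return math.log(proportion / (1.0 - proportion))


# Hypothetical proportion-correct scores, equally spaced on the raw scale.
raw_proportions = [0.5, 0.6, 0.7, 0.8, 0.9]
logits = [proportion_to_logit(p) for p in raw_proportions]

# Successive gaps: constant (0.1) on the raw scale, growing on the logit scale.
logit_gaps = [b - a for a, b in zip(logits, logits[1:])]

if __name__ == "__main__":
    for p, x in zip(raw_proportions, logits):
        print(f"raw proportion {p:.1f} -> {x:+.3f} logits")
    print("logit gaps:", [round(g, 3) for g in logit_gaps])
    # A person whose ability equals the item difficulty has an even chance.
    print("P(correct | ability = difficulty):", rasch_probability(0.0, 0.0))
```

The widening logit gaps are one way to see the point made with Fig 3 and Fig 4: a fixed raw-score increment near the top of the scale represents more of the underlying trait than the same increment near the middle.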

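
The sample dependence of the CTT p-value discussed in Section V, and the contrasting Rasch property that two test-items can be compared regardless of who attempts them, can be sketched as follows. This is an illustrative sketch under assumed values: the ability and difficulty figures are hypothetical, the expected p-value is computed directly from the Rasch probability rather than from observed responses, and none of the names come from the study or from Quest.

```python
import math
from statistics import mean


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_p_value(abilities: list[float], difficulty: float) -> float:
    """Expected CTT p-value: mean probability of success across a group."""
    return mean(rasch_probability(a, difficulty) for a in abilities)


def log_odds(probability: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(probability / (1.0 - probability))


def item_gap_in_logits(ability: float, easy: float, hard: float) -> float:
    """Difference in log-odds of success between two items for one person."""
    return log_odds(rasch_probability(ability, easy)) - log_odds(
        rasch_probability(ability, hard)
    )


# Hypothetical groups and items (all values in logits).
above_average = [0.5, 1.0, 1.5]
below_average = [-1.5, -1.0, -0.5]
easy_item, hard_item = -0.5, 1.0

p_above = expected_p_value(above_average, hard_item)
p_below = expected_p_value(below_average, hard_item)

if __name__ == "__main__":
    # Same item, different groups: the CTT p-value changes with the sample.
    print(f"p-value, above-average group: {p_above:.2f}")
    print(f"p-value, below-average group: {p_below:.2f}")
    # Same two items, different persons: the gap stays at hard - easy logits.
    for ability in (-1.0, 0.0, 2.0):
        gap = item_gap_in_logits(ability, easy_item, hard_item)
        print(f"ability {ability:+.1f}: item gap {gap:.6f} logits")
```

Under these assumptions the same item looks easier to the above-average group than to the below-average group, while the logit distance between the two items is the same for every person, which is the kind of invariant comparison that specific objectivity describes.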