Measuring Cognitive Performance On Programming Knowledge - Classical Test Theory Versus Item Response Theory
Abstract—This research studies the strengths and weaknesses of Classical Test Theory (CTT) and Item Response Theory (IRT) and suggests the situations in which each theory should be adopted. The test-instruments were developed to measure the cognitive performance of students learning a programming language through a web-mediated environment. This quasi-experimental study also aimed to measure the effects of a web-mediated learning environment on the acquisition of abstract knowledge. However, this paper reports only the part of the research findings that involves the comparison between CTT and IRT, mainly with respect to the test-instruments. The test-items show different intervals when located on the same measurement scale. Located using raw scores, the test-items have different intervals between them. However, located using the logit (the measurement unit in IRT), the test-items lie the same distance apart. Therefore, simply summing the raw scores to measure students’ cognitive performance leads to inconsistency in constructing a defensible unit of measurement. The findings suggest that the use of CTT and IRT should be carefully examined depending on the context of the study.

Keywords—classical test theory; item response theory; Rasch model; programming; cognitive performance

I. INTRODUCTION

There are two currently popular statistical frameworks for addressing measurement problems such as test-instrument development, test-score equating and the identification of biased test-items: classical test theory (CTT) and item response theory (IRT). Although CTT served the measurement community for most of the past century, IRT has witnessed exponential growth in recent decades. The major advantage of CTT is its relatively weak theoretical assumptions, which make CTT easy to apply in many testing situations [1]. IRT, on the other hand, is more theory grounded and models the probabilistic distribution of examinees’ success at the item level. As its name indicates, IRT primarily focuses on item-level information, in contrast to CTT’s primary focus on test-level information. Despite the theoretical differences between CTT and IRT, there is a lack of empirical knowledge about how, and to what extent, IRT- and CTT-based item and person statistics behave differently. This research aims to analyse the validity and reliability of the test-instruments, to study the strengths and weaknesses of IRT and CTT, and to suggest the situations in which each theory should be adopted. The test-instruments were developed to measure the cognitive performance of students learning a programming language through a web-mediated environment. The intention of this article is to provide a non-technical comparison between CTT and IRT.

II. BACKGROUND

Investigation of the effects of independent variables on human performance in an intervention must rely heavily on the validity of the cognitive measuring instruments. However, research in the social sciences tends to lack rigour in its measuring instruments compared with research in the physical sciences. In physical science studies (for instance, those that measure weight, height or speed), the research participant being measured and the measuring instrument are assumed to fulfil the requirements of fundamental measurement, and the measurements are made on a ratio scale. This practice allows the researcher to employ further analysis of the data involving various parametric statistical techniques. Consider, for example, a study that measures a person’s height. The measure used (which in this case is a ruler) is independent of the person being measured. The state of the ruler remains constant while measuring one person’s height against another person’s height. The ruler has equal-interval measurement units that can be positioned on a single dimensional scale. Therefore, direct comparison can be made between persons of different heights. As a result, further analysis using parametric statistical techniques (such as analysis of variance (ANOVA) and the t test) can be easily applied to the data [2].

However, in social science research (for instance, exploring attitudes, behaviour or performance), invariant comparisons and linear scales are not as easily obtained as they are in the physical sciences. Observations are commonly made on ordinal scales; thus, parametric statistical techniques cannot be applied to the data [3][4]. The measurement values using an ordinal scale are only interpretable according to their arrangement in a given order. For instance, partial credit
values from a 0 to 3 scale, in which 0 = incorrect, 1 = partially correct, 2 = almost correct and 3 = completely correct, can only be conceptually interpreted. Clearly, with such scales the difference between two measures cannot be interpreted in a quantitative sense. Regardless of the value labels chosen, the difference between ‘partially correct’ and ‘almost correct’ is not necessarily equal to the difference between ‘almost correct’ and ‘completely correct’. Despite these issues, researchers in many disciplines, including the social sciences, continue to apply parametric statistical methods to their ordinal scale data.

To overcome these shortcomings, statisticians have developed alternative methods for handling ordinal scale data. In the so-called non-parametric statistical tests, ‘medians’ are compared instead of ‘means’. Yet this method has its limitations and is less powerful than parametric tests [5]. Another approach to these problems is to transform ordinal observations into linear measures [6]. This study transformed the data into the ‘logit’ unit [7].

III. METHODOLOGY

The research methodology required gathering relevant data from a quasi-experiment. It began with a cognitive styles screening test, followed by a pre-test. After that, the participants were required to learn programming knowledge through the web-mediated instructional system. Immediately after the intervention, the participants sat for the post-test (see Fig 1). Even though these experimentation stages are not compulsory for achieving the research objectives, it is important to ensure that the instruments under test follow the standard procedure of experimentation, as in any other research procedure.

[Figure: Stage 3, web-mediated instructional system (text-plus-textual), treatments T1 and T2; Stage 4, post-test for each treatment]
Fig 1: Quasi experimentation data collection procedure

The participants for this study were volunteer second-year undergraduate students from a Malaysian university who were enrolled in a programming language course for the first time. After that, the validity and reliability of the test-instruments (pre-test and post-test items) were analysed using the Rasch measurement model and raw scores (see Fig 2). The results of the experimentation were then calculated using ANOVA and effect size to study the effects.

[Figure: two parallel tracks, IRT (Rasch model) and CTT (raw scores); each tests the validity and reliability of the instruments, then computes the effect size, leading to the results]
Fig 2: Data analysis stages

IV. FINDINGS

This research employed a CTT measurement and the Rasch IRT model not only to test the validity and reliability of the test-instruments but also to transform the ordinal raw score data into interval measures, that is, the ‘logit’, in order to calculate the effect sizes. Fig 3 shows the location of the test-items using ordinal raw scores and Fig 4 shows the location of the test-items using logit units. The locations are measured using the scale ruler based on the item-difficulty level.

Fig 3: Item location based on raw scores

In both Fig 3 and Fig 4, Pr24 is the most difficult item, while Pr06 is the easiest item, whether analysed using raw scores or logits. Because logits were used to locate the test-items along the scale, the test-items shown in Fig 4 are positioned the same distance apart. However, as pictured in Fig 3, the test-items located according to the raw scores have different intervals between them. Therefore, simply summing the raw scores to measure students’ cognitive performance leads to an inconsistency in constructing a defensible unit of measurement [8]. As Bond and Fox [9] state on page ix of their book, “....the essential rule in successful measurement is used ubiquitously in money,
2017 7th World Engineering Education Forum (WEEF)
length,..... That rule is one more unit means the same amount extra no matter how much there already is”.

The Rasch measurement model belongs to a family of probabilistic models and is classified within item response theory (IRT), or latent trait theory. As the names suggest, the Rasch model focuses on the characteristics of individual test-items (item-level information), rather than concentrating on test-level information as classical test theory (CTT) does [10]. The Rasch measurement allows for a robust approach to measuring an underlying variable or construct such as attitudes, abilities and personality traits [11]. For instance, this study estimates students’ cognitive performance in their acquisition of programming language concepts. The Rasch model therefore predicts the probability of a student giving a correct response in terms of two variables, one relating to the ‘level of test-item difficulty’ and the other to the ‘ability of the participants’ [12]. In other words, the probability of endorsing a test-item is modelled as a ‘logistic function’ of the discrepancy between the person’s proficiency and the test-item difficulty. It is assumed that a student with a higher proficiency has a greater chance of success on a particular test-item than a less proficient student. Moreover, the Rasch measurement can also show a student’s ability (higher or lower) in a test relative to other students on one scale.

The use of raw scores to measure students’ cognitive performance contributes to variation in the ‘unit of measurement’ and leads to underestimating the effect sizes. This means that ‘one more unit is no longer equal to the same amount extra no matter how much there already is’. This research demonstrates that the use of a linear measurement allows for statistical evaluation of quantitative data because the data have the mathematical properties of equidistant interval measures, which permit further analysis. Applying different approaches to the same set of quantitative data yields different outcomes. Therefore, using a Rasch measurement model (one of a family of probabilistic IRT models) to analyse the data might limit the mix of findings in similar research.

V. SUMMARY

Even though CTT has served measurement practice for decades, it has shortcomings. Firstly, in CTT, correlation coefficients (or internal consistency) are used as a basis for test-item selection, and correlations are sample dependent [13][14]. Test instrumentation in an educational setting assumes that the ability of a student will be lower if that student is given a difficult test, and higher if that same student is given a less difficult test. That is, on the one hand, a person’s abilities are dependent on the test-items being administered; on the other hand, if the same test-items are administered to two different groups, one of above-average students and the other of below-average students, the test-items appear less difficult to the former and more difficult to the latter. This is to say that the calculations of test-item difficulty and test-item discrimination depend on the groups of examinees taking the test. In other words, with CTT, the probabilities of participants responding to test-items correctly or partially correctly (test-item difficulty) are calculated from the number or proportion of responses in the sample. This is indicated by the p-value, where p reflects the proportion of responses in a sample that endorse a test-item [15]. As a result, a higher p-value denotes an easier test-item, whereas a low p-value indicates a more difficult test-item [16]. Therefore, the p-value for any given test-item will be higher if it is drawn from a more able population than if it is calculated from a less able population. This sample dependence is the major drawback of the approach [1]. Since random sampling was not possible in this study, biased parameters were possible [17].

Secondly, to make a fair score comparison within a testing sample, the responses must be complete for each person. This condition is critical because, in educational settings, missing data are unavoidable for a variety of reasons, such as physical or mental fatigue or time constraints. Thus, a fair comparison between a person’s ability and test-item difficulty in CTT cannot be attained if one or more data points are missing (unless missing-value imputation is used). Even though some imputation methods are clearly better than others, none of them can really be described as acceptable, as imputation is only deemed appropriate when the possibility of missing data is extremely small. “The only really good solution to the missing data problem is not to have any” [18] page 2. There is always one simple solution to this problem, as suggested by the default option of statistical packages: excluding the case (or person) with incomplete responses. Unfortunately, this is not a desirable solution because, in a practical situation, it risks losing a significant amount of the data needed to capture the relevant statistic [18].

Thirdly, even though CTT provides an estimate of a person’s ability and a test-item’s difficulty, it complicates the prediction of the outcome of the interaction between a particular person and a particular test-item, because the measurements of persons and test-items are expressed in different metrics.

Consequently, the Rasch IRT model is preferred because of its capacity to overcome the aforementioned shortcomings of CTT. Firstly, the test-item parameters are considered to be independent of the case (or person) characteristics. The total score of a person on a test is sufficient for estimating the person’s ability, and the total score of a test-item is sufficient for estimating the test-item’s difficulty. In the Rasch IRT model, this is referred to as ‘local independence’. In other words, the correct or incorrect answers to a test-item are totally independent of the correct or incorrect answers to any other test-item in the test [19]. One way to assess local independence is to visually examine the output derived from the Quest estimates. In the Rasch model, this is known as the ‘fit statistic’ (for instance, see an item fit map produced by the Quest software), which shows how well each test-item addresses the single underlying construct. In the Rasch model when the data
fit the model, this result can signify that all test-items measure a single construct. This property is also referred to as ‘unidimensionality’ [20]. Outfit or infit mean square values of about one for each test-item are considered acceptable and indicate local independence and unidimensionality [7]. Hence, these requirements allow generalization of a person’s ability and test-item difficulty to other samples of test-items and persons, respectively.

[...] depending on the purpose. It can be either before instruction, to assist the planning and direction of the instruction; during instruction, to help make instructional adjustments; or after instruction, to provide specific information regarding what learners know and what they do not. Therefore, the Rasch IRT model is appropriate for evaluating the assessment scale and providing evidence and information that suit the purpose.
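
The Rasch model discussed in the findings treats the probability of a correct response as a logistic function of the gap between person ability and test-item difficulty, both expressed in logits. The following is a minimal Python sketch of that relationship. It assumes the dichotomous (right or wrong) form of the model, not the partial credit scoring or the Quest estimation used in this study. The function names and the example proportions are illustrative only and are not taken from the study data.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def proportion_to_logit(p: float) -> float:
    """Log-odds (logit) of a proportion p, with 0 < p < 1.

    This is a simple rescaling of a proportion correct, not a full
    Rasch parameter estimation.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


if __name__ == "__main__":
    # A person whose ability equals the item difficulty has an even chance.
    print(rasch_probability(ability=0.0, difficulty=0.0))

    # Equal steps in proportion correct are not equal steps in logits.
    for p in (0.50, 0.70, 0.90):
        print(p, round(proportion_to_logit(p), 3))
```

In this sketch, the step from 0.50 to 0.70 and the step from 0.70 to 0.90 are the same size in proportion correct but differ in logits, which is one way to see why raw-score intervals and logit intervals can disagree.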
[15] Domino, G., & Domino, M. L. (2006). Psychological testing: an introduction. Cambridge: Cambridge University Press.
[16] Kline, T. (2005). Psychological testing: a practical approach to design and evaluation. Thousand Oaks, California: Sage Publications.
[17] Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates.
[18] Allison, P. D. (2001). Missing data. Thousand Oaks: Sage Publications.
[19] Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: Mesa Press.
[20] Bond, T., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. London: Mahwah.
[21] Linacre, J. M. (1989). Rasch models from objectivity: A generalization. Paper presented at the Fifth International Objective Measurement Workshop, Berkeley.
[23] Irtel, H. (1995). An extension of the concept of specific objectivity. Psychometrika, 60(1), 115-118.
[24] Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks: Sage Publications.
[25] Callingham, R., & Bond, T. (2006). Research in mathematics education and Rasch measurement. Mathematics Education Research Journal, 18(2), 1-10.
[26] Adanez, G. P., & Velasco, A. D. (2002). Predicting academic success of engineering students in technical drawing from visualization test scores. Journal for Geometry and Graphics, 6(1), 99-109.
[27] Baker, F. B. (2001). The basics of item response theory (2nd ed.). United States of America: ERIC Clearinghouse on Assessment and Evaluation.
[28] Griffin, P. (2009). Teachers' use of assessment data. In C. W. Smith & J. J. Cumming (Eds.), Educational assessment in the 21st century: Connecting theory and practice. Netherlands: Springer.