Comparison of Classical Test Theory and Item Response Theory: A Review of Empirical Studies

1Ado Abdu Bichi, 2Rahimah Embong, 3Mustafa Mamat, 4Danjuma A. Maiwada

1 Faculty of Islamic Contemporary Studies, Universiti Sultan Zainal Abidin, 21300 Kuala Terengganu, Malaysia
2 Research Institute of Islamic Product & Civilization, Universiti Sultan Zainal Abidin, 21300 Kuala Terengganu, Malaysia
3 Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, 21300 Kuala Terengganu, Malaysia
4 Faculty of Education, Northwest University, PMB 3220, Kano, Nigeria
person statistics of IRT has been illustrated theoretically (Hambleton and Swaminathan, 1985; Hambleton, Swaminathan, and Rogers, 1991). A search of the literature reveals that several studies have been conducted to empirically examine the comparability of IRT-based and CTT-based item and person statistics (Lawson, 1991; Fan, 1998; MacDonald and Paunonen, 2002; Guler et al., 2014; Courville, 2004; Ojerinde et al., 2012; Ojerinde, 2013; Magno, 2009). All these studies compared IRT-based and CTT-based item and person statistics using different data sets and settings, and their findings revealed strong relationships between IRT and CTT, suggesting that the information the two approaches provide about items and examinees might be very much the same. However, Cook, Eignor, and Taft (1988) reported a lack of invariance for both CTT-based and IRT-based item difficulty estimates.

Despite the number of empirical studies conducted to directly or indirectly provide empirical evidence of the relationship between CTT and IRT, there are not enough studies that provide evidence of the theoretical superiority of IRT over CTT. The failure of these studies to confirm empirically the clear distinction between the two measurement frameworks that has been established theoretically leaves open questions about the suitability of the tools used in the studies.

Objective:
The main purpose of this paper is to critically review the previous empirical studies that were conducted to clarify some aspects of classical test theory (CTT) and item response theory (IRT) modeling, especially with regard to item development.

Significance of the study:
The results from this review of empirical comparisons of CTT and IRT will provide measurement specialists, test developers and researchers with information regarding the suitability and comparability of the two frameworks from a practical point of view, and will provide a basis for further research to improve measurement practice.

The following questions guide the study:
(1) How comparable are the CTT and IRT frameworks in terms of item and person parameters?
(2) Is there enough empirical evidence on the extent to which CTT and IRT behave differently?

2.0 Concept of CTT and IRT:
Classical test theory (CTT) and item response theory (IRT) are generally perceived as the two popular statistical frameworks for addressing measurement problems. Both approaches describe characteristics of an individual, analyse abilities and latent attributes, and make it possible to predict outcomes of psychological and educational tests by identifying item parameters such as item difficulty and item discrimination and the ability of the examinees. Although CTT has dominated educational and psychological measurement for most of its history, in recent decades IRT has been gaining ground, thereby becoming a favoured measurement framework. The weak theoretical assumptions of CTT are the major argument against it, although, according to Hambleton and Jones (1993), they are also what makes CTT easy to apply in many testing situations. In their view, the person statistic is item dependent, and the item statistics, such as item difficulty and item discrimination, are sample dependent. IRT, on the other hand, is more theory grounded and models the distribution of examinees' success at the item level. As its name implies, IRT mainly focuses on item-level information, in contrast to CTT's principal focus on test-level information. Notwithstanding the recent growth and theoretical superiority of item response theory (IRT), classical test theory (CTT) continues to be an important framework for test construction (Bechger et al., 2003). It is therefore pertinent to give a brief explanation of the two frameworks, so that researchers and item writers who are familiar with them can form a clear notion of the relationship between CTT and IRT.

2.1 What is Classical Test Theory?
According to Hambleton and Jones (1993), classical test theory is a theory about test scores that introduces three concepts: (1) the test score (often called the observed score), (2) the true score, and (3) the error score. Within that theoretical framework, models of various forms have been formulated. For example, in what is often referred to as the "classical test model," a simple linear model is postulated linking the observable test score (X) to the sum of two unobservable (often called latent) variables, the true score (T) and the error score (E), that is:

X = T + E.

Because the true score is not directly observable, it must be estimated from the individual's responses to a set of test items. The equation is therefore not solvable unless some simplifying assumptions are made. The major assumptions underlying CTT are that (a) true scores and error scores are uncorrelated, (b) the average error score in the population of examinees is zero, and (c) error scores on parallel tests are uncorrelated.
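As a concrete illustration (a minimal sketch, not part of the original paper; the data and function are invented for this example), the familiar CTT item statistics, difficulty as the proportion of correct responses (p) and discrimination as a point-biserial correlation (r), can be computed from a scored response matrix as follows:

```python
import numpy as np

def ctt_item_statistics(responses):
    """CTT item statistics from a persons-by-items matrix of 0/1 scores.

    Returns item difficulty p (proportion answering correctly) and item
    discrimination r (point-biserial correlation between each item score
    and the total score on the remaining items).
    """
    responses = np.asarray(responses, dtype=float)
    p = responses.mean(axis=0)  # difficulty: proportion correct per item
    r = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest = np.delete(responses, j, axis=1).sum(axis=1)  # corrected total
        r[j] = np.corrcoef(responses[:, j], rest)[0, 1]     # point-biserial
    return p, r

# Illustrative data: 6 examinees x 4 items (1 = correct, 0 = incorrect).
X = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 1, 1],
     [1, 1, 1, 1],
     [0, 0, 0, 0],
     [1, 1, 0, 1]]
p, r = ctt_item_statistics(X)
print("difficulty p:     ", p.round(2))  # higher p = easier item
print("discrimination r: ", r.round(2))
```

These are the same p and r statistics that appear in the CTT column of Table 1 below.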
The major advantage of CTT, as highlighted by Hambleton and Jones (1993), is its relatively weak theoretical assumptions, which make CTT easy to apply in many testing situations. The benefits of using CTT in test development, as given by Schumacker (2010), are: (1) compared to item response theory models, analyses can be performed with smaller representative samples of examinees, which is particularly important when field-testing a measuring instrument; (2) classical test analysis employs relatively simple mathematical procedures, and model parameter estimation is conceptually straightforward; (3) classical test analysis is often referred to as a "weak model" because its assumptions are easily met by traditional testing procedures.

Fan (1998) summarizes the major limitation of CTT as circular dependency: (a) the person statistic (i.e., the observed score) is (item) sample dependent, and (b) the item statistics (i.e., item difficulty and item discrimination) are (examinee) sample dependent. This circular dependency poses theoretical difficulties for CTT's application in some measurement situations (e.g., test equating, computerized adaptive testing). The major focus of CTT is on test-level information; however, item statistics (i.e., item difficulty and item discrimination) are also an important part of the CTT model (Fan, 1998).

2.2 What is Item Response Theory?
Hambleton and Jones (1993) describe item response theory as a general statistical theory about examinee item and test performance and how performance relates to the abilities that are measured by the items in the test. Item responses can be discrete or continuous and can be dichotomously or polychotomously scored; item score categories can be ordered or unordered; there can be one ability or many abilities underlying test performance; and there are many ways (i.e., models) in which the relationship between item responses and the underlying ability or abilities can be specified. Within the general IRT framework, many models have been formulated and applied to real test data.

The characteristics of item response models, as summarised by Hambleton and Swaminathan (1985), are: first, an IRT model must specify the relationship between the observed response and the underlying unobservable construct; secondly, the model must provide a way to estimate scores on the ability; thirdly, the examinee's scores will be the basis for estimation of the underlying construct; and finally, an IRT model assumes that the performance of an examinee can be completely predicted or explained from one or more abilities. In item response theory, it is often assumed that an examinee has some latent, unobservable trait (also called ability) which cannot be studied directly. The purpose of IRT is to propose models that link this latent trait to observable characteristics of the examinee, especially his/her ability to answer correctly a set of questions that form a test (Magis, 2007).

In item response theory, the item parameters include difficulty (location), discrimination (slope), and pseudo-guessing (lower asymptote). The three most commonly used IRT models are the one-parameter logistic model (1PLM, or Rasch model), the two-parameter logistic model (2PLM) and the three-parameter logistic model (3PLM). All three models have an item difficulty parameter (b); in addition, the 2PL and 3PL models possess a discrimination parameter (a), which allows items to discriminate differently among examinees. The 3PL model contains a third parameter, referred to as the pseudo-chance parameter (c). The pseudo-chance parameter (c) corresponds to the lower asymptote of the item characteristic curve (ICC) and represents the probability that test takers of very low ability will answer the item correctly (Embretson and Reise, 2000).
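The section above names the parameters but not the model equation itself. For reference, the 3PL item characteristic curve is standardly written as P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))); fixing c = 0 gives the 2PL, and additionally fixing a to a common value for all items gives the 1PL/Rasch model. A minimal sketch (illustrative code, not from the paper):

```python
import numpy as np

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability; b: difficulty (location);
    a: discrimination (slope); c: pseudo-chance (lower asymptote).
    c = 0 reduces this to the 2PL; c = 0 with a common a gives the 1PL.
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.array([-2.0, 0.0, 2.0])
print(icc_3pl(theta, a=1.2, b=0.5, c=0.2))
# Very low-ability examinees succeed with probability near c, and at
# theta = b the probability is midway between c and 1.
```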
3.0 Comparison of CTT and IRT:
According to Sohn (2009), one of the distinguishing characteristics of item indices under the CTT and IRT frameworks is whether they are sample dependent or sample invariant. The item parameters under CTT are regarded as sample dependent because they are based on the total score of the test, which is the person parameter in CTT and itself varies from sample to sample. Another way of saying this is that the values of the item parameters differ across the samples collected for the test. This characteristic may be a threat to the reliability of the test, so random sampling is assumed for CTT in order to generalize the results of the test. The item parameters under IRT, however, are not considered to be dependent upon the ability level of the examinees responding to the items (Baker, 2001). In other words, the item parameters are regarded as sample invariant: if an item measures the same latent trait across groups, the estimated item parameters are assumed to be the same. Because the item difficulty parameter under IRT is independent of the sample, it is considered easier to interpret than its CTT counterpart. Baker and Kim (2004) argue that the concept of item difficulty in CTT and the location parameter b_i in IRT are not completely interchangeable. Under CTT, an easy item is one for which the proportion of correct responses in the total population is high. Under IRT, on the other hand, an item is defined as easy when the item difficulty parameter lies below the average ability level. The item discrimination parameter, by contrast, is regarded under both CTT and IRT as the parameter that makes it possible to distinguish among examinees of different ability. Thus, IRT has been considered to hold advantages over CTT, at least from a theoretical point of view. Lord (1980) argued that IRT provides methods for identifying items that discriminate optimally in the neighbourhood of a passing score. However, practical research comparing CTT and IRT has not shown any consistent superiority of IRT measurement statistics.
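The sample-dependence contrast described above is easy to demonstrate by simulation. In the following sketch (illustrative only, not taken from any of the reviewed studies), one 2PL item with fixed generating parameters is "administered" to a low-ability and a high-ability group: the IRT parameters a and b are identical for both groups by construction, while the CTT difficulty index p shifts with the ability distribution of whoever takes the test:

```python
import numpy as np

rng = np.random.default_rng(0)

def ctt_p_for_group(theta, a=1.0, b=0.0):
    """CTT difficulty (proportion correct) of one 2PL item in a given group."""
    prob = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL response probabilities
    return (rng.random(theta.size) < prob).mean()   # simulated 0/1 responses

low = rng.normal(-1.0, 1.0, 10_000)   # low-ability group
high = rng.normal(+1.0, 1.0, 10_000)  # high-ability group

# Same item (a = 1, b = 0) in both groups: b is invariant by construction,
# but the CTT p-value depends on the sample.
print("p in low-ability group: ", ctt_p_for_group(low))   # roughly 0.3
print("p in high-ability group:", ctt_p_for_group(high))  # roughly 0.7
```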
Table 1: Main differences between CTT and IRT models (source: Hambleton and Jones, 1993).

Area | CTT | IRT
Model | Linear | Nonlinear
Level | Test | Item
Assumptions | Weak (i.e., easy to meet with test data) | Strong (i.e., more difficult to meet with test data)
Item-ability relationship | Not specified | Item characteristic functions
Ability | Test scores or estimated true scores, reported on the test-score scale (or a transformed test-score scale) | Ability scores, reported on the scale −∞ to +∞ (or a transformed scale)
Invariance of item and person statistics | No: item and person parameters are sample dependent | Yes: item and person parameters are sample independent, if the model fits the test data
Item statistics | p, r | b, a and c (for the three-parameter model), plus corresponding item information functions
Sample size (for item parameter estimation) | 200 to 500 (in general) | Depends on the IRT model, but larger samples (over 500, in general) are needed
4.0 Review of Empirical Studies:
Many studies have been conducted to investigate the comparability of item and person parameter estimates obtained with the CTT and IRT approaches. The studies conducted by different scholars are critically reviewed and discussed in this study.

Lawson (1991) compared Rasch model item and person parameters to CTT difficulty and number-right scores in three sets of examination data. His results showed that the correlation (r = -.9949) between item difficulty estimates from CTT and IRT (Rasch model) was very high in magnitude; the negative sign reflects the opposite orientation of the two scales, since the CTT index gives easy items high p-values while the Rasch model gives them low b-values. This result indicated that item difficulty estimates behaved very similarly under the two approaches. CTT and IRT have also been compared under simulated conditions.

Fan (1998) used a large-scale test database from a statewide assessment program to examine and compare CTT and IRT estimates from different subsamples of N = 1000, in order to investigate the invariance properties of CTT and IRT parameter estimates. He created samples varying in their representativeness by sampling a larger dataset (e.g., random selections vs. men and women vs. high and low scorers). IRT parameters are, in theory, invariant, but Fan was testing the empirical invariance of the IRT item parameter estimates and comparing them to estimates of CTT statistics, which are thought not to be invariant. Fan found that both CTT and IRT difficulty statistics, and to a lesser degree discrimination statistics, displayed invariance; for item difficulty, the CTT estimates were closer to perfect invariance. Overall, Fan's findings show that CTT and IRT produced similar results both in the comparability of item and person statistics and in the degree of invariance of the item statistics under the two approaches. In his conclusion, Fan questioned whether IRT has the "advertised" advantages over CTT.
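Fan's invariance check can be reproduced in outline: estimate the item statistics separately in two subsamples and correlate the two sets of estimates, reading a correlation near 1 as empirical invariance. A rough sketch of this procedure for CTT difficulty (illustrative only; Fan's study used operational test data and dedicated IRT calibration software):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: 2,000 examinees x 30 items, simulated under a 2PL model.
n, k = 2000, 30
theta = rng.normal(0.0, 1.0, n)
a = rng.uniform(0.8, 1.6, k)
b = rng.normal(0.0, 1.0, k)
prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
X = (rng.random((n, k)) < prob).astype(int)

# Split into two disjoint subsamples and estimate CTT difficulty in each.
idx = rng.permutation(n)
p1 = X[idx[: n // 2]].mean(axis=0)
p2 = X[idx[n // 2 :]].mean(axis=0)

# Invariance check: correlate the two sets of estimates across items.
print("r(p1, p2) =", np.corrcoef(p1, p2)[0, 1])  # near 1 for random halves
```

Random halves of the same population will of course agree closely; Fan's more demanding comparisons contrasted deliberately non-equivalent subsamples (e.g., high versus low scorers), where CTT p-values are expected to drift.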
Idowu et al. (2001) applied classical test theory and item response theory to evaluate the quality of an assessment constructed by the researchers to measure National Certificate of Education (NCE) students' achievement in Mathematics. A sample of 80 students was drawn for the study from the Abia State College of Education, Arochukwu. The instrument used was the Mathematics Achievement Test (MAT) for college students, developed by the researchers. The data were analysed in two dimensions: first, the psychometric properties of the instrument were analysed using CTT and IRT, and then item bias was detected using a method for Differential Item Functioning (DIF). The results showed that although classical test theory (CTT) and item response theory (IRT) methods differ in many ways, the outcomes of the data analysis using the two methods in this study did not reflect this: items found to be "bad items" under CTT also failed to fit the Rasch model. The overall results of the analysis showed that the achievement test in its generality was a good test. Although some items were removed, revised or rephrased, most of the items came out as "good items"; the removed or revised items were also the ones with extreme logit measures that made them misfit the latent trait model. Surprisingly, some of the items came out as biased, as detected in the DIF analysis.

MacDonald and Paunonen (2002) felt that prior research might be limited by the fact that the earlier comparisons were real-data studies. In particular, these researchers were interested in the effects of the specific items used in a study; they also wished to examine accuracy, which was not possible in the previous, real-data studies. They therefore simulated data using 1PL and 2PL IRT models and then computed IRT and CTT statistics from these values, performing three sets of correlations. First, they tested the comparability of test scores, difficulty and item discrimination by correlating estimated IRT and CTT statistics; they found very high comparability for test scores and difficulty, and less comparability for item discrimination. Next, they correlated values obtained from different samples to test invariance; they found exceptional invariance, with CTT slightly closer to perfect invariance than IRT.

Courville (2004), in his empirical comparison of item response theory and classical test theory item/person statistics, focused on two central themes: (1) how comparable are the item and person statistics derived from the item response and classical test frameworks? and (2) how invariant are the item statistics from each measurement framework across examinee samples? The ACT Assessment, composed of four tests (English, Mathematics, Reading, and Science), was used for the study. Random samples of 80,000 examinees, composed of 40,000 males and 40,000 females, were drawn from a population of 322,460. The results of this study indicate high correlations between CTT-based and IRT-based estimates, at least for the one-parameter and two-parameter models; this result holds for both small-sample clinical trials and large-sample assessment situations. Similarly, the CTT item difficulty estimates under the random sampling plan had a higher degree of invariance than the IRT-based item difficulty estimates, especially for the two- and three-parameter models. The discrimination indices, however, correlated highly only when the spread of discriminations was large and the spread of difficulty values was small.

Progar et al. (2008), in their study titled "An empirical comparison of Item Response Theory and Classical Test Theory", used a real data set from the Third International Mathematics and Science Study (TIMSS 1995) to address the following questions: (1) How comparable are CTT-based and IRT-based item and person parameters? (2) How invariant are CTT-based and IRT-based item parameters across different participant groups? (3) How invariant are CTT-based and IRT-based item and person parameters across different item sets? The findings indicate that the CTT and IRT item/person parameters are very comparable, that the CTT and IRT item parameters show similar invariance when estimated across different groups of participants, that the IRT person parameters are more invariant across different item sets, and that the CTT item parameters are at least as invariant across different item sets as the IRT item parameters. The results furthermore demonstrate that, with regard to the invariance property, IRT item/person parameters are in general empirically superior to CTT parameters, but only if the appropriate IRT model is used for modelling the data.
Zaman et al. (2008), in their study comparing CTT and IRT for ranking students on the basis of their abilities on an objective-type test in Physics at secondary level, took a random sample of 400 ninth-grade students from a variety of populations in Pakistan, using a content-valid test of 80 multiple-choice items. They found that CTT-based and IRT-based examinee ability estimates were very comparable and highly correlated (0.95), indicating that estimating the ability level of individual examinees will lead to similar results across the different measurement theories.

Magno (2009) conducted a study to demonstrate the difference between the classical test theory (CTT) and item response theory (IRT) approaches, using a random sample of 219 junior high school students in the Philippines and actual test data in chemistry. CTT and IRT were compared across two samples and two forms of a test on item difficulty, internal consistency and measurement error. The results demonstrate certain limitations of classical test theory and advantages of using IRT. It was found in the study that IRT estimates of item difficulty did not change across samples, whereas the CTT estimates were inconsistent; IRT difficulty indices were also more stable across forms of the test than the CTT ones; IRT internal consistencies were very stable across samples while CTT internal consistencies failed to be stable across samples; and IRT had significantly smaller measurement errors than CTT.

Adedoyin (2010) investigated the invariance of person parameter estimates based on classical test and item response theories. Eleven items that fitted the 2PL model, drawn from the 40 items of Paper 1 of the Botswana junior secondary mathematics examinations, were used to estimate person ability; a random sample of five thousand examinees (5,000) was drawn from the population of thirty-five thousand, five hundred and sixty-two (35,562) who sat for the examination. The person parameter estimates from CTT and IRT were tested for invariance using repeated-measures ANOVA at the 0.05 significance level. The findings of the study show that there is a gross lack of invariance when classical test theory (CTT) is used to estimate person parameters or ability, whereas the IRT person parameter estimates exhibited the invariance property across subsets of items.

Ojerinde et al. (2012) evaluated Use of English pre-test data in order to compare indices obtained using the 3-parameter model of item response theory (IRT) with those from the classical test theory (CTT) approach, and to verify how well the two can predict actual test results and the degree of their comparability, using a sample of 1,075 test takers who took one version of the UTME Use of English pre-test. The findings of this study indicated that the person and item statistics derived from the two measurement frameworks are quite comparable. The degree of invariance of item statistics across samples, usually considered the theoretical superiority of IRT models, also appeared to be similar for the two measurement frameworks, but the IRT model provided a better idea of the internal consistency of the test than CTT. The 3PL model was found to be more suitable for multiple-choice questions in an ability test, while involving a more complex mathematical estimation procedure than CTT. Overall, the indices obtained from both approaches gave valuable information, with comparable and almost interchangeable results.

Pido (2012) conducted a study to determine and compare the item parameters of the MCQs in the 2011 Uganda Certificate of Education (UCE) examinations using the CTT and IRT approaches. Four subjects (the Geography, Chemistry, Physics and Biology papers) were used, and 480 scripts of examinees in each subject were selected as the sample for the study. The examinees' responses were analysed using the Xcalibre 4.1.7.1 software to determine item parameters based on the CTT and IRT approaches. The correlation coefficient and inspection methods were used to compare the difficulty and discrimination indices obtained from the CTT and IRT approaches. The results revealed a very high correlation between the difficulty indices obtained using the CTT and IRT approaches; a similar result was found for the discrimination indices. The overall result revealed a strong relationship between the values of the item parameters estimated using the CTT approach and those estimated using the IRT approach.

Abedalaziz and Leng (2013), in their study examining the relationship between the CTT and IRT approaches in analysing item characteristics, aimed to compare the item difficulty and item discrimination of a mathematical ability scale using the two methods across 1-, 2- and 3-parameter models. The instrument was administered to a tenth-grade sample of N = 602. The data gathered were analysed for possible relationships between the item characteristics obtained using the CTT and IRT methods. The results indicate that the 3-parameter logistic model has the indices most comparable with CTT; furthermore, the CTT and IRT models (the 1-parameter and 3-parameter logistic models) can be used independently or together to describe the nature of item characteristics.

Nenty and Adedoyin (2013), in their study, compared and tested for significance the invariance between item parameter estimates from CTT and those from each of the 2- and 3-parameter IRT models, for inter- and intra-model validation of the two test theories. A total of 10,000 junior secondary school pupils were randomly selected from the population of 36,940 pupils who sat for the 2010 mathematics Paper 1 examination in Botswana. The estimated CTT and IRT item parameters were tested for significance with respect to the invariance concept using dependent t-tests across the two theory models. The results showed that the differences in item parameter estimates between CTT and the 2- and 3-parameter IRT models were not significant, which established inter-model invariance of item parameter estimates for CTT and the 2- and 3-parameter IRT models. The differences in item difficulty parameter estimates between the 2- and 3-parameter IRT models were also not statistically significant, but there was a statistically significant difference in the item discrimination parameter estimates, showing that intra-model invariance of item discrimination parameter estimates for the 2- and 3-parameter IRT models could not be established.
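The paired-comparison logic of this design can be outlined as follows: place the two sets of item parameter estimates on a common scale, then run a dependent (paired) t-test over items, reading a non-significant difference as support for invariance. The sketch below is illustrative only (the numbers are invented, not the study's data):

```python
import numpy as np
from scipy import stats

# Invented difficulty estimates for the same 10 items under two calibrations
# (e.g., 2PL vs. 3PL), assumed already standardized to a common scale.
b_2pl = np.array([-1.2, -0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 1.1, 1.4, 1.8])
b_3pl = np.array([-1.1, -0.9, -0.2, 0.1, 0.2, 0.4, 1.0, 1.0, 1.5, 1.7])

# Dependent t-test over items, as in the study's invariance testing.
t, p = stats.ttest_rel(b_2pl, b_3pl)
print(f"t = {t:.3f}, p = {p:.3f}")  # non-significant -> invariance not rejected
```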
In a study conducted to assess the comparability of classical test theory (CTT) and item response theory (IRT) models in estimating test item parameters, Adedoyin and Adedoyin (2013) found that the CTT and IRT item difficulty and item discrimination values were positively linearly correlated, and that there was no statistically significant difference between the item difficulty and item discrimination parameter estimates from CTT and IRT.

Ojerinde (2013) conducted a study to evaluate the psychometric utility of data obtained using the two models in the analysis of a UTME Physics pre-test, so as to examine the results obtained and determine how well the two can predict actual test results and the degree of their comparability. The researcher also verified the conditions to be fulfilled for IRT to be usefully applied to real test data. Findings showed that the IRT model was more suitable for multiple-choice questions (MCQs) in an ability test, while involving a more complex mathematical estimation procedure than the classical approach. Overall, the indices obtained from both approaches gave valuable information, with comparable and, in some cases, almost interchangeable results.

In another study, Guler et al. (2014) compared classical test theory and item response theory in terms of item parameters; the aim of their study was to empirically examine the similarities and differences in the parameters estimated using the two approaches. A random sample of 1,250 students was drawn from the group of 5,989 students who had taken the 25-item Turkish high schools entrance exam (HSEE) in 2003. The findings reveal the highest correlations between CTT and the 1-parameter IRT model (0.99) in terms of item difficulty parameters, and between CTT and the 2-parameter IRT model (0.96) in terms of item discrimination parameters. The results also show the lowest level of correlation between the 3-parameter model and CTT, although the 3-parameter model was identified as the most congruous in terms of model-data fit. In the light of their findings, it may be said that there is not much difference between using the 1- or 2-parameter IRT model and CTT; however, in cases where the probability of guessing is high, there is a significant difference between the 3-parameter model and CTT.

Where researchers correlated estimated statistics with the true (generating) statistics to examine accuracy, the studies demonstrated that the item and person parameters generated by CTT and IRT were very accurate and highly comparable across all conditions. In the case of the item discrimination indices, however, the IRT statistic accurately estimated the discrimination value in all conditions, whereas the item discrimination value under the CTT framework was estimated accurately only when the potential item pool had a narrow range of item difficulty levels. Generally, the studies show a high degree of measurement accuracy for both the CTT and IRT frameworks, and provide no substantial evidence of the superiority of IRT over CTT.