
PHYSICAL REVIEW PHYSICS EDUCATION RESEARCH 12, 020135 (2016)

Analysis test of understanding of vectors with the three-parameter logistic model of item response theory and item response curves technique

Suttida Rakkapao,1,* Singha Prasitpong,2 and Kwan Arayathanitkul3

1Department of Physics, Faculty of Science, Prince of Songkla University, Hat Yai, Songkhla 90110, Thailand
2Faculty of Education, Thaksin University, Muang Songkhla, Songkhla 90000, Thailand
3Department of Physics, Faculty of Science, Mahidol University, Ratchathewi, Bangkok 10400, Thailand
(Received 6 November 2015; published 25 October 2016)
This study investigated the multiple-choice test of understanding of vectors (TUV), by applying item
response theory (IRT). The difficulty, discriminatory, and guessing parameters of the TUV items were fit
with the three-parameter logistic model of IRT, using the PARSCALE program. The TUV ability is an ability
parameter, here estimated assuming unidimensionality and local independence. Moreover, all distractors of
the TUV were analyzed from item response curves (IRC) that represent simplified IRT. Data were gathered
on 2392 science and engineering freshmen, from three universities in Thailand. The results revealed IRT
analysis to be useful in assessing the test since its item parameters are independent of the ability parameters.
The IRT framework reveals item-level information, and indicates appropriate ability ranges for the test.
Moreover, the IRC analysis can be used to assess the effectiveness of the test’s distractors. Both IRT and
IRC approaches reveal test characteristics beyond those revealed by the classical analysis methods of tests.
Test developers can apply these methods to diagnose and evaluate the features of items at various ability
levels of test takers.

DOI: 10.1103/PhysRevPhysEducRes.12.020135

I. INTRODUCTION

The test of understanding of vectors (TUV), developed by Barniol and Zavala in 2014, is a well-known standard multiple-choice test of vectors for an introductory physics course at the university level. The TUV consists of 20 items with five choices for each test item, and covers ten main vector concepts without a physical context. A source of strength for the TUV is that the choices were constructed based on responses to open-ended questions, posed to over 2000 examinees. The TUV assesses more vector concepts than other previous standard tests of vectors, and its reliability as an assessment tool has been demonstrated by five classical test assessment methods: the item difficulty index, the item discriminatory index, the point-biserial coefficient, the Kuder-Richardson reliability index, and Ferguson's delta test [1]. However, the framework of classical test theory (CTT) for test assessment has some important limitations. For instance, the item parameters depend on the ability distribution of examinees, and the ability parameters depend on the set of test items. To overcome these shortcomings, item response theory (IRT) was introduced [2–5].

Therefore, the purpose of this study is to explore the 20-item TUV within the framework of IRT. We will first present the key concepts of IRT, focusing on the three-parameter logistic (3PL) model used in the study (Sec. II). Since IRT describes only the functioning of correct answers to items, we will also investigate the distractors of the TUV items using the item response curves (IRC) technique (Sec. III). Section IV describes the collection of data with the Thai-language version of the TUV from first-year university students (N = 2392). Section V (results and discussion) is divided into three main parts. In part A, we present some limitations of CTT, advantages of IRT, and the significance of using 3PL-IRT as shown by our data. Parts B and C address the results and discussion of the 3PL-IRT and IRC analyses, respectively. Last, in Sec. VI, we summarize what we did and found in this study.

*Corresponding author: [email protected]

Published by the American Physical Society under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

II. ITEM RESPONSE THEORY (IRT)

The IRT framework rests on the assumption that the performance of an examinee on a test item can be predicted from the item's typical features and the examinee's latent traits, often called abilities or person parameters. The relationship between the examinees' performance on an item and their ability is described by an item characteristic function (or item response function), which quantifies how the probability of a correct response to a specific item increases with the level of an examinee's ability. The graph of this relationship is known as the item characteristic curve (ICC). The empirical ICCs in prior published research relevant to IRT tend to be S shaped, or sigmoidal. As the ability increases, the empirical ICC rises slowly at first, more sharply in the middle, and again slowly at very high levels of ability. In its early days, the normal ogive function was commonly used to model the ICC, while nowadays the logistic function is a popular alternative, as shown in Eq. (1). An example of the ICC from our data, established by fitting a logistic function, is shown in Fig. 1.

FIG. 1. Item characteristic curve (ICC) of item 1 in the TUV, modeled from data on Thai students (N = 2392).

In item response function models, the ability parameter is usually standardized to zero mean and unit standard deviation (SD), and is commonly denoted by the Greek letter theta (θ). Theoretically, ability can range from −∞ to ∞, but its values in practice are usually in the interval [−3, 3]. This interval would contain 99.7% of cases if the standardized variable is normally distributed, i.e., a z score. We assumed unidimensionality, meaning that a single dominant ability characteristic in the students influences their test performances. We then defined "TUV ability" as the single trait influencing a student's performance in the test of understanding of vectors, similar to the FCI ability reported in a study by Scott and Schumayer in 2015 [6]. Moreover, we made the local independence assumption that performance on one item is independent of that on another item, and only depends on the ability.

The TUV consists of multiple-choice questions with responses sometimes based on guessing, and the three-parameter logistic (3PL) model of IRT is appropriate for the investigation of the related item parameters (i.e., the difficulty, discriminatory, and guessing parameters). In the item response function, the probability of answering item i correctly for an examinee with ability θ is given by

P_i(θ) = c_i + (1 − c_i) / (1 + exp[−D a_i (θ − b_i)]),   (1)

where b_i is the item difficulty parameter (θ at the inflection point of the ICC), and a_i is the slope of the ICC at that inflection point, called the item discriminatory parameter. The lower asymptote level c_i of the ICC, which corresponds to the probability of a correct answer at very low ability levels, is referred to as the pseudo-chance level or sometimes as the guessing parameter. The constant D normally equals 1.7, and is used to make the logistic function as close as possible to the normal ogive function. The 3PL model can be reduced to the two-parameter logistic (2PL) model by setting c = 0. The 2PL model is most plausible for open-ended questions, in which responses are rarely guesses. Moreover, this can be further reduced to a one-parameter logistic (1PL) model by considering only the b parameter, at which the probability of a correct response is 0.5, while holding a fixed [2,7–8]. The special cases with D = 1.0, c = 0, and a = 1.0 are known as Rasch models [9].

In this study, we applied the PARSCALE program, written by Muraki and Bock (1997), to fit the 3PL model to the dichotomous TUV data. PARSCALE uses the expected a posteriori (EAP) method to estimate the abilities, and marginal maximum likelihood (MML) to estimate the item parameters. The estimated abilities are scaled to mean 0 and unit SD. The numeric search for optimum parameters uses Newton-Raphson iterations, and the program outputs both numerical results (parameters and diagnostics) and graphs [10].

III. ITEM RESPONSE CURVES (IRC)

Introduced by Morris and colleagues in 2006, IRC analysis is a simplification of IRT for evaluating multiple-choice questions and their options [11–12]. It relates the percentage of students (on the vertical axis) to each ability level (on the horizontal axis), separately for each choice in a given item of the test. Unlike IRT, which only considers whether the correct choice was made, IRC analysis displays the effectiveness of every choice. In other words, the information provided by wrong answers is also put to use. Moreover, it is easy to use and its results are easy to interpret. The IRC technique can help test developers improve their tools by identifying nonfunctioning distractors that can then be made more appropriate to the examinees' abilities.

This study approaches IRC analysis in a slightly different way from prior analyses [11–12], but without essential differences. These prior studies used the total score for the test as an estimate of the ability level across the students, and the two are indeed strongly correlated. However, this strong correlation between total score and ability does not necessarily carry over to individual items of the test, nor to right or wrong responses to specific items, as also mentioned in Refs. [11–12]. Therefore, on applying the IRC technique, the ability of each examinee was estimated by the PARSCALE program, which uses optimality criteria instead of the total score as a surrogate for ability.
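For concreteness, Eq. (1) can be evaluated directly. The short sketch below is our own illustration (not part of the original analysis, which used PARSCALE), using the item 1 parameters a = 0.91, b = 0.76, c = 0.13 reported later in Sec. V B:

```python
import math

D = 1.7  # scaling constant matching the logistic to the normal ogive

def p_correct(theta, a, b, c):
    """3PL probability of a correct response, Eq. (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta = b the probability is midway between c and 1, i.e., (1 + c)/2.
print(round(p_correct(0.76, a=0.91, b=0.76, c=0.13), 3))  # 0.565
```

At very low θ the curve flattens onto the guessing floor c, and at very high θ it approaches 1, reproducing the sigmoidal shape of Fig. 1.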

IV. DATA COLLECTION

The 20-item TUV was translated into Thai and validated by a group of Thai physics professors. The professors were asked to perform the TUV in both English and Thai within a 2-month period. The test was revised based on suggestions from the professors with regard to technical terms and translations from English to Thai. We applied the TUV to 2392 science and engineering first-year students at three universities. These students had learned vector concepts through lectures integrated with video demonstrations and group discussions, with approximately 300 students in each class. The participants took 25 min to complete the TUV, a month after the end of the classes. We used the Kuder-Richardson reliability index to measure the whole-test self-consistency of the TUV Thai version, and Ferguson's delta test for the discriminatory power of the whole test. The obtained indicator values, 0.78 for the KR-20 index and 0.97 for Ferguson's delta, are within the desired ranges [13]. The collected data were analyzed using the 3PL model in IRT and IRC plots.

V. RESULTS AND DISCUSSION

A. Significance of using 3PL-IRT

In the context of the data on Thai students, CTT presented some shortcomings and IRT analysis some advantages. We demonstrate these observations via the item difficulty index (P) and the point-biserial coefficient (r_pbs) of the CTT framework. The difficulty index of an item (P) measures the proportion of students who answer it correctly. The point-biserial coefficient (r_pbs) measures the Pearson correlation of the scores for a single item with the whole-test scores, to reveal the discrimination information of that item. When we calculated P and r_pbs for our TUV data, we found that certain items fell outside the criterion ranges of P = [0.3, 0.9] and r_pbs ≥ 0.2 [1,13]. Only the results for items 1, 2, and 3 are shown in Table I, for discussion in this part. Item 2 (P = 0.12) and item 3 (P = 0.24) appear to have been somewhat difficult, and item 2 (r_pbs = 0.17) was less useful in discriminating Thai students who know the cross product of vectors from those who do not.

To explore the invariance of item indices within the sample group when the ability of test takers changes, we divided the Thai students into three groups by ability level. Overall, the Thai students' TUV ability (θ) varied from −1.7 to 3.1 with mean 0 and SD 1, as provided by the PARSCALE program for a single-trait model in IRT [10]. Using ranges below, within, and above the interval [mean ± 0.5 SD] in TUV ability (θ), we partitioned the Thai students into low-ability (N = 799), medium-ability (N = 833), and high-ability (N = 760) groups. As shown in Table I, both the P and r_pbs values of each item changed when the ability levels of the students changed. The P and r_pbs values calculated from the low-ability group of examinees are likely to disagree with the norm values more than those of the other groups. This indicates that the item difficulty and discriminatory indices analyzed in the framework of CTT depend on a particular group of examinees. This is one important shortcoming of CTT. Moreover, in CTT, an examinee's ability, being defined by the observed true score of the test, depends on the item features. Simply, the item parameters depend on the ability distribution of examinees and the person parameters depend on the set of test items. Furthermore, CTT makes the assumption of equal errors for all ability parameters. There is no probability information available about how examinees of a specific ability might respond to a question. Generally, CTT focuses on test-level information, and depends on a linear model [2–5].

As mentioned earlier, IRT is an alternative that overcomes some disadvantages of CTT. IRT is based on nonlinear models, makes strong assumptions, and focuses on item-level information. An ability parameter and its individual error are test independent, and are estimated from the patterns in the test responses. The item and ability parameters should be invariant if the model optimally fits the test data and the sample size is large enough [2–3,7–8]. These are theoretical advantages of IRT over CTT; however, some empirical studies have reported that the item and person parameters derived by the two measurement frameworks are quite comparable [4–5].

Using the IRT framework, we explored the invariance of the item parameters relative to the abilities of the test takers by applying the 3PL model to the data on Thai students taking the TUV test, taking item 1 of the test as an example.

TABLE I. Item difficulty index (P) and point-biserial coefficient (r_pbs) from CTT analysis for items 1, 2, and 3 in the TUV, overall and for three ability levels of Thai students.

                                                   Item 1         Item 2         Item 3
                                                   P      r_pbs   P      r_pbs   P      r_pbs
Overall         θ = [−1.7, 3.1]  (N = 2392)        0.39   0.48    0.12ᵃ  0.17ᵃ   0.24ᵃ  0.36
Low ability     θ = [−1.7, −0.6] (N = 799)         0.15ᵃ  0.17ᵃ   0.11ᵃ  0.18ᵃ   0.14ᵃ  0.25
Medium ability  θ = [−0.5, 0.5]  (N = 833)         0.34   0.13ᵃ   0.11ᵃ  0.25    0.20ᵃ  0.28
High ability    θ = [0.6, 3.1]   (N = 760)         0.70   0.24    0.16ᵃ  0.27    0.39   0.38

ᵃ Outside the criterion range.
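The three-group split by the interval [mean ± 0.5 SD] can be sketched as follows. This is a minimal illustration of ours using synthetic abilities; in the study, the θ values came from the PARSCALE EAP estimates:

```python
import numpy as np

def partition_by_ability(theta, half_width=0.5):
    """Split ability estimates into low/medium/high groups, using
    [mean - half_width*SD, mean + half_width*SD] as the middle band."""
    mean, sd = theta.mean(), theta.std()
    lo_cut, hi_cut = mean - half_width * sd, mean + half_width * sd
    low = theta[theta < lo_cut]
    medium = theta[(theta >= lo_cut) & (theta <= hi_cut)]
    high = theta[theta > hi_cut]
    return low, medium, high

rng = np.random.default_rng(0)
theta = rng.standard_normal(2392)  # stand-in for the estimated TUV abilities
low, med, high = partition_by_ability(theta)
print(len(low), len(med), len(high))  # three groups covering all 2392 students
```

For standardized abilities the middle band [−0.5, 0.5] captures roughly 38% of a normal population, which is consistent with the group sizes (799, 833, 760) reported above.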
As shown in Fig. 1, the 3PL model fit to the response data can be displayed as the item characteristic curve (ICC), representing the probability of a correct response [P(θ)] across various TUV abilities (θ). The parameters for item 1, when fit to the entire data set (N = 2392), are a = 0.91, b = 0.76, and c = 0.13, as computed by the PARSCALE program. When we separately fit the logistic model to the subset data for the low-ability, medium-ability, and high-ability groups, the item parameters remained unchanged. This indicates invariance of the item parameters relative to the subject population tested, which is desirable.

This can be simply explained by revising the logistic model of Eq. (1) into linear form. It can be rewritten as ln[(1 − P(θ))/(P(θ) − c)] = αθ + β, where α = −Da and β = Dab. This linearization has slope α and intercept β, while ln[(1 − P(θ))/(P(θ) − c)] is the log odds ratio at a given θ. Indeed, the same linear model should apply to any range of θ, giving the same values of α and β, and therefore unchanged a and b. A single 3PL-IRT model for item 1 corresponds to a linear relationship, valid for any range of θ (the low-ability, medium-ability, or high-ability groups), with fixed slope and intercept. However, this invariance property only holds when the model fits the data exactly in the population [2–3,7].

Several prior PER studies have applied the IRT framework to examine concept tests. For example, in 2010, Planinic and colleagues reported using the one-parameter logistic model in IRT to explore the function of the Force Concept Inventory (FCI) [14]. In the same year, the FCI was analyzed by Wang and Bao using the three-parameter logistic model in IRT, assuming the single-trait model [15]. However, in 2012, the study of Scott and colleagues showed that a five-trait model in IRT was suitable for analysis of the FCI [16]. Recently, the FCI has been analyzed using multitrait item response theory [6]. Aside from concept tests, IRT has also been applied to general online questions with large numbers of participants [17–18].

B. Analysis of 3PL-IRT

In applying IRT to the data gathered on Thai first-year university students (N = 2392), we used the PARSCALE program to fit the three-parameter logistic (3PL) models, one for each TUV item. We assumed a single ability, named the TUV ability, which represents the latent traits in each student that affect performance in the TUV. Each logistic model is determined by identifying its three parameters: discrimination a, difficulty b, and guessing c. These identified parameters are shown in Table II for the 20 TUV items, categorized by their concepts. Moreover, the item difficulty index (P) and the point-biserial coefficient (r_pbs) from the CTT framework are included as the last columns of Table II. The criterion ranges of the item parameters are shown by interval bounds in square brackets.

TABLE II. The model parameters identified in IRT analysis, namely, discrimination a, difficulty b, and guessing c, for the 20 items in the TUV categorized by concepts, along with the item difficulty index (P) and the point-biserial coefficient (r_pbs) from CTT analysis.

                                         IRT                               CTT
Vector concept             Item    a [0,2]   b [−2,2]   c [0,0.3]    P [0.3,0.9]   r_pbs ≥ 0.2
1. Direction                 5     0.84       0.18      0.40ᵃ        0.67          0.39
                            17     0.93       1.54      0.14         0.26ᵃ         0.39
2. Magnitude                20     0.84      −0.26      0.10         0.61          0.49
3. Component                 4     0.88       0.57      0.33ᵃ        0.57          0.41
                             9     1.18       1.09      0.28         0.42          0.42
                            14     0.69       0.56      0.26         0.53          0.40
4. Unit vector               2     1.12       2.82ᵃ     0.11         0.12ᵃ         0.17ᵃ
5. Vector representation    10     0.65       0.67      0.23         0.50          0.37
6. Addition                  1     0.91       0.76      0.13         0.39          0.48
                             7     0.96       0.25      0.24         0.57          0.47
                            16     0.97       0.91      0.15         0.37          0.47
7. Subtraction              13     1.07       0.78      0.08         0.34          0.53
                            19     1.12       0.86      0.05         0.30          0.54
8. Scalar multiplication    11     1.08       0.62      0.13         0.40          0.53
9. Dot product               3     1.04       1.70      0.14         0.24ᵃ         0.36
                             6     0.75      −0.45      0.16         0.67          0.43
                             8     0.59       0.83      0.00         0.26ᵃ         0.49
10. Cross product           12     1.02       1.74      0.08         0.17ᵃ         0.39
                            15     0.60       0.74      0.00         0.29ᵃ         0.49
                            18     1.17       1.16      0.12         0.28ᵃ         0.48

ᵃ Outside the criterion range.
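The linearization above is easy to verify numerically. The following check (ours, not the authors') uses the item 1 parameters a = 0.91, b = 0.76, c = 0.13 and confirms that ln[(1 − P(θ))/(P(θ) − c)] is linear in θ with slope α = −Da and intercept β = Dab:

```python
import math

D = 1.7

def p3pl(theta, a, b, c):
    # 3PL item response function, Eq. (1)
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_odds(theta, a, b, c):
    # ln[(1 - P) / (P - c)], which should equal alpha*theta + beta
    p = p3pl(theta, a, b, c)
    return math.log((1.0 - p) / (p - c))

a, b, c = 0.91, 0.76, 0.13  # item 1 parameters from the full data set
alpha, beta = -D * a, D * a * b
for theta in (-1.5, 0.0, 0.8, 2.5):
    assert abs(log_odds(theta, a, b, c) - (alpha * theta + beta)) < 1e-9
print("linear in theta with slope", alpha, "and intercept", beta)
```

Because the same line holds over any θ range, fitting within the low-, medium-, or high-ability subgroup must recover the same a and b, which is exactly the invariance argument made in the text.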
FIG. 2. Item 2 in the TUV, relating to the graphical representation of a unit vector in the direction of A⃗ = 2î + 2ĵ; panels (a)–(e) show the five answer choices.

The item difficulty b is the ability θ at the inflection point of the ICC. In the logistic model, at this point, the probability of a correct answer is (1 + c)/2, midway between the asymptote levels, as seen by substituting θ = b in Eq. (1). When c = 0, as in the 1PL and 2PL models, the probability of a correct answer is 0.5, and this could be used to identify b. Parameter b is named "difficulty" because a harder test item requires a higher ability b for a probability of 0.5 of a correct answer. The criterion range for b is chosen to be [−2, 2] [2]. Clearly, an item with b close to −2 is very easy, while one with b close to 2 is very difficult for the sample population of examinees.

The results in Table II show that item 2, involving the unit vector concept, was the most difficult question for the group of Thai students, with the maximum b = 2.82. Item 2 of the TUV is displayed in Fig. 2. Only 12% of the students correctly answered item 2 (choice C) (P = 0.12). The most popular distractor for the students was choice B (61%). Interview results showed that these students understand that a unit vector of A⃗ has magnitude 1 and points in the same direction, but they thought the î + ĵ vector in choice B has magnitude 1. This indicates that they did not have difficulties with the direction of the unit vector, but did not know the basic mathematics for calculating a vector's magnitude. The same misconception was reported in the study of Barniol and Zavala [19]. Since only a few students from the high-ability group understood both the direction and the mathematics needed to determine the magnitude of a vector, item 2 clearly presented the most difficulty to the normal-ability students.

Although difficulty b from the IRT analysis and the P value from the CTT analysis differ in their theoretical interpretations, they tend to be in good agreement. For example, item 12 for the cross product (b = 1.74) and item 3 for the dot product (b = 1.70) were somewhat difficult for the students, according to IRT. Their correct response rates were only 17% and 24%, respectively, indicating them as hard items according to the P values also. Further agreement was found in item 6, whose P value indicated an easy question, with about 67% correct answers, consistent with difficulty b = −0.45. However, the two measures of difficulty seem to disagree on items 8 and 15. Viewed through CTT, they appeared to be quite difficult items, with correct answer rates below 30%, but their respective b values were 0.83 and 0.74, quite far from the upper bound of 2 for the difficulty parameter of IRT. As discussed earlier, this is one downside of the CTT approach, whose results depend on the examinees' ability. The Thai students had low ability levels, with two-thirds of them below ability 0.5, as shown in Table I. The P values are then biased downwards, and this bias favors labeling items as hard or difficult.

The discriminatory parameter a for an item is related to the slope of the ICC at point b: for the 3PL model, the slope of the ICC at θ = b is actually Da(1 − c)/4. Items with high a values, or steeper slopes, are more useful for distinguishing between examinees with closely similar abilities near b. The typical values of a are in the interval [0, 2] [2,8]. Several TUV items displayed very high a values, such as items 9, 2, 19, and 18, and these provide discrimination around their respective b values. In contrast, viewed through the CTT framework, the point-biserial coefficient of item 2 (r_pbs = 0.17) indicates low discrimination power.

The guessing parameter c of an item represents the probability that an examinee with a very low ability level answers correctly. This may relate to the attractiveness of the answer choices and/or the guessing behavior of the examinees. Its value equals the level of the lower asymptote of the ICC, and it ranges from 0 to 1. Typically, c should be less than 0.3 [8]. Table II shows that, overall, the very low-ability students had less than a 30% chance of choosing the correct option for most TUV items, with the exceptions of items 4 and 5.
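The quoted inflection slope Da(1 − c)/4 can be confirmed with a numerical derivative of Eq. (1). This small check is ours, using the item 19 parameters from Table II:

```python
import math

D = 1.7

def p3pl(theta, a, b, c):
    # 3PL item response function, Eq. (1)
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def slope_at(theta, a, b, c, h=1e-6):
    # central-difference derivative of the ICC with respect to theta
    return (p3pl(theta + h, a, b, c) - p3pl(theta - h, a, b, c)) / (2 * h)

a, b, c = 1.12, 0.86, 0.05  # item 19 parameters from Table II
analytic = D * a * (1 - c) / 4
assert abs(slope_at(b, a, b, c) - analytic) < 1e-6
print(round(analytic, 3))  # 0.452
```

The factor (1 − c)/4 shows why a large guessing parameter flattens the curve even when a is held fixed: part of the rise from c to 1 is spent on the guessing floor.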
Item 5 involves selecting a vector with a given direction from among several in a graph, and very poorly performing students had a 40% chance to answer or guess it correctly (c = 0.40). Possibly the correct choice was the most attractive one to the very low-ability students. The most common incorrect response (choice A) and the correct response (choice C) share the belief that vectors pointing into the same quadrant (northeast) have the same direction [1,20]. This may enable the low-ability students to score by chance on item 5 (c = 0.40).

In contrast, the probability that a student with very low ability would correctly answer item 8 or item 15 was nil (c = 0.00). This is an indication that strongly held misconceptions are represented in the distractors that appear as answer choices in these questions. Item 8 and item 15 involve the calculation of the dot product (A⃗ · B⃗) and the cross product (A⃗ × B⃗) of the vectors A⃗ = î + 3ĵ and B⃗ = 5î, respectively. Our results show that choice C, 5î + 3ĵ, was the distractor most commonly chosen in both item 8 (41%) and item 15 (35%). The misconception is that the dot or cross product of two vectors with an identical unit vector is the same unit vector and can be combined with other vectors [1]. Low-ability students, who have difficulties with calculations of the dot and cross product that involve unit vector notation, have no chance (c = 0.00) to score on either item. In fact, they would have a better chance if they just guessed without reading the question, as this would give the correct choice with probability 1/m in a multiple-choice question with m options, as explained by the CTT framework. That probability is 0.2 in these TUV items, with five choices in each.

Let us assess some sets of questions that measured the same vector concept, as categorized previously in Table II. Among the items testing the component-of-a-vector concept, items 4 and 14 gave quite similar model parameters. Item 4 had higher discrimination power than item 14, but a larger guessing value. These items separate the examinees around the same ability level b, so they can be considered parallel questions for one concept at a fixed difficulty level. Items 1, 7, and 16, testing the addition of vectors in different contexts, display very similar a values, or sensitivities around ability = b, so we can select which item to use based on the b value (or the ability) we focus on.

To show how the probability of a correct response to a specific item depends on the ability of an examinee, we build the item characteristic curve (ICC). The ICC of a well-designed question should have a sigmoidal S shape [2,7]. Then the probability of a correct response would consistently increase with ability, and a high slope at the inflection point would indicate sharp separation by ability around that point. As shown by the solid line in Fig. 3, item 19 of the TUV mostly agrees with these criteria in our data on Thai students. It has high discrimination power (a = 1.12) for separating examinees at a medium ability level near b = 0.86, and very low-ability students have a 5% chance to correctly answer the item (c = 0.05). This item asks students to choose the vector difference A⃗ − B⃗ of two vectors (A⃗ = −3î, B⃗ = 5î) in the arrow representation. The most frequent error is adding instead of subtracting, and choosing the A⃗ + B⃗ option (choice B). Students just overlap the two arrows, cancel the overlapping parts in opposite directions, and answer with the remaining part. In some sense, many students seem to believe that the opposite arrow has already accounted for the subtraction, so they just add it to the other arrow instead of subtracting it [1,21]. The low-ability students may hold such a misconception and have a lower scoring probability (5%) than they would have from a random guess, while the high-ability students easily master the concept of graphical subtraction of vectors in one dimension. Item 19 was therefore considered the most appropriate question in the TUV for separating low- and high-ability students (b = 0.86).

Moreover, the steepness of the ICCs for items 9, 11, 13, and 18 (multiple-choice items not shown here) was very similar to that of item 19, but with different ability thresholds or guessing parameters. Simply, a curve with a greater b value is shifted further to the right (items 9 and 18), while a greater c value raises the bottom of the curve (item 9). As shown in Fig. 3, the ICC of item 2 has the same slope as that of item 19, but it discriminates at a very high ability level (b = 2.82). The curve of item 2 is very flat at the low-ability levels with θ < 0 and rises rapidly at the high-ability levels. Also, the ICCs of items 3 and 12 were similar to that of item 2, but moved further to the left. Roughly, the ICC of item 17 is flatter than that of item 19 owing to its smaller a parameter, and it is similar to those of items 1, 7, 16, and 20. The slope of item 5's curve at its inflection point is close to that of item 17, but the curve is "lifted up" by its greater guessing parameter (c = 0.40).

FIG. 3. The ICCs for items 2, 5, 8, 17, and 19 in the TUV, according to data on Thai students.
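The shifts and lifts described above follow directly from Eq. (1). The sketch below (our illustration) compares items 19 and 2, which have the same discrimination parameter a but very different b, showing how the larger b moves the useful discrimination range to higher abilities while leaving the curve near its guessing floor at medium θ:

```python
import math

D = 1.7

def p3pl(theta, a, b, c):
    # 3PL item response function, Eq. (1)
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

item19 = dict(a=1.12, b=0.86, c=0.05)  # Table II parameters
item2 = dict(a=1.12, b=2.82, c=0.11)   # same a, much larger b

# At medium abilities item 19 is already informative, while item 2 is
# still near its floor; its ICC is effectively item 19's shifted right.
for theta in (0.0, 1.0, 2.0):
    print(theta, round(p3pl(theta, **item19), 2), round(p3pl(theta, **item2), 2))
```

Plotting these two curves over θ ∈ [−3, 3] reproduces the qualitative picture of Fig. 3.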
Item 5's lifted curve is similar to that of item 4, but with different b values. As shown in Fig. 3, the ICC of item 8 is flatter than those of the other items of the TUV, and it has the lowest a value. TUV items with low discrimination power can still be within the criterion ranges, such as items 15, 10, 14, and 6, which present ICCs similar to that of item 8.

Overall, the results show that each TUV item has proper discrimination power at its specific difficulty (or ability level). Clearly, the steepness of the curve demonstrates the capability of the item to discriminate, in the ability domain, between examinees who understand the concept targeted by the item and those who do not. In general, a set of test items or questions should cover the range of ability domains in which the test takers are expected to differ. To determine how well the TUV does in testing adoption of the vector concepts by examinees at their various ability levels, the test information function was investigated in the IRT framework. The information function for a test at θ, denoted I(θ), is defined as the sum of the item information functions at θ:

I(θ) = Σ_{i=1}^{n} I_i(θ) = Σ_{i=1}^{n} [P_i′(θ)]² / [P_i(θ) Q_i(θ)],   (2)

where I_i(θ) is the item information function of item i, P_i′(θ) is the derivative of P_i(θ) with respect to θ for item i, and Q_i(θ) = 1 − P_i(θ) [2,7]. In Eq. (2), the information of one item is clearly independent of the other items in the test. This feature is not available in CTT; for example, the point-biserial coefficient for an item is influenced by all items in the test. A plot of I(θ) for the 20-item TUV across ability levels is shown in Fig. 4. The information curve peaks sharply to its maximum value of 7.5 at θ = 1.1. This indicates that the TUV test provides information about the vector concepts most effectively when the examinees have abilities roughly in the range from 0.1 to 2.1 (medium to high). For cases with abilities less than 0, the TUV test provides very little information that would distinguish their differences, while the results of the 20-item TUV test are highly sensitive to ability differences around ability 1.1. In general, the purpose of the test decides what type of information curve is desired. For example, a flat curve over the whole range of abilities indicates that the component items are sensitive to variations at different ability levels, so the test information obtained by summing the item information is evenly distributed. This is desired if the test serves to assess a wide variety of abilities. In contrast, a sharply peaked curve sensitively reports differences between the test takers only around that peak; elsewhere, it acts like a pass or fail threshold. This may be desirable when the purpose of the test is to provide eventual pass or fail labels. Test developers can benefit from these item and overall test information curves, to revise tests with such considerations of purpose in mind.

However, we have only analyzed data on test responses by Thai students. The item parameters of the TUV reported in Table II specifically apply to Thai first-year science and engineering students. When the TUV test is administered in another language to a different group of students, the item parameters will likely change from those obtained in the current study. Simply calibrating item parameters using IRT does not automatically make them universal. An equating or scaling procedure is needed to transform the item parameters of a test from one group of examinees to another, or, for a given group of examinees, the ability may need to be transformed from one test to another. Such equating usually assumes a linear relationship in the transformation of parameters.

There is some arbitrariness to the ability scale, and rescaling it gives an equally valid but different ability scale. We now examine such scaling along with transformations. Using the 3PL model in IRT, the probability of a correct response to item i by a person with ability θ is P_i(θ; a_i, b_i, c_i), as shown in Eq. (1). On linearly transforming the IRT ability, the probability of a correct response must not change, so P_i(θ; a_i, b_i, c_i) = P_i(θ*; a_i*, b_i*, c_i*), with the transformed parameters indicated by stars. Notice that the guessing c_i is on the probability unit of
measurement, so no transformation is necessary. The trans-
7 formation equations are bi ¼ Abi þ B, θ ¼ Aθ þ B, and
6 ai ¼ ai =A, where A and B are the scaling constants of the
linear transformation. These transformations do not change
Test information

5
ai ðθ − bi Þ, which is the invariant property of the item
4 response function [2,9,22]. Researchers can apply the trans-
formation equations to transform the item parameters of the
3
TUV reported in this article to another group of students.
2 Mathematical methods introduced to estimate the A and B
1 scaling constants include regression, the mean and sigma
method, the robust mean and sigma method, and the
0
characteristic curve method [2,9,22].
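These transformation equations can be verified in a few lines of code. The sketch below (Python, with made-up item parameters rather than the calibrated TUV values of Table II) applies the linear rescaling to a hypothetical item and checks that both the invariant a_i(θ − b_i) and the 3PL response probability are unchanged; the usual logistic scaling constant D = 1.7 is assumed.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response, P(theta; a, b, c)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def rescale(theta, a, b, c, A, B):
    """Linear rescaling of the IRT scale:
    theta* = A*theta + B, b* = A*b + B, a* = a/A, c* = c (a probability)."""
    return A * theta + B, a / A, A * b + B, c

# Illustrative (hypothetical) item parameters and ability:
theta, a, b, c = 1.1, 0.91, 0.3, 0.2
A, B = 2.0, -1.0  # scaling constants of the linear transformation

theta_s, a_s, b_s, c_s = rescale(theta, a, b, c, A, B)

invariant_before = a * (theta - b)          # a_i * (theta - b_i)
invariant_after = a_s * (theta_s - b_s)     # a_i* * (theta* - b_i*)
p_before = p_3pl(theta, a, b, c)
p_after = p_3pl(theta_s, a_s, b_s, c_s)
```

Both pairs agree to machine precision, which is exactly the invariance P_i(θ; a_i, b_i, c_i) = P_i(θ*; a_i*, b_i*, c_i*) that justifies equating across groups.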
FIG. 4. Test information curve of the 20-item TUV across ability levels of the Thai students tested.

C. Analysis of IRC

To show how well choices of an item function are distributed, we will now assess the item response curves


FIG. 5. The IRC of item 1, showing the percentage of respondents by choice, with symbols for the choices being ⋄ = A, □ = B, Δ = C, o = D, and × = E. The correct choice was E. The horizontal scale is the ability in plot 1(a), and the total test score in plot 1(b).

(IRCs), in which the vertical axis represents the percentage of respondents and the horizontal axis represents the ability level as created by the PARSCALE program; the total score is not used as a surrogate for ability, as in Refs. [11,12]. Students with the same overall test score need not share an ability level, because the pattern of their answers can still differ, but the total score and the ability typically have a robust correlation in a well-designed test. To check whether this makes a difference, a linear model was fit to predict the ability θ from the TUV raw score. The model θ = 4.84(raw score) − 6.76, with coefficient of determination R² = 0.98, can be used to estimate the ability from the total TUV raw score (within the data on Thai students, not necessarily in general). To display example IRC plots against both the ability and the total score, the curves for item 1 are shown in Fig. 5. Each of the five choices in the item gets its own IRC, with symbols ⋄ = choice A, □ = choice B, Δ = choice C, o = choice D, and × = choice E. The graphs are very similar, demonstrating the tight relation of the ability and the total score mentioned before. In this paper, we chose to present the IRCs plotted against the ability, in order to easily compare with the results of the IRT analysis.

The correct response in item 1 is choice E, which in the IRC plots has a consistently increasing trend. In other words, higher ability corresponds to a higher probability of the correct answer to item 1, which is desirable. This indicates suitable discrimination power of choice E in item 1, consistent with the results of the IRT analysis (a = 0.91, in Table II). The ICC shown in Fig. 1 is the logistic curve fit to the IRC for choice E. The most popular distractor in item 1 was D, which also has discrimination power.
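Computing an empirical IRC needs no model fitting: respondents are grouped into ability (or total-score) bins, and the fraction choosing each option is tabulated per bin. A minimal sketch, with a synthetic response pattern loosely resembling item 1 (invented for illustration, not the Thai data set):

```python
from collections import Counter, defaultdict

def item_response_curves(abilities, answers, bin_width=0.5):
    """Return {bin_center: {option: fraction}} for one item --
    the empirical IRC of every choice A-E."""
    counts = defaultdict(Counter)
    for theta, choice in zip(abilities, answers):
        center = round(theta / bin_width) * bin_width  # nearest ability bin
        counts[center][choice] += 1
    return {
        center: {opt: c[opt] / sum(c.values()) for opt in "ABCDE"}
        for center, c in sorted(counts.items())
    }

# Synthetic pattern resembling item 1: low-ability respondents favor
# distractor D, high-ability respondents favor the correct choice E.
abilities = [-1.0, -1.0, -0.9, 0.9, 1.0, 1.1]
answers = ["D", "D", "A", "E", "E", "E"]
curves = item_response_curves(abilities, answers)
```

Plotting the per-option fractions against the bin centers gives curves like those in Figs. 5 and 6: a rising curve for the correct choice and falling curves for discriminating distractors.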

FIG. 6. IRCs for items 2, 3, 5, and 10, representing the fractions of respondents at any given ability that chose option ⋄ = A, □ = B, Δ = C, o = D, or × = E.


Many low-ability students selected D, and the frequency of this choice consistently decreases with ability. The other distractors in item 1 also function quite well. This IRC pattern is similar to those of items 8, 11, 13, 15, 17, and 19.

In the IRC for item 2, shown in Fig. 6, the correct choice C has a relatively flat graph at low-ability levels and starts to increase around ability 1.5. This agrees with the IRT analysis, which only considers the correct option in an item, in showing discrimination power only at high-ability levels. Moreover, the IRC shows that the students were attracted to option B in item 2 almost uniformly across the ability levels. This distractor had poor discriminating power, as did the other distractors in item 2. The correct choice C in item 3 had discrimination power at abilities exceeding 1, which agrees with the IRT analysis (a = 1.04), while its distractors drew about equal attention from the low-ability students (ability < 1). This is similar to item 12. In item 5, the correct option C and the distractor A function very well, but the other distractors do not: they attracted few students at any ability level, with flat response curves. The IRCs of the remaining TUV items are similar to that of item 10, in which all choices functioned quite well: the distractors were more popular among the low-ability students, and their curves gently decreased with ability, while the trend for the correct choice was increasing with ability.

VI. CONCLUSIONS

Results of this study indicate that the 20-item TUV test, with 5 choices per item, is most useful for testing the vector concepts when the ability of the examinees ranges from medium to high. It can be applied as a pass or fail threshold instrument at a somewhat high ability value (θ ≈ 1.1). This insight is clearly provided by the test information function. Items 2, 3, and 12 are useful for separating examinees at high-ability levels. There is a very strongly held misconception represented in the distractors of items 8 and 15, shown by the biased preferences of low-ability students for these distractors. Because of this attraction bias of a distractor, the low-ability students have poorer performance in items 8 and 15 than random guessing would give. In contrast, the correct choices of items 4 and 5 are the most attractive responses to the very low-ability students, who have a >30% chance of answering each item correctly. Moreover, the IRC analysis covering all distractors disclosed that some distractors in the TUV did not function well. For example, choice B of item 2 had a flat response to ability, indicating that it discriminates poorly: the students were equally likely to choose distractor B regardless of their ability level. The distractors B, D, and E of item 5 did not function well either, attracting few students overall. However, as mentioned, the item and ability parameters reported in this study only pertain to the TUV responses of Thai first-year science and engineering students. Rescaling may be required for transfer of the current results to other groups of examinees.

Overall, the approach and findings of the current study may be used to develop and improve testing, and to enhance its sensitivity and effectiveness within a given range of abilities. Test developers can analyze item and ability parameters using IRT, and distractors in an item can be assessed with the IRC technique. Moreover, using IRT with the item parameters held constant, the same group of students can be tested before and after instruction to determine the learning gains in ability. Further studies of the TUV or its modifications could, in particular, explore the dimensionality of the latent traits and implement the multitrait model of item response theory. The current results and approach can directly benefit anyone who uses the TUV, to gain improved accuracy of diagnosis.

ACKNOWLEDGMENTS

The research was supported by the Thailand Research Fund and Prince of Songkla University under Grant No. TRG5780099. The authors are grateful to all participating students, Prof. Seppo Karrila, and Prof. Helmut Dürrast for valuable data and suggestions.

[1] P. Barniol and G. Zavala, Test of understanding of vectors: A reliable multiple-choice vector concept test, Phys. Rev. ST Phys. Educ. Res. 10, 010121 (2014).
[2] R. K. Hambleton, H. Swaminathan, and H. Rogers, Fundamentals of Item Response Theory (Sage Publications, Inc., Newbury Park, CA, 1991).
[3] R. K. Hambleton and R. W. Jones, Comparison of classical test theory and item response theory and their applications to test development, Educ. Meas. 12, 38 (1993).
[4] X. Fan, Item response theory and classical test theory: An empirical comparison of their item/person statistics, Educ. Psychol. Meas. 58, 357 (1998).
[5] P. MacDonald and S. Paunonen, A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory, Educ. Psychol. Meas. 62, 921 (2002).
[6] T. F. Scott and D. Schumayer, Students' proficiency scores within multitrait item response theory, Phys. Rev. ST Phys. Educ. Res. 11, 020134 (2015).
[7] C. L. Hulin, F. Drasgow, and C. K. Parsons, Item Response Theory: Application to Psychological Measurement (Dow Jones-Irwin, Homewood, IL, 1983).
[8] D. Harris, Comparison of 1-, 2-, and 3-parameter IRT models, Educ. Meas. 8, 35 (1989).
[9] M. J. Kolen and R. L. Brennan, Test Equating, Scaling, and Linking: Methods and Practices, 2nd ed. (Springer Science and Business Media, New York, 2004).
[10] M. du Toit, IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT User Manual (Scientific Software International, Inc., Lincolnwood, IL, 2003).
[11] G. A. Morris, L. Branum-Martin, N. Harshman, S. D. Baker, E. Mazur, S. Dutta, T. Mzoughi, and V. McCauley, Testing the test: Item response curves and test quality, Am. J. Phys. 74, 449 (2006).
[12] G. A. Morris, N. Harshman, L. Branum-Martin, E. Mazur, T. Mzoughi, and S. D. Baker, An item response curves analysis of the Force Concept Inventory, Am. J. Phys. 80, 825 (2012).
[13] L. Ding, R. Chabay, B. Sherwood, and R. Beichner, Evaluating an electricity and magnetism assessment tool: Brief electricity and magnetism assessment, Phys. Rev. ST Phys. Educ. Res. 2, 010105 (2006).
[14] M. Planinic, L. Ivanjek, and A. Susac, Rasch model based analysis of the Force Concept Inventory, Phys. Rev. ST Phys. Educ. Res. 6, 010103 (2010).
[15] J. Wang and L. Bao, Analyzing force concept inventory with item response theory, Am. J. Phys. 78, 1064 (2010).
[16] T. F. Scott, D. Schumayer, and A. R. Gray, Exploratory factor analysis of a Force Concept Inventory data set, Phys. Rev. ST Phys. Educ. Res. 8, 020105 (2012).
[17] G. Kortemeyer, Extending item response theory to online homework, Phys. Rev. ST Phys. Educ. Res. 10, 010118 (2014).
[18] Y. J. Lee, D. J. Palazzo, R. Warnakulasooriya, and D. E. Pritchard, Measuring student learning with item response theory, Phys. Rev. ST Phys. Educ. Res. 4, 010102 (2008).
[19] P. Barniol and G. Zavala, Students' difficulties with unit vectors and scalar multiplication of a vector, AIP Conf. Proc. 1413, 115 (2012).
[20] N. Nguyen and D. E. Meltzer, Initial understanding of vector concepts among students in introductory physics courses, Am. J. Phys. 71, 630 (2003).
[21] A. F. Heckler and T. M. Scaife, Adding and subtracting vectors: The problem with the arrow representation, Phys. Rev. ST Phys. Educ. Res. 11, 010101 (2015).
[22] M. L. Stocking and F. M. Lord, Developing a common metric in item response theory, Appl. Psychol. Meas. 7, 201 (1983).
