Educational Tests
Email: [email protected]
Phone: +(966) 3-860-1930
Fax: +(966) 3-860-2174
Abstract
The growing use of computers in large-scale testing at schools has fueled innovations in test construction and analysis. As the measurement accuracy
of a test depends on the quality of the items it includes, item selection procedures play a central
role in this process. Mathematical programming and item response theory (IRT) are often
used in automating this task. However, when the item bank is very large, the number of item
combinations increases exponentially and item selection becomes more tedious. To alleviate this
problem, several attempts were made to utilize heuristic search and machine learning
approaches, including neural networks. This paper proposes a novel approach that uses abductive
network modeling to automatically identify the most informative subset of test items that can be
used to effectively assess the examinees without seriously degrading accuracy. Abductive
machine learning automatically selects only effective model inputs and builds an optimal
network model of polynomial functional nodes that minimizes a predicted squared error
criterion. Using a training dataset of 1500 cases (examinees) and 45 test items, the proposed
approach automatically selected only 12 items which classified an evaluation population of 500
cases with 91% accuracy. Performance is examined for various levels of model complexity and
compared with that of statistical IRT-based techniques. Results indicate that the proposed
approach significantly reduces the number of test items required while maintaining acceptable
test quality.
Keywords: Abductive machine learning, Abductive networks, Neural networks, Optimal test
design, Educational measurements, Item response theory, Test analysis, Test construction.
1. Introduction
There is a growing interest in the use of computers in automating test construction and
analysis, especially for large-scale testing (Buyske, 2005; Stocking, Swanson & Pearlman,
1991). A primary goal of administering an educational test is to locate examinees on the ability
scale and to classify them into categories with acceptable accuracy. This is usually achieved by
observing their response to items included in the test, which are selected from a larger collection
of items in the form of an item bank or pool. One of the early findings in educational measurement is that classification accuracy improves when the test consists of a large number of discriminating items which are neither too easy nor too difficult for the test takers (Berger, 1997). However, increasing the number of items is not cost effective, as it requires more physical resources, e.g. paper, and demands more time from both examiners and examinees. While
test analysis is concerned with item characteristics and how accurate a test is in classifying
examinees, test construction is concerned with selecting items to be included in the test that
ensure accurate assessment using relatively few items. Unfortunately, the process of test
construction and analysis could be quite labor-intensive. As a result, several methods have been
proposed for automating the process based on the item response theory (IRT) (Lord, 1980;
Hambleton & Swaminathan, 1985; Stocking, Swanson & Pearlman, 1991; van der Linden &
Hambleton, 1997). Within the framework of IRT, examinees are described by a single latent ability variable and each item is described by Fisher's information function. The item information function (IIF) provides the test developer with an indication of the measurement precision of a test item. Accordingly, a test can be formed by selecting items on the basis of their information functions. Lord (1980) outlined a procedure for selecting items such that the information
function of the constructed test (sum of the information functions for the individual items it
includes) approximates a target information function to a satisfactory degree. The smaller the
distance between the target information function and the constructed test information function,
the more precise the test is in measuring ability. Although this procedure is conceptually simple,
it becomes impractical to apply as the item bank grows in size. Mathematical programming
provides a more systematic approach for optimal test design. A great amount of research has
been conducted in this area; see for example, (Lord, 1980; Hambleton & Swaminathan, 1985;
Theunissen, 1985; Baker, 1988; van der Linden, 1987; van der Linden & Boekkooi-Timminga,
1989; Adema, 1990a; Adema, 1990b; Adema, Boekkooi-Timminga & van der Linden, 1991;
Fletcher, 2000; van der Linden, 2005). With these approaches, the test construction problem is
modeled as an optimization problem to maximize (or minimize) some objective function while
meeting a number of constraints in the form of test specifications. However, the application of
such approaches is often hindered by the need for a prior estimation of item characteristics.
Moreover, the search for optimal solutions becomes computationally intensive as the size of the
item bank increases. To overcome these limitations, a number of heuristic approaches have been proposed to reduce computation time using Tabu search, simulated annealing, etc. (Adema & van der Linden, 1989;
Adema 1990b; Swanson & Stocking, 1993; Jeng & Shih, 1997; Luecht, 1998; Hwang, Yin &
Yeh, 2006). Recently, artificial neural networks have been successfully used to solve many
complex modeling and optimization problems in several areas of science, engineering, and the
social sciences, and some attempts have been made in the area of educational measurements. Sun
and Chen (1999) used neural networks for constructing educational tests. With their
approach, the test information function is transformed into an energy function which is
minimized using a neural network model. When the energy function stabilizes, the state of the
network represents a solution. Although this approach can be used to effectively solve the
problem, the computing processes are complex. To achieve good results at a faster pace, a greedy
approach similar to the neural network method was later proposed (Sun, 2001).
This paper proposes an alternative approach based on abductive machine learning for
identifying the most informative subset of items that can be used to effectively assess the
examinees without severely degrading measurement accuracy. Abductive machine learning has proved effective in a wide range of modeling applications (Montgomery & Drake, 1991). It builds an optimal network model composed of non-linear functional elements (nodes) organized in layers in a manner that minimizes a predicted squared
error (PSE) criterion (Barron, 1984). Thus, it can represent complex and uncertain relationships
between dependent (output) and independent (input) variables. There are several advantages for
using abductive networks for discovering complex relationships between input and output
variables. Unlike most approaches such as regression and neural networks, the self-organizing abductive algorithm derives the model structure and complexity directly from the training data without requiring the user to specify the network architecture in advance. It has
also been shown that the prediction accuracy of abductive networks can be higher compared to
that of neural networks (Montgomery & Drake, 1991). Furthermore, abductive networks were
found to be faster, easier to use, and to involve fewer parameters (Agarwal, 1999). The iterative
tuning process necessary with regression and neural network approaches is largely reduced with
the abductive approach. Accordingly, an abductive network model can be used effectively as an
estimator for predicting the output of a complex system, a classifier for handling difficult pattern
recognition problems, or a system identifier for determining which inputs are important for
modeling the system (Agarwal, 1999). The approach selects only relevant model inputs and
synthesizes more transparent models that provide greater insights and give better explanations
for the modeled phenomena compared to neural networks, which is an important advantage in
human-related disciplines, e.g. education, medicine, and the environment. The abductive network
approach has been previously used to model and forecast the educational score in school health
surveys (Abdel-Aal & Mangoud, 1996) and in a variety of other areas including weather
forecasting (Abdel-Aal & Elhadidy, 1995), financial modeling (Agarwal, 1999), electric load
forecasting (Abdel-Aal, 2004), drilling tool life prediction (Lee, Liu & Tarng, 1999), electronic
combat (Montgomery, Hess & Hwang, 1990), and fault diagnosis in electrical power transmission networks (Sidhu, Cruder & Huff, 1997).
The rest of this paper is organized as follows: Section 2 briefly describes abductive network
machine learning, highlighting similarities and differences with neural networks. Section 3 gives
an outline of the dataset used in our experiments together with the results of some exploratory
analysis. Section 4 presents the results of using abductive networks to model examinees’ ability
in terms of their response to the test items at various levels of specified model complexity.
Section 5 describes corresponding results obtained using statistical and IRT-based techniques.
Section 6 compares the results obtained from the two approaches and conclusions are made in
Section 7.
2. Abductive Network Machine Learning
AIM (AbTech Corporation, 1990) is a supervised inductive machine-learning tool for automatically synthesizing abductive network models from a database of inputs
and outputs representing a training set of solved examples. As a self-organizing group method of
data handling (GMDH) (Farlow, 1984), the tool can automatically synthesize adequate models
that embody the inherent structure of complex and highly nonlinear systems. The automation of
model synthesis not only lessens the burden on the analyst but also safeguards the model
generated from being influenced by human biases and misjudgments. The GMDH approach is a formalized paradigm for iterated (multi-phase) polynomial regression capable of producing a high-degree polynomial model in effective predictors. The process is evolutionary in nature, using initially simple (myopic) regression relationships to derive more accurate representations
in the next iteration. To prevent exponential growth and limit model complexity, the algorithm
selects only relationships having good predicting powers within each phase. Iteration is stopped
when the new generation regression equations start to have poorer prediction performance than
those of the previous generation, at which point the model starts to become overspecialized and
therefore unlikely to perform well with new data. The algorithm has three main elements:
representation, selection, and stopping. It applies abduction heuristics for making decisions concerning some or all of these three aspects.
To illustrate these steps for the classical GMDH approach, consider an estimation database of
n_e observations (rows) and m+1 columns for m independent variables (x_1, x_2, ..., x_m) and one dependent variable y. In the first iteration we assume that our predictors are the actual input variables. The initial rough prediction equations are derived by taking each pair of input variables (x_i, x_j; i, j = 1, 2, ..., m) together with the output y and computing the quadratic regression polynomial:

y = A + Bx_i + Cx_j + Dx_i^2 + Ex_j^2 + Fx_ix_j .    (1)

Each of the resulting m(m-1)/2 polynomials is evaluated using data for the pair of x variables used to generate it, thus producing new estimation variables (z_1, z_2, ..., z_m(m-1)/2) which would be expected to describe y better than the original variables. The resulting z variables are screened according to some selection criterion and only those having good predicting power are kept. The
original GMDH algorithm employs an additional and independent selection set of n_s observations for this purpose and uses the regularity selection criterion based on the root mean squared error:

r_k^2 = Σ_{l=1}^{n_s} (y_l − z_kl)^2 / Σ_{l=1}^{n_s} y_l^2 ;  k = 1, 2, ..., m(m−1)/2 .    (2)
Only those polynomials (and associated z variables) that have r_k below a prescribed limit are kept, and the minimum value, r_min, obtained for r_k is also saved. The selected z variables represent a new database for repeating the estimation and selection steps in the next iteration to derive a set of higher-level variables. At each iteration, r_min is compared with its previous value, and the process is continued as long as r_min keeps decreasing; an increasing r_min is an indication of the model becoming overly complex, thus over-fitting the estimation data and performing poorly in predicting the new selection data. Keeping model complexity in check is an important aspect of GMDH-based algorithms, bearing in mind the final objective of constructing the model, i.e., using it with new data previously unseen
during training. The best model for this purpose is that providing the shortest description for the
data available (Barron, 1984). Computationally, the resulting GMDH model can be seen as a
layered network of partial quadratic descriptor polynomials, each layer representing the results of
an iteration.
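To make one iteration concrete, the following sketch (ours, not AIM's internal implementation; the function name and the threshold r_limit are illustrative, and candidate variables are materialized only on the selection set for brevity) fits the quadratic of Eq. (1) to every input pair by least squares and screens the candidates with the regularity criterion of Eq. (2):

```python
import numpy as np
from itertools import combinations

def gmdh_layer(X_est, y_est, X_sel, y_sel, r_limit=0.5):
    """One GMDH iteration: fit Eq. (1) for every input pair on the estimation
    set, score each candidate with Eq. (2) on the selection set, keep survivors."""
    def design(xi, xj):
        # Columns of y = A + B*xi + C*xj + D*xi^2 + E*xj^2 + F*xi*xj
        return np.column_stack([np.ones_like(xi), xi, xj, xi**2, xj**2, xi*xj])

    kept, scores = [], []
    for i, j in combinations(range(X_est.shape[1]), 2):
        coef, *_ = np.linalg.lstsq(design(X_est[:, i], X_est[:, j]), y_est, rcond=None)
        z_sel = design(X_sel[:, i], X_sel[:, j]) @ coef           # candidate on selection set
        r2 = np.sum((y_sel - z_sel) ** 2) / np.sum(y_sel ** 2)    # regularity criterion, Eq. (2)
        if r2 < r_limit:
            kept.append(z_sel)
            scores.append(r2)
    return (np.column_stack(kept) if kept else None), (min(scores) if scores else None)
```

Repeating gmdh_layer on the surviving z variables yields successive layers; iteration stops once the returned minimum criterion value starts to rise, as described above.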
A number of GMDH methods have been proposed which operate on the whole training data
set thus avoiding the use of a dedicated selection set. The adaptive learning network (ALN)
approach, AIM being an example, uses the predicted squared error (PSE) criterion (Barron,
1984) for selection and stopping to avoid model overfitting, thus eliminating the problem of
determining when to stop training in neural networks. The criterion minimizes the expected
squared error that would be obtained when the network is used for predicting new data. AIM expresses this criterion as:

PSE = FSE + CPM (2K/n) σ_p^2 ,    (3)

where FSE is the fitting squared error on the training data, CPM is a complexity penalty multiplier selected by the user, K is the number of model coefficients, n is the number of samples in the training set, and σ_p^2 is a prior estimate for the variance of the error obtained with the
unknown model. This estimate does not depend on the model being evaluated and is usually
taken as half the variance of the dependent variable y (Barron, 1984). As the model becomes
more complex relative to the size of the training set, the second term increases linearly while the
first term decreases. PSE goes through a minimum at the optimum model size that strikes a
balance between accuracy and simplicity (exactness and generality). The user may optionally
control this trade-off using the CPM parameter. Larger values than the default value of 1 lead to
simpler models that are less accurate but may generalize well with previously unseen data, while
lower values produce more complex networks that may overfit the training data and degrade actual prediction performance with new data.
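As a minimal sketch of how Eq. (3) penalizes complexity (the function name is ours; σ_p^2 is taken as half the variance of y, as noted above):

```python
import numpy as np

def predicted_squared_error(y_true, y_fit, n_coef, cpm=1.0):
    """PSE = FSE + CPM * (2K/n) * sigma_p^2, cf. Eq. (3)."""
    n = len(y_true)
    fse = np.mean((y_true - y_fit) ** 2)   # fitting squared error on the training data
    sigma_p2 = 0.5 * np.var(y_true)        # prior estimate of the error variance
    return fse + cpm * (2.0 * n_coef / n) * sigma_p2
```

Sweeping cpm in such a function over candidate models reproduces the accuracy/simplicity trade-off described above.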
AIM builds networks consisting of various types of polynomial functional elements. The
network size, element types, connectivity, and coefficients for the optimum model are
automatically determined using well-proven optimization criteria, thus reducing the need for user
intervention compared to neural networks. This simplifies model development and reduces the
learning/development time and effort. The models take the form of layered feed-forward
abductive networks of functional elements (nodes) (AbTech, 1990), see Fig. 1. Elements in the
first layer operate on various combinations of the independent input variables (x's) and the
element in the final layer produces the predicted output for the dependent variable y. In addition
to the main layers of the network, an input layer of normalizers converts the input variables into an internal representation as Z scores with zero mean and unity variance, and an output unitizer converts the network output back into the original problem space. The version of AIM used in this work supports the following main functional elements:
(i) A white element which consists of a constant plus the linear weighted sum of all outputs of the previous layer, i.e.:

"White" Output = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n ,    (4)

where x_1, x_2, ..., x_n are the inputs to the element and w_0, w_1, ..., w_n are the element weights.
(ii) Single, double, and triple elements which implement a third-degree polynomial expression
with all possible cross-terms for one, two, and three inputs respectively; for example,
"Double" Output = w0 + w1x1 + w2x2 + w3x12 + w4x22 + w5x1x2 + w6x13 + w7x23, (5)
3. The Dataset
In order to evaluate the performance of the proposed approach, we used a dataset from
(Rudner, 2005) which consists of a sample of 2000 cases (examinees) and a 45-item test. It is
assumed that examinees are classified based on a single-ability parameter, θ. Hence, each case in
the dataset gives the response vector and the true ability level for an individual test taker. Table 1
lists the information for the first twenty cases of the dataset, showing the response vector to the
test items and the corresponding true ability parameter for each case. The test items are
numbered as 1, 2, 3, …, 45 according to the column they occupy in the dataset. The column
number is used as an item identification (IID) throughout this paper. Test items are
dichotomously scored: when the test is taken, the examinee's response to each item is encoded as 1 (correct) or 0 (incorrect). It is also assumed that the examinee can skip some
items which are marked x (i.e. missing) in Table 1. Out of the 2000 cases, only two ability values
(0.1%) fall outside the range {-4 to +4}, so practically the ability scale ranges from -4 to +4. The
distribution of examinees in this sample approximately follows a normal distribution over the
ability range -4 to +4, as indicated by the histogram plot in Fig. 2. For the purpose of
experiments reported later in this paper, the total sample population is divided into two
categories (fail and pass) and each category is further divided into two groups (G1 and G2 for the
fail category and G3 and G4 for the pass category), as marked on Fig. 2. Details of the size of
these categories and groups and their boundaries on the ability scale are listed in Table 2.
4. Abductive Network Modeling of Examinees' Ability
Abductive networks were used to model the relationship between the ability level of the
examinees and their response to the 45 test items, through training on a subset of the dataset. To
account numerically for skipped test items in the response vectors, these input items were
assigned 0, while correct responses were represented as +1 and incorrect responses as -1. The
objective of abductive modeling is to utilize the property of automatic selection of effective input
variables to identify the optimum subset of test items that explain the ability outcome. To verify
the adequacy of the resulting model, performance of the model in predicting the ability level was
evaluated on an evaluation subset not seen previously during training. Two modeling
experiments were performed which are described in the two subsections below.
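This preprocessing amounts to a simple recoding of the response matrix followed by a random split; a minimal sketch (function names, random seed, and array layout are ours):

```python
import numpy as np

def encode_responses(raw):
    """Recode a character response matrix: '1' (correct) -> +1,
    '0' (incorrect) -> -1, 'x' (skipped) -> 0."""
    coding = {'1': 1.0, '0': -1.0, 'x': 0.0}
    return np.vectorize(coding.get)(raw)

def train_eval_split(X, y, n_train=1500, seed=0):
    """Randomly split the 2000 cases into training and evaluation subsets."""
    idx = np.random.default_rng(seed).permutation(len(y))
    return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]
```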
4.1. Pass/Fail Classification
Abductive networks were used to model a two-level outcome for the examinees' ability as a
function of relevant input test items. Ability values in the range {-4.1456 to +0.0055} were
assigned an output level 0 (fail category) while values in the range {+0.0075 to +4.0583} were
assigned an output level 1 (pass category). Referring to Table 2, the first category consists of
groups G1 and G2 and the second category consists of groups G3 and G4, with each category
comprising 1000 cases. The overall set of 2000 cases was then randomly split into two subsets:
1500 cases used for training and 500 cases for evaluation. Responses for all the 45 test items
were enabled as inputs to the model. Table 3 shows abductive model structures synthesized at
various levels of model complexity as indicated by the CPM parameter specified prior to
training. The variable number indicated at a model input in Table 3, e.g. Var_i, corresponds to the
IID of the test item selected as input to the model during model synthesis. Var_46 is the binary
(pass/fail) ability output. Lower CPM values give more complex models. The same model
structure and selected model inputs were preserved over the CPM range of 0.2 to 2.0, which is a
sign of model robustness. All these models select the same 12 inputs (test items) out of the 45
inputs available, thus achieving about 73.3% dimensionality reduction for the modeled problem.
The selected model inputs correspond to test items having IIDs 3, 10, 17, 19, 23, 25, 27, 31, 36,
41, 43, and 45. Preserving the same subset of inputs over a decade of variations in the CPM
value indicates the importance of the selected inputs to the modeling process. At CPM = 5, a
slightly simpler model is synthesized which uses only 11 inputs, namely 3, 6, 7, 15, 17, 25, 27,
31, 36, 41, and 45. Approximately 73% of these inputs are included in the previous subset of 12
inputs. The table also lists the percentage classification error for each model on both the training
and evaluation datasets. As model complexity increases (lower CPM values), the model fits the
training data more closely and the classification performance on the training dataset improves.
However, the possibility of overfitting increases, which degrades performance on the external
evaluation set. Fig. 3 plots the above two model performance indicators versus the CPM value.
Best classification performance on the evaluation set is obtained using the optimum model with
CPM = 0.5 which gives a classification error of 9.4%. Table 4(a) shows the resulting confusion
matrix and Table 4(b) lists the parameters characterizing the classification performance of this
optimum model on the evaluation set, including classification accuracy, sensitivity, specificity,
positive predictive value, and negative predictive value. Throughout the above analysis, “Pass” is
considered as the positive outcome. The results indicate a minimum value of approximately 90% for all five of these performance parameters.
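These parameters follow directly from the confusion matrix of Table 4(a); a sketch of the arithmetic, with "Pass" as the positive outcome:

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, PPV and NPV from a 2x2 confusion matrix."""
    return {
        'accuracy':    (tp + tn) / (tp + fn + fp + tn),
        'sensitivity': tp / (tp + fn),   # correctly passed / all actual passes
        'specificity': tn / (tn + fp),   # correctly failed / all actual fails
        'ppv':         tp / (tp + fp),
        'npv':         tn / (tn + fn),
    }

# Counts from Table 4(a): 234 true passes, 20 false fails, 27 false passes, 219 true fails
print(classification_metrics(234, 20, 27, 219))
```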
4.2. Finer Classification within the Fail and Pass Categories
Abductive models were also developed to further discriminate between examinees in each of
the fail and pass categories based on their ability level. Referring to Table 2, the 1000 cases in
the fail category were split into two groups G1 (100 cases) and G2 (900 cases) corresponding to
two binary levels for the ability. The 1000 cases were then randomly split into two subsets: 750
for training and 250 for evaluation. The G1/G2 model was synthesized using the training subset
and evaluated on the evaluation subset. Similarly, the G3/G4 model was developed for the pass
category. Table 5 shows the optimum abductive network structures that minimize the
classification error on the evaluation subsets for the G1/G2 and the G3/G4 models. The first
model selects 10 input items, {2, 4, 5, 8, 9, 12, 29, 34, 39, 43}, while the second model selects 11
items, {8, 12, 15, 18, 19, 20, 21, 27, 32, 33, 38}. Only 2 items are common between the two
subsets, which demonstrates the ability of abductive learning to successfully select different
subsets of test items that achieve different objectives. Classification accuracy on the evaluation
set is 90% and 93% for the G1/G2 and the G3/G4 models, respectively.
5. IRT-Based Analysis
Following the three parameter logistic model (3PL) (Lord, 1980), each dichotomously scored
test item is characterized by three parameters: the discrimination power parameter, a, the item difficulty parameter, b, and the guessing parameter, c. With this model, the probability that a test
taker with ability θ correctly answers an item with parameters (a, b, c) is given by (Lord, 1980):
P(θ) = c + (1 − c) / (1 + e^(−a(θ−b))) ,    (6)
where a ∈ (0, ∞), b ∈ (−∞, ∞) and c ∈ (0, 1). Using the empirical dataset described in Section 3,
individual test items were calibrated using Newton-Raphson maximum likelihood estimation as
outlined in (Lord, 1980; Rudner, 2005). Table 6 lists the actual values for the a, b, and c
parameters for each test item. We carried out the calibration for test items given the response
patterns and true abilities (method a). The estimated values for the three parameters as well as
their standard errors, SE_a, SE_b, SE_c, for each test item are shown in columns (a) of Table 6.
The table also shows the number of cases, N, used for calibrating each item. We have also
estimated the examinees’ abilities given their response patterns and the item parameters
calculated above. Examinees were then classified according to the estimated ability as pass or
fail by setting the threshold value to 0. The total percentage classification error (passing a failed
examinee or failing a passed examinee) was found to be 6.15%, with the false fail rate (failing a
passed examinee) being 2.8% and the false pass rate (passing a failed examinee) being 3.35%.
We also estimated the ability parameter, θ, and item parameters given only the response patterns
(method b). The item parameters estimated in this way together with their standard errors are
shown in columns (b) of Table 6. As a result, the total classification error increased to 6.45%
while the false fail rate dropped to 2.6%. Table 7 lists the true and estimated ability parameter, θ,
and its standard error, SE, for the first twenty cases in the dataset, using item parameters and responses (method a) and using responses only (method b).
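For reference, Eq. (6) and the standard 3PL item information function (Lord, 1980) can be sketched as follows (function names are ours):

```python
import numpy as np

def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model, Eq. (6)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Standard 3PL item information function (Lord, 1980)."""
    p = p3pl(theta, a, b, c)
    return a**2 * ((p - c) / (1.0 - c))**2 * (1.0 - p) / p
```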
To examine the correlation between the actual and estimated parameters of the test items, we
plot the scatter diagram and give the computed correlation coefficients in Fig. 4 and Fig. 5 for
methods a and b, respectively. The results show that parameters estimated from response vectors
and true abilities (method a) are more correlated to the actual parameters than those estimated
from the response vectors alone. Similarly Fig. 6 shows the scatter diagrams and correlation
coefficients for estimated abilities using the two methods. Again, we found that the abilities
estimated using response vectors and estimated item parameters (method a) are slightly more
correlated to the true abilities than those estimated from response vectors alone. The figure also
shows the correlation between the two estimated ability parameters. We have also observed that estimating the item parameters and the ability parameter from just the response patterns converged much more slowly than estimating item parameters from θ and response patterns, or estimating θ from the item parameters and response patterns.
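A simplified sketch of Newton-Raphson ability estimation given calibrated item parameters (derivatives are taken by central differences rather than the analytic 3PL forms, and no safeguards against divergence are included; p3pl is from the sketch above):

```python
import numpy as np

def loglik(theta, u, a, b, c):
    """Log-likelihood of a response vector u (1/0, NaN = skipped) at ability theta."""
    p = p3pl(theta, a, b, c)
    m = ~np.isnan(u)                      # skipped items carry no information
    return np.sum(u[m] * np.log(p[m]) + (1 - u[m]) * np.log(1 - p[m]))

def estimate_theta(u, a, b, c, theta=0.0, h=1e-4, tol=1e-6, max_iter=50):
    """Newton-Raphson maximization of the log-likelihood in theta."""
    for _ in range(max_iter):
        g = (loglik(theta + h, u, a, b, c) - loglik(theta - h, u, a, b, c)) / (2 * h)
        gg = (loglik(theta + h, u, a, b, c) - 2 * loglik(theta, u, a, b, c)
              + loglik(theta - h, u, a, b, c)) / h**2
        step = g / gg
        theta -= step
        if abs(step) < tol:
            break
    return theta
```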
6. Comparison of Results
The purpose of this paper is to investigate the potential use of abductive machine learning in
test development as an alternative to conventional methods, e.g. IRT-based analysis. This section
compares results obtained using the two approaches. As described in Section 5, the best results for pass/fail classification using IRT-based analysis with all 45 test items give a classification error
of 6.15%. The corresponding optimum abductive model synthesized with CPM = 0.5 gives
classification errors of 7.8% and 9.4% on the training and evaluation sets, respectively, and uses
only 12 test items, see Table 3. The significant reduction in the number of test items required for
the test may justify the slight degradation in classification accuracy. We also examined the
properties of test items selected as inputs during the synthesis of the abductive network model
for fail/pass classification to verify whether they represent an adequate selection according to IRT
criteria. Table 8 lists all 45 test items and the results of sorting them in a number of ways.
Columns 2, 3, and 4 in the table show the test items sorted according to the actual values of their
a, b, and c parameters, respectively, with items having the smallest values listed at the top.
Columns 5 to 11 show the items sorted according to the value of the item information function
computed at seven ability levels corresponding to θ = -1.5, -1.0, -0.5, 0.0, +0.5, +1.0, and +1.5.
Throughout the table, cells containing test items selected by the optimum pass/fail abductive
model with CPM = 0.5 are marked by a black background. The table indicates that items selected
by the abductive network approach are concentrated around the middle of the difficulty
parameter, b, thus satisfying the criterion of being neither too easy nor too difficult for the test takers.
Most of such items also have high values for the information function at θ = 0, which is the
threshold for pass/fail classification. As we go away from this ability cutoff in either direction,
the selected items become more scattered. This shows that the abductive learning approach
selects test items that are effective discriminators with a high information content at the required ability threshold.
To examine the effectiveness of the abductive network approach in identifying the most
informative subset of test items, we compared the classification performance of three pass/fail
tests, one composed of all 45 test items, another composed of the 12 items selected as inputs by
the optimum abductive network model with CPM = 0.5, and the third composed of a randomly
selected subset of 12 items {4, 8, 10, 12, 13, 22, 23, 27, 28, 31, 37, 43}. Results are plotted in
Fig. 7 for the overall classification error, the false pass rate, and the false fail rate. They indicate
that the abductive selection is significantly superior to the random selection, particularly for the
overall classification error and the false fail rate. It is interesting to note that the abductive false fail rate is
slightly lower than that achieved using the full set of test items. We have also examined the test
information function, defined as the sum of item information functions for items included in the
test, for the three tests described above. Fig. 8 plots the results over the full ability range. Sharper
peaks for the information function lead to more precise classification. Although the inclusion of
all test items results in a higher peak value, this peak is slightly offset away from the zero ability
cut-off for the pass/fail test. With the abductive item selection, however, the peak coincides more accurately with the cutoff point. The test information peak for the randomly
selected subset is lower than the peak for the abductive selection, in spite of the fact that the two
subsets have the same size. The former peak is also shallower and further offset from the center.
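The test information functions compared in Fig. 8 are simply sums of item information over each candidate subset; a sketch (item_information is from the earlier sketch, and the arrays a, b, c would hold the Table 6 parameters):

```python
import numpy as np

def test_information(theta_grid, item_ids, a, b, c):
    """TIF = sum of item information functions over the test items (IIDs are 1-based)."""
    return sum(item_information(theta_grid, a[i - 1], b[i - 1], c[i - 1]) for i in item_ids)

theta_grid = np.linspace(-4, 4, 161)
abd_items = (3, 10, 17, 19, 23, 25, 27, 31, 36, 41, 43, 45)   # optimum abductive selection
# tif = test_information(theta_grid, abd_items, a, b, c)
# peak location on the ability scale: theta_grid[np.argmax(tif)]
```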
In another experiment, we formed two tests using the abductive item selection made for the
two abductive models that perform G1/G2 group discrimination within the fail category and G3/G4 group discrimination within the pass category; see Fig. 2 and Table 2. The model
structures, selected inputs, and classification performance for the two abductive models were
introduced in Section 4.2 and given in Table 5. The test information functions for the two tests
are shown in Fig. 9. The peak for the G3/G4 model is sharper and more centered around the
nominal G3/G4 ability cut-off level of 2, compared to the peak for the G1/G2 model which is
shallower and is significantly offset from the nominal G1/G2 ability cut-off level of -2. This
explains the relatively poorer classification performance by the G1/G2 model as indicated in
Table 5, where the percentage classification error for that model on the evaluation set is shown to be approximately 10%.
7. Conclusions
In this paper, we have demonstrated the utility of abductive machine learning as an alternative
tool for educational test design and analysis. Performance of the proposed approach was
examined and compared to classical statistical IRT-based techniques using a dataset of 2000
cases and 45 test items with various levels of model complexities and at various ability
thresholds. Results indicate that abductive network models can classify examinees with a
reasonable classification accuracy. The learning algorithm automatically identifies a concise and
effective subset of test items with high discriminatory power, which can be used to form the test.
Therefore, large-scale assessment systems can benefit from using abductive networks in test
development. In general, results show that abductive networks can improve the educational
measurement by reducing the number of items included in the test without severely degrading
measurement precision. We have also demonstrated that multiple tests for finer grade
classification can be constructed by controlling the ability threshold value. Several areas could
benefit from the proposed approach including college placement testing, medical licensing, job
applicant screening, and academic achievement testing. This paper lays a new research direction
in educational measurement. Future work will attempt to further improve the prediction accuracy, e.g. using network ensembles, and extend the modeling approach to multidimensional ability models.
Acknowledgment
The authors are grateful to King Fahd University of Petroleum and Minerals (KFUPM) for supporting this work.
References
Abdel-Aal, R. E., and Elhadidy, M. A. (1995). Modeling and Forecasting the Maximum
Temperature using Abductive Machine Learning. Weather and Forecasting, 10:310-25.
Abdel-Aal, R. E., and Mangoud, A. M. (1996). Abductive Machine Learning for Modeling and
Predicting the Educational Score in School Health Surveys. Methods of Information in
Medicine, 35(3):265-71.
Abdel-Aal, R. E. (2004). Short Term Hourly Load Forecasting using Abductive Networks. IEEE
Trans. Power Systems, 19:164-73.
AbTech Corporation (1990). AIM User's Manual, Charlottesville, VA.
Adema, J. J., and van der Linden, W. J. (1989). Algorithms for Computerized Test Construction
Using Classical Item Parameters. Journal of Educational Statistics, 14(3): 279-290.
Adema, J. J. (1990a). Models and Algorithms for the Construction of Achievement Tests. Ph.D.
Thesis, Enschede: University of Twente.
Adema, J. J. (1990b). A Revised Simplex Method for Test Construction Problems. Research
Report 90-5. Enschede: Department of Education, University of Twente, Netherlands.
Adema, J. J., Boekkooi-Timminga, E., and van der Linden, W. J. (1991). Achievement Test
Construction Using 0-1 Linear Programming. European Journal of Operational Research,
55, 103-111.
Armstrong, R. D., Jones, D. H., and Wang, Z. (1998). Optimization of Classical Reliability in Test
Construction. Journal of Educational and Behavioral Statistics, 23(1):1-17.
Baker, F. B., Chen, A. S., and Barmish, B. R. (1988). Item Characteristics of Tests Constructed
by Linear Programming. Applied Psychological Measurement, 12:189-199.
Barron, A. R. (1984). Predicted Squared Error: A Criterion for Automatic Model Selection. In
Farlow, S. J. (Ed.), Self-Organizing Methods in Modeling: GMDH Type Algorithms, (pp. 87-
103). Marcel-Dekker, New York.
Berger, M. P. F. (1997). Optimal Designs for Latent Variable Models: A Review. In Rost, J. and
Langeheine, R. (Eds.), Applications of Latent Trait and Latent Class Models in the
Social Sciences (pp. 71-79). Münster: Waxmann.
Boekkooi-Timminga, E. (1989). Models for Computerized Test Construction. The Hague,
Netherlands: Academisch Boeken Centrum.
Boekkooi-Timminga, E. (1990). The Construction of Parallel Tests from IRT-Based Item Banks.
Journal of Educational Statistics, 15(2):129-145.
Buyske, S. (2005). Optimal Design in Educational Testing. In Berger, M. P. F., and Wong, W. K.
(Eds.), Applied Optimal Designs. John Wiley & Sons.
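Farlow, S. J. (Ed.) (1984). Self-Organizing Methods in Modeling: GMDH Type Algorithms. New York: Marcel Dekker.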
Fletcher, R. B. (2000). A Review of Linear Programming and its Application to the Assessment
Tools for Teaching and Learning (asTTle) Projects. Technical Report 5, Project asTTle,
University of Auckland.
Hambleton, R. K., and Swaminathan, H. (1985). Item Response Theory: Principles and
Applications. Kluwer Academic Publishers Group, Netherlands.
Hwang, G.-J., Yin, P.-Y., and Yeh, S. H. (2006). A Tabu Search Approach to Generating Test
Sheets for Multiple Assessment Criteria. IEEE Transactions on Education, 49(1): 88-97.
Jeng, H. L., and Shih, S. G. (1997). A Comparison of Pair-wise and Group Selections of Items
using Simulated Annealing in Automated Construction of Parallel Tests. Psychological
Testing, 44(2):195-210.
Lee, B. Y., Liu, H. S., and Tarng, Y. S. (1999). An Abductive Network for Predicting Tool Life in Drilling. IEEE Transactions on Industry Applications, 35(1):190-195.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. New
Jersey: Lawrence Erlbaum.
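Luecht, R. M. (1998). Computer-Assisted Test Assembly Using Optimization Heuristics. Applied Psychological Measurement, 22(3):224-236.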
Montgomery, G. J., Hess, P., and Hwang, J. S. (1990). Abductive Networks Applied to
Electronic Combat. Proceedings of SPIE - The International Society for Optical Engineering,
1294:454-465.
Montgomery, G. J. and Drake, K. C. (1991). Abductive Reasoning Networks. Neurocomputing,
2(3):97-104.
Rudner, L. M. (2005). PARAM-3PL Calibration Software for the 3 Parameter Logistic IRT
Model. Available: http://edres.org/irt/param
Sidhu, T. S., Cruder, O., and Huff, G. J. (1997). An Abductive Inference Technique for Fault Diagnosis in Electrical Power Transmission Networks. IEEE Transactions on Power Delivery, 12(1):515-522.
Stocking, M. L., Swanson, L., and Pearlman, M. (1991). Automated Item Selection Using Item
Response Theory. Research Report 91-9. Princeton, NJ: Educational Testing Service.
Sun, K. T., and Chen, S. F. (1999). A Study of Applying the Artificial Intelligent Technique to
Select Test Items. Psychological Testing, 46(1):75-88.
Sun, K.-T. (2001). A Greedy Approach to Test Construction Problems. Proceedings of the
National Science Council, ROC-Part D, 11(2):78-87.
Swanson, L., and Stocking, M. L. (1993). A Model and Heuristic for Solving Very Large Item
Selection Problems. Applied Psychological Measurement, 17(2):151-166.
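Theunissen, T. J. J. M. (1985). Binary Programming and Test Design. Psychometrika, 50(4):411-420.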
van der Linden, W. J. (1987). Automated Test Construction using Minimax Programming. In van
der Linden, W. J. (Ed.), IRT-Based Test Construction (pp. 1-16). Enschede, Netherlands:
Department of Education, University of Twente.
van der Linden, W. J., and Boekkooi-Timminga, E. (1989). A Maximum Model for Test Design
with Practical Constraints. Psychometrika, 54:237-247.
van der Linden, W. J., and Hambleton, R. K. (Eds.) (1997). Handbook of Modern Item Response Theory. Springer-Verlag.
van der Linden, W. J. (2005). Linear Models for Optimal Test Design. Springer.
Table 1. First twenty cases in the dataset. Shown are the examinee’s identification (EID),
response pattern and true ability level for each case. (1 = correct, 0 = incorrect, x = skipped).
EID    Response Pattern    True Ability
1 100011010000001000x00010000010001001001011101 -1.5841
2 011001000100000010000101101100100100011000001 -1.689
3 111111111110110111110011111110100111001111100 0.494
4 100000011000x100000x00100000001011111111x0100 -0.981
5 111111101111111111111011111111110111101111111 1.4221
6 0110110011100000110100x1011010100100011101111 0.0353
7 111011101110011111011010111110101111101011101 0.7333
8 01x011101010100110000000111110100111011111100 -0.2385
9 100011100110010111001100001000000010101000000 -0.5911
10 101000101110111101000001000110000110100111000 -0.6697
11 110011101010110111100001010110100111111101101 0.0707
12 011111101110100011000011100110001110101x00111 -0.3552
13 011110101110111111110x11111111111111111111101 1.0341
14 011110001000010001111000010011100110011101100 -0.7209
15 110010101010100001000001010110000010101101010 -1.1992
16 111111001011101111110010110011100111111101111 0.2684
17 01111110111011111100001011101010111011111111x 0.6881
18 010100101110100011010001001010100011001101001 -0.62
19 011000101000100001001010010000100101101100100 -0.2659
20 01111110101011111111001111111x100111001111110 0.9098
Table 3. Structure and performance of abductive network models synthesized for pass/fail
classification at various levels of model complexity. Training on 1500 cases and evaluation on
500 cases.
CPM    Model    Selected Test Items    % Classification Error on:
                                       Training Set    Evaluation Set
Table 4. Confusion matrix (a) and parameters characterizing classification performance (b) for
the optimum pass/fail abductive model synthesized with CPM = 0.5 on the evaluation dataset of
500 cases.
(a)
                        Predicted
                        1 (261)    0 (239)
Actual    1 (254)       234        20
          0 (246)       27         219

(b)
Classification    Sensitivity, %    Specificity, %    Positive Predictive    Negative Predictive
Accuracy, %                                           Value, %               Value, %
90.6              92.1              89.0              89.7                   91.6
Table 5. Structure and performance for the two abductive models performing further
classification of the fail and pass examinees’ categories into two groups each: {G1, G2} and
{G3, G4}, respectively.
Model    Objective                                            CPM    Selected Test Items                           % Classification Error on:
                                                                                                                   Training Set    Evaluation Set
G1/G2    Classify the fail category into groups G1 and G2     0.5    2, 4, 5, 8, 9, 12, 29, 34, 39, 43             5.3             10
G3/G4    Classify the pass category into groups G3 and G4     –      8, 12, 15, 18, 19, 20, 21, 27, 32, 33, 38     –               7
Table 6. Actual and estimated parameters for each test item. Estimated parameters are calculated
by two methods: (a) using abilities and responses, and (b) using responses only. IID is the item
identification number. N is the number of cases used in item estimation and SE is the standard
error.
Table 7. True abilities and estimated abilities, θ, for the first twenty cases in the dataset
calculated by two methods: (a) using item parameters and responses, (b) using responses only. N
is the number of items used in estimating the ability and SE is the standard error in θ.
Table 8. Test items sorted in ascending order by item parameters and item information function
(IIF) at several ability levels at and around the pass/fail cut-off of 0. Cells with black background
indicate items selected by the optimum pass/fail abductive network model with CPM = 0.5, see
Table 3.
       Items Sorted By            Items Sorted By IIF at θ =
IID    a      b      c            -1.5    -1    -0.5    0    0.5    1    1.5
1 16 9 32 38 38 38 38 13 39 39
2 13 29 12 33 8 8 8 8 13 9
3 4 39 8 8 33 33 16 16 9 29
4 1 13 29 32 32 12 13 38 29 13
5 21 5 11 12 12 32 21 9 16 34
6 8 18 21 30 30 30 33 39 5 5
7 5 35 24 24 22 21 12 29 34 18
8 42 34 43 19 21 22 32 5 35 35
9 40 11 17 28 28 16 1 4 18 16
10 10 43 14 22 19 1 30 21 42 42
11 22 42 2 21 24 28 22 1 8 11
12 2 36 36 20 1 13 4 42 4 23
13 17 2 44 15 20 15 10 35 40 43
14 7 40 18 1 15 20 28 10 36 36
15 37 17 41 25 16 10 5 40 11 40
16 36 23 45 41 10 4 15 34 43 2
17 35 3 35 10 25 19 9 18 2 17
18 15 31 40 44 41 24 42 22 17 31
19 14 37 22 16 13 44 40 36 10 3
20 6 27 20 26 4 7 39 2 23 4
21 30 6 30 45 44 25 20 17 37 37
22 28 14 9 4 7 5 29 12 1 27
23 18 7 28 7 26 42 7 30 21 10
24 44 16 33 6 45 41 35 37 27 6
25 27 4 1 14 37 40 37 43 3 7
26 43 26 4 37 6 37 2 11 7 14
27 11 45 13 13 14 26 17 33 6 26
28 26 10 42 27 27 6 36 7 31 1
29 20 25 7 31 42 14 18 15 14 8
30 34 41 31 3 40 2 34 32 22 45
31 9 44 15 40 5 27 44 28 26 21
32 12 19 3 42 2 17 6 27 15 25
33 45 15 6 23 17 45 27 6 38 44
34 29 24 26 5 36 36 14 14 30 15
35 32 20 5 2 3 35 19 23 44 22
36 41 1 23 17 35 9 43 26 45 41
37 33 28 27 36 31 18 26 3 25 20
38 3 22 16 35 43 34 11 20 28 28
39 25 30 38 43 34 43 25 44 12 30
40 39 33 34 34 18 39 24 25 20 19
41 23 21 39 11 23 29 41 31 33 24
42 38 12 37 18 11 11 45 45 41 12
43 19 32 19 9 9 3 3 41 32 33
44 24 38 25 39 39 23 23 19 19 38
45 31 8 10 29 29 31 31 24 24 32
Fig. 1. A typical AIM abductive network model showing various types of functional elements.
Fig. 2. Frequency distribution of examinees in the dataset over the ability continuum. See Table
2 for details of the categories and groups.
Fig. 3. Classification error versus the CPM parameter for the abductive models in Table 3 on
both the training and evaluation sets. Lower CPM values correspond to greater model
complexity.
Fig. 4. Correlation between actual and estimated values for the parameters a, b and c. Estimation is
carried out using true abilities and response patterns (method a). Correlation coefficients are 0.930,
0.998 and 0.831 respectively.
Fig. 5. Correlation between actual and estimated values for the parameters a, b and c. Estimation is
carried out using response patterns only (method b). Correlation coefficients are 0.918, 0.993 and
0.496 respectively.
Fig. 6. (a) Correlation between true ability and estimated ability using original item parameters and response patterns; (b) correlation between true ability and estimated ability using response patterns only; (c) correlation between the two estimated ability parameters. The correlation coefficients are 0.960, 0.958, and 0.991, respectively.
Fig. 7. Classification errors for three pass/fail tests. ALL: test composed of all 45 test items;
ABD: test composed of the 12 items selected as inputs for the optimum abductive network model
with CPM = 0.5; RND: test composed of a randomly selected subset of 12 items {4, 8, 10, 12,
13, 22, 23, 27, 28, 31, 37, 43}.
Fig. 8. Test information functions for the three tests of Fig. 7. ALL: test composed of all 45 test
items; ABD: test composed of the 12 items selected as inputs for the optimum abductive network
model with CPM = 0.5; RND: test composed of a randomly selected subset of 12 items {4, 8, 10,
12, 13, 22, 23, 27, 28, 31, 37, 43}.
Fig. 9. Test information functions for the test items selected by abductive machine learning for
the two models that perform G1/G2 and G3/G4 classification within the fail and pass groups,
respectively. See Fig. 2, Table 2, and Table 5.