Item Difficulty and Response Time Prediction with Large Language Models: An Empirical Analysis of USMLE Items

Okan Bulut, Guher Gorgun, Bin Tan
Measurement, Evaluation, and Data Science
Faculty of Education, University of Alberta, Canada
{bulut, gorgun, btan4}@ualberta.ca

Abstract

This paper summarizes our methodology and results for the BEA 2024 Shared Task. This competition focused on predicting item difficulty and response time for retired multiple-choice items from the United States Medical Licensing Examination® (USMLE®). We extracted linguistic features from the item stem and response options using multiple methods, including the BiomedBERT model, FastText embeddings, and Coh-Metrix. The extracted features were combined with additional features available in the item metadata (e.g., item type) to predict item difficulty and average response time. The results showed that the BiomedBERT model was the most effective in predicting item difficulty, while the fine-tuned model based on FastText word embeddings was the best model for predicting response time.

1 Introduction

In standardized exams, the examination of item characteristics is crucial for ensuring the fairness and validity of test results. For example, the difficulty of an item pertains to the likelihood of an examinee answering the item correctly. Incorporating a broad range of item difficulty levels in a standardized exam can help reduce measurement error and thereby improve the accuracy of the measurement process (Kubiszyn and Borich, 2024). In addition, while response time is often linked to item difficulty (i.e., more difficult items require more time to answer) (Yang et al., 2002), this variable itself can also offer new insights into examinees' test completion processes, such as their testing engagement and cognitive processes, thereby supporting the validity of test results. Furthermore, understanding item characteristics can also be advantageous for modern test administration methods, including applications in automated item assembly, computerized adaptive testing, and personalized assessments (Baylari and Montazer, 2009; Wauters et al., 2012).

The difficulty of items and the average response time required to answer them are typically estimated based on empirical data collected during test pretesting. However, pretesting and obtaining robust results often require a large sample of examinees, which can incur substantial test administration costs. As a result, researchers have explored various methods to predict item characteristics without an actual test administration. For instance, researchers have sought estimates of item difficulty from domain experts and test development professionals. However, this approach has not consistently produced satisfactory or reliable estimates (Bejar, 1983; Attali et al., 2014; Wauters et al., 2012; Impara and Plake, 1998). Another line of research seeks to predict item characteristics based only on the item texts, such as the passages in source-based items, the item stem, and the response options (Yaneva et al., 2019; Hsu et al., 2018). This approach employs text-mining techniques to extract surface features (e.g., the number of words in the texts) and complex features (e.g., semantic similarities of sentences) from item texts and makes predictions using advanced statistical models.

Building on the second line of research, which predicts item characteristics from item texts, the National Board of Medical Examiners (NBME) initiated the BEA 2024 Shared Task (https://sig-edu.org/sharedtask/2024) for automated prediction of item difficulty and item response time. The released dataset contained 667 previously used and now retired items from the United States Medical Licensing Examination® (USMLE®). The USMLE is a series of high-stakes examinations (also known as Steps; https://www.usmle.org/step-exams) used to support medical licensure decisions in the United States. The items from USMLE Steps 1, 2 Clinical Knowledge (CK), and 3 focus on a wide range of topics relevant to the practice of medicine.

In the BEA 2024 Shared Task, research teams were invited to utilize natural language processing (NLP) methods for extracting linguistic features from the items and using them to predict the difficulty and response time of the items. Our team employed state-of-the-art large language models (LLMs) to extract the features and build predictive models for item difficulty and response time. This paper documents the methods and results of our best-performing models for predicting item difficulty and response time separately.

2 Related work

The interest and effort in predicting item difficulty based on item texts date back decades in the measurement literature. Early work in item difficulty prediction primarily focused on identifying how item difficulty is influenced by a set of readily available, easily extracted, or manually coded item-level features. For example, Drum et al. (1981) predicted the difficulty of 210 reading comprehension items using various surface structure variables and word frequency measures for the text, such as the number of words, content words, or content-function words. Freedle and Kostin (1993) predicted the difficulty of 213 reading comprehension items using 12 categories of sentential and discourse variables, such as vocabulary, length of texts, and syntactic structures (e.g., the number of negations). Perkins et al. (1995) employed artificial neural networks to predict the item difficulty of 29 items in a reading comprehension test. They coded the items to extract three types of features: text structure (e.g., the number of words, lines, paragraphs, sentences, and content words), propositional analysis of passages and stems (e.g., the number of arguments, modifiers, and predicates), and cognitive processes (e.g., identify, recognize, verify, infer, generalize, or problem-solve).

Research focused on the prediction of item characteristics such as difficulty and response time has been significantly influenced by the availability and application of emerging techniques in NLP and machine learning (AlKhuzaey et al., 2023). For example, Yaneva et al. (2019) employed NLP methods to extract syntactic features to predict item difficulty, and these features were identified as crucial predictors. Another application of NLP methods involves assessing the linguistic complexity or readability of item texts to predict item difficulty. Benedetto et al. (2020a), for instance, calculated readability indices for item texts and combined them with other features to predict item difficulty. However, the readability indices did not perform well as predictors of item difficulty, a finding consistent with Susanti et al. (2017), who noted that readability indices were among the least important predictors of item difficulty.

NLP methods can also be used to extract Term Frequency-Inverse Document Frequency (TF-IDF) features. TF-IDF measures the frequency of words or word sequences in a document and adjusts this count based on their frequency across a collection of documents. This approach emphasizes the importance of specific words to a particular document, with higher values indicating greater potential importance (Salton, 1983). In a relatively recent study predicting item difficulty for newly generated multiple-choice questions, Benedetto et al. (2020b) extracted TF-IDF features and achieved a root mean square error of 0.753.
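
To make this representation concrete, the following minimal sketch (not part of the shared task pipeline; it assumes scikit-learn and two placeholder item texts) shows how TF-IDF features are typically extracted before being passed to a regression model:

# Minimal TF-IDF sketch with scikit-learn; the item texts are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

item_texts = [
    "A 45-year-old man presents with chest pain ...",
    "A 30-year-old woman reports persistent cough ...",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)  # unigrams and bigrams
tfidf_matrix = vectorizer.fit_transform(item_texts)         # sparse (n_items x n_terms) matrix
print(tfidf_matrix.shape)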

An important application of NLP techniques is the extraction of semantic features from item texts. Word embedding is a technique that converts texts into numerical values in a vector space, capturing the meanings of words across different dimensions (Mikolov et al., 2013). Pre-trained NLP models such as Word2Vec and GloVe allow researchers to extract word embedding features from item texts (e.g., Firoozi et al., 2022). For example, Hsu et al. (2018) transformed item texts into semantic vectors and then used cosine similarity to measure the semantic similarity between different pairs of items. Additionally, Yaneva et al. (2019) extracted word embedding features from multiple-choice items in high-stakes medical exams. Among the linguistic and psycholinguistic features used to predict item difficulty, they found that the word embedding features contributed most to the predictive power.

More recently, a significant breakthrough in the NLP field has been the development of LLMs such as BERT (Devlin et al., 2018) and its variants, which were trained using different mechanisms or training datasets. For example, Zhou and Tao (2020) utilized a BERT-variant model to predict the difficulty of programming problems. Their results showed that, compared with BERT, DistilBERT, a smaller version of the BERT base model, was the best-performing model when the only data available for fine-tuning was the text of the items. Benedetto et al. (2021) also compared the performance of BERT and DistilBERT in predicting the difficulty of multiple-choice questions and found that the BERT-based models significantly outperformed the two baseline models.

Unlike the prediction of item difficulty, the prediction of response time has not been widely investigated in the literature. This is mainly due to the limited availability of response time data. However, with the increasing use of digital assessments, such as computer-based and computerized adaptive tests, in operational testing, the collection of response data has become easier, which has motivated researchers to employ predictive models to estimate the average response time required to solve the items (e.g., Baldwin et al., 2021; Hecht et al., 2017; Yaneva et al., 2019).

3 Methodology

3.1 Dataset

As mentioned earlier, this study utilized an empirical dataset released by NBME, which included 667 multiple-choice items previously administered in the USMLE series. Due to the requirements of the BEA 2024 Shared Task, the data were released in two stages. Initially, 466 multiple-choice items were provided for extracting linguistic features from the items and building predictive models based on the extracted features. For each item, the dataset encompasses the source text (typically a clinical case followed by a question) and the text for each response option. The number of response options varies across questions and can include up to 10 options, each represented in a separate column. When the number of response options was less than 10, the remaining columns were left empty.

Additionally, the dataset contained metadata with four additional variables: item type (text-only items versus items containing pictures), exam step (Steps 1, 2, or 3 in the USMLE series), item difficulty, and average response time. Subsequently, the predictive models trained in the first stage were applied to make predictions for the remaining 201 items released in the second stage, which served as the testing set for evaluating the performance of the predictive models for item difficulty and response time. The structure of the second dataset mirrors that of the first, with the exception that the item difficulty and response time variables were not immediately available. These variables were released after the submission deadline for the BEA 2024 Shared Task, allowing for the identification of the best-performing trials among the participating teams.

3.2 Our Best Model for Difficulty Prediction

Here, we describe the details of our best-performing model for predicting item difficulty (RMSE = .318), which performed slightly worse than a baseline dummy regressor (RMSE = .311) and ranked 20th out of 43 submissions on the difficulty prediction leaderboard.

3.2.1 Feature Extraction

We extracted linguistic features from the item stems and response alternatives (i.e., the answer key and the incorrect response options) by leveraging both pre-trained large language models and more interpretable text representations such as connectivity, cohesion, and text length. We started the feature extraction process by concatenating the stem, key, and alternatives of each item in a single data frame column and then separated each item into an individual data file to extract Coh-Metrix features (McNamara et al., 2014; Graesser et al., 2011). Concatenating the item stems and alternatives served two purposes: (1) to adequately represent item length in terms of the stem and alternatives, and (2) to control for the differential number of alternatives that each item includes. Coh-Metrix includes 108 features and analyzes a text on multiple measures of language and discourse (Graesser et al., 2011).
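
This preprocessing step can be illustrated with a short sketch. The column names (stem, key, option_1 ... option_10) and file layout below are hypothetical placeholders; we only assume that the released items are loaded into a pandas data frame and that Coh-Metrix reads one plain-text file per item.

# Sketch: concatenate stem, key, and alternatives into one text per item and write each
# item to its own file for Coh-Metrix. Column names are hypothetical placeholders.
import os
import pandas as pd

items = pd.read_csv("items.csv")  # hypothetical file with the released item texts
option_cols = [f"option_{i}" for i in range(1, 11)]

items["full_text"] = items[["stem", "key"] + option_cols].fillna("").agg(" ".join, axis=1)

os.makedirs("cohmetrix_input", exist_ok=True)
for idx, text in items["full_text"].items():
    with open(f"cohmetrix_input/item_{idx}.txt", "w", encoding="utf-8") as f:
        f.write(text)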

Coh-Metrix focuses on six theoretical levels of text representation: words, syntax, the explicit textbase, the referential situation model, the discourse genre and rhetorical structure, and the pragmatic communication level (Graesser et al., 2014). It generates indices of text, including paragraph count, sentence count, word count, narrativity, syntactic simplicity, referential cohesion, deep cohesion, noun overlap, stem overlap, latent semantic analysis, lexical diversity, syntactic complexity, syntactic pattern density, and readability. We removed four features from the Coh-Metrix indices due to no variability: paragraph count (i.e., the number of paragraphs), the standard deviation of paragraph length, the mean Latent Semantic Analysis (LSA) overlap in adjacent paragraphs, and the standard deviation of the LSA overlap in adjacent paragraphs.

In the next step, we utilized the BiomedBERT model (Gu et al., 2020) to extract new features. This model, which was previously named PubMedBERT, is a pretrained LLM based on abstracts from PubMed and full-text articles from PubMed Central. We chose this particular model because it is known to achieve state-of-the-art performance on many biomedical NLP tasks. Using BiomedBERT, we obtained sentence embeddings for the item stems and alternatives and then computed the cosine similarity between the stem embeddings and the alternative embeddings. Cosine similarity, which is commonly used to quantify the degree of similarity between two sets of information, was computed as the cosine of the angle between the embedding vectors of the item stem and the alternatives. As the cosine similarity, ranging between 0 and 1, gets closer to 1, it indicates greater resemblance between the embedding vectors obtained from the item stem and the alternatives.
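
A minimal sketch of this step is shown below. It assumes the Hugging Face transformers library, uses one published checkpoint identifier for BiomedBERT/PubMedBERT, and mean-pools the last hidden state to obtain sentence embeddings (the pooling strategy is our assumption, as the paper does not specify it); the texts are placeholders standing in for a real item.

# Sketch: BiomedBERT sentence embeddings and stem-option cosine similarity.
# Mean pooling over the last hidden state is an assumption; the texts are placeholders.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return last_hidden.mean(dim=1).squeeze(0)            # mean-pooled sentence embedding

stem_vec = embed("A 45-year-old man presents with chest pain ...")  # item stem (placeholder)
option_vec = embed("Myocardial infarction")                         # one alternative (placeholder)

cos_sim = torch.nn.functional.cosine_similarity(stem_vec, option_vec, dim=0)
print(float(cos_sim))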

In the final step, we also extracted word embeddings for the concatenated text of stems and alternatives by tokenizing the text with the BiomedBERT model (Gu et al., 2020). BiomedBERT produces 768-dimensional embeddings with a maximum input length of 512 tokens, and we extracted the last hidden layer of embeddings. We then created a new data frame composed of the three sets of extracted features (i.e., the Coh-Metrix features based on the stem, key, and alternatives of each item; the cosine similarity between the stem and alternatives; and the word embeddings based on the stem, key, and alternatives) and the ground truth of item difficulty. The final data frame is composed of 882 features and the target variable of item difficulty.

3.2.2 Model Training

To identify the best model with the lowest RMSE value, we used 85% of the data as our training set and 15% as our holdout test set. Because the sample size was small (N = 466 items shared in total) and we had a very large set of features (N = 882), we first applied a dimension reduction technique, Principal Component Analysis (PCA) (Wold et al., 1987). A PCA model with 30 components explained 99% of the variability in the dataset, and thus the final feature set included the 30 components extracted through PCA. We used lasso regression (Tibshirani, 1996) with repeated 5-fold cross-validation to select the best hyperparameter (i.e., alpha). Alpha in lasso regression is the penalty that determines the amount of shrinkage in the model. An advantage of lasso regression is its regularization algorithm, which controls for irrelevant features in the model by shrinking their contribution to zero. An alpha value of .01 yielded the best model during the cross-validation stage.
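
The following sketch reproduces the general shape of this training setup with scikit-learn; the random matrices stand in for the real 882 features and item difficulties, and the alpha grid, the number of repeats, and the added standardization step are our assumptions rather than details reported in the paper.

# Sketch of the difficulty model: PCA down to 30 components followed by lasso regression
# tuned with repeated 5-fold cross-validation over the alpha penalty.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(466, 882))        # placeholder feature matrix
y = rng.uniform(0.0, 1.0, size=466)    # placeholder item difficulties

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),      # scaling before PCA is our addition
    ("pca", PCA(n_components=30)),
    ("lasso", Lasso(max_iter=10000)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": [0.001, 0.005, 0.01, 0.05, 0.1]},  # assumed grid
    scoring="neg_root_mean_squared_error",
    cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=42),
)
search.fit(X_train, y_train)

preds = search.predict(X_test)
rmse = np.sqrt(np.mean((preds - y_test) ** 2))
print(search.best_params_, rmse)       # alpha = .01 was selected in the study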

3.2.3 Results

With our pseudo-test set held out from the shared training set, we obtained a Mean Squared Error (MSE) value of .064, a Root Mean Squared Error (RMSE) value of .253, a Mean Absolute Error (MAE) value of .190, and a Pearson's correlation coefficient of .555.

3.3 Our Best Model for Response Time Prediction

Our solution that achieved the best performance in predicting response time differed from the one that was best at predicting item difficulty. This solution is briefly documented below.

3.3.1 Feature Extraction

First, FastText word embeddings were generated for each item stem and response option. We employed the pre-trained FastText embeddings (wiki-news-300d-1M.vec.zip, obtained from https://fasttext.cc/docs/en/english-vectors.html) to map each word in the text to its corresponding 300-dimensional vector representation. FastText is a modified version of Word2Vec; the difference is that it treats each word as being composed of character n-grams rather than as a single unit, as in Word2Vec (Mikolov et al., 2017). For each text option, the embeddings of the first 60 words were concatenated to form a feature vector, resulting in a dimension of 18,000 (60 words × 300 dimensions) for each option. If the text had fewer than 60 words, the corresponding vector was padded with zeros.
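
A minimal sketch of this padding-and-concatenation scheme is given below; it assumes the plain-text wiki-news-300d-1M.vec file is available locally, and the vocabulary limit and example text are illustrative choices rather than details from the study.

# Sketch: map the first 60 words of a text to 300-d FastText vectors and concatenate
# them into one 18,000-d feature vector, zero-padding shorter texts.
import numpy as np

def load_fasttext(path, limit=50000):
    """Read the plain-text .vec format into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8", newline="\n", errors="ignore") as f:
        next(f)  # skip the header line (vocabulary size and dimension)
        for i, line in enumerate(f):
            if limit and i >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def text_to_features(text, vectors, max_words=60, dim=300):
    words = text.split()[:max_words]
    feats = np.zeros(max_words * dim, dtype=np.float32)  # zero padding by default
    for i, word in enumerate(words):
        if word in vectors:
            feats[i * dim:(i + 1) * dim] = vectors[word]
    return feats

vectors = load_fasttext("wiki-news-300d-1M.vec")
option_features = text_to_features("Administer intravenous fluids", vectors)  # placeholder text
print(option_features.shape)  # (18000,)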

Similar to the approach taken for the item difficulty predictions, cosine similarity scores between each pair of alternatives (i.e., response options) were calculated using the embeddings from the BiomedBERT model. For each pair, the cosine similarity between their embeddings was computed to capture the semantic differences between the response options. The extracted features were then combined with the dummy-coded item development information (e.g., text-only items vs. items including pictures; administration step in the USMLE series) to form the final feature set. Unlike in the item difficulty prediction, we did not extract any other linguistic features for response time prediction.

3.3.2 Model Training

Considering the extremely high dimensionality of the features, we performed feature selection and dimension reduction.
First, using the available information on response time in the training set (N = 466), we eliminated the feature columns that had an absolute correlation coefficient smaller than 0.1. Then, we performed PCA to extract principal components until they captured 95% of the information in the original feature set. As a result, we obtained a final feature set with 339 features for training an algorithm.

As before, the model training involved lasso regression due to its ability to perform feature selection and handle multicollinearity in high-dimensional data. The training process was performed using 10-fold cross-validation to optimize the hyperparameter (i.e., alpha) and evaluate the model's performance. In terms of the hyperparameter search space, the regularization strength (alpha) was tuned using a randomized search over a logarithmic scale from 1e-4 to 1e-0.05, with 1,000 candidate values. An alpha value of .44 yielded the best model during the cross-validation stage. Additionally, the fit intercept parameter was tested with both True and False values, while the selection parameter was tested with 'cyclic' and 'random' options.¹

¹ Our code for predicting item difficulty and response time is available at https://osf.io/dwe4n/.
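
The sketch below mirrors this setup with scikit-learn's RandomizedSearchCV. The placeholder data, the combination of the correlation filter and PCA into a single script, and the use of a loguniform distribution for alpha are our assumptions about an implementation the paper does not spell out.

# Sketch of the response-time model: drop features with |r| < 0.1 against response time,
# keep enough principal components for 95% of the variance, and tune lasso regression
# with a randomized search over alpha (10-fold CV).
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_df = pd.DataFrame(rng.normal(size=(466, 1000)))        # placeholder feature matrix
y = pd.Series(rng.normal(loc=90, scale=30, size=466))    # placeholder response times (seconds)

# Univariate correlation filter: keep features with |r| >= 0.1 against the target.
correlations = X_df.corrwith(y).abs()
X_filtered = X_df.loc[:, correlations >= 0.1]

pipeline = Pipeline([
    ("pca", PCA(n_components=0.95)),   # keep components explaining 95% of the variance
    ("lasso", Lasso(max_iter=10000)),
])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "lasso__alpha": loguniform(1e-4, 10 ** -0.05),   # log scale from 1e-4 to 1e-0.05
        "lasso__fit_intercept": [True, False],
        "lasso__selection": ["cyclic", "random"],
    },
    n_iter=1000,
    scoring="neg_root_mean_squared_error",
    cv=10,
    random_state=42,
)
search.fit(X_filtered, y)
print(search.best_params_)             # alpha = .44 was selected in the study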

3.3.3 Results

Upon comparing our predicted response times with the response times released for the BEA 2024 Shared Task, we found that this solution (RMSE = 31.48; MSE = 990.98; MAE = 23.54; r = 0.209) was slightly better than the baseline dummy regressor (RMSE = 31.68) and ranked 24th among the 34 submissions.

4 Discussion and Conclusion

The competition results for the BEA 2024 Shared Task indicated that it is difficult to predict item characteristics such as difficulty using linguistic features (Yaneva et al., 2024). Only 15 teams out of 43 managed to perform better than a baseline dummy regressor when predicting item difficulty using textual features extracted from the items. These results suggest that linguistic features may not be sufficient to capture the complex interplay between item features and item difficulty.

Unlike predicting item difficulty, predicting the average response time using linguistic features appears to be a more promising task. Out of 34 submissions, 24 teams performed better than a baseline dummy regressor in predicting the average response time. This finding is not necessarily surprising because the average reading time required for the items is likely to be correlated with the linguistic features extracted from the items.

Overall, the results of the BEA 2024 Shared Task indicate that predicting item characteristics such as difficulty remains challenging and requires factors beyond linguistic or textual features.

References

Samah AlKhuzaey, Floriana Grasso, Terry R. Payne, and Valentina Tamma. 2023. Text-based question difficulty prediction: A systematic review of automatic approaches. International Journal of Artificial Intelligence in Education, pages 1–53.

Yigal Attali, Luis Saldivia, Carol Jackson, Fred Schuppan, and Wilbur Wanamaker. 2014. Estimating item difficulty with comparative judgments. ETS Research Report Series, 2014(2):1–8.

Peter Baldwin, Victoria Yaneva, Janet Mee, Brian E. Clauser, and Le An Ha. 2021. Using natural language processing to predict item response times and improve test construction. Journal of Educational Measurement, 58(1):4–30.

Ahmad Baylari and Gh. A. Montazer. 2009. Design a personalized e-learning system based on item response theory and artificial neural network approach. Expert Systems with Applications, 36(4):8013–8021.

Isaac I. Bejar. 1983. Subject matter experts' assessment of item statistics. Applied Psychological Measurement, 7(3):303–310.

Luca Benedetto, Giovanni Aradelli, Paolo Cremonesi, Andrea Cappelli, Andrea Giussani, and Roberto Turrin. 2021. On the application of transformers for estimating the difficulty of multiple-choice questions from text. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 147–157.

Luca Benedetto, Andrea Cappelli, Roberto Turrin, and Paolo Cremonesi. 2020a. Introducing a framework to assess newly created questions with natural language processing. In International Conference on Artificial Intelligence in Education, pages 43–54. Springer.

Luca Benedetto, Andrea Cappelli, Roberto Turrin, and Paolo Cremonesi. 2020b. R2DE: A NLP approach to estimating IRT parameters of newly generated questions. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, pages 412–421.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Priscilla A. Drum, Robert C. Calfee, and Linda K. Cook. 1981. The effects of surface structure variables on performance in reading comprehension tests. Reading Research Quarterly, pages 486–514.

T. Firoozi, O. Bulut, C. Demmans Epp, A. N. Abadi, and D. Barbosa. 2022. The effect of fine-tuned word embedding techniques on the accuracy of automated essay scoring systems using neural networks. Journal of Applied Testing Technology, 23:21–29.

Roy Freedle and Irene Kostin. 1993. The prediction of TOEFL reading item difficulty: Implications for construct validity. Language Testing, 10(2):133–170.

Arthur C. Graesser, Danielle S. McNamara, Zhiqiang Cai, Mark Conley, Haiying Li, and James Pennebaker. 2014. Coh-Metrix measures text characteristics at multiple levels of language and discourse. The Elementary School Journal, 115(2):210–229.

Arthur C. Graesser, Danielle S. McNamara, and Jonna M. Kulikowich. 2011. Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40(5):223–234.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing.

Martin Hecht, Thilo Siegle, and Sebastian Weirich. 2017. A model for the estimation of testlet response time to optimize test assembly in paper-and-pencil large-scale assessments. Journal for Educational Research Online, 9(1):32–51.

Fu-Yuan Hsu, Hahn-Ming Lee, Tao-Hsing Chang, and Yao-Ting Sung. 2018. Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques. Information Processing & Management, 54(6):969–984.

James C. Impara and Barbara S. Plake. 1998. Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1):69–81.

Tom Kubiszyn and Gary D. Borich. 2024. Educational Testing and Measurement. John Wiley & Sons.

Danielle S. McNamara, Arthur C. Graesser, Philip M. McCarthy, and Zhiqiang Cai. 2014. Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge University Press.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2017. Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.

Kyle Perkins, Lalit Gupta, and Ravi Tammana. 1995. Predicting item difficulty in a reading comprehension test with an artificial neural network. Language Testing, 12(1):34–53.

Gerard Salton. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.

Yuni Susanti, Takenobu Tokunaga, Hitoshi Nishikawa, and Hiroyuki Obari. 2017. Controlling item difficulty for automatic vocabulary question generation. Research and Practice in Technology Enhanced Learning, 12:1–16.

Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 58(1):267–288.

Kelly Wauters, Piet Desmet, and Wim Van Den Noortgate. 2012. Item difficulty estimation: An auspicious collaboration between data and judgment. Computers & Education, 58(4):1183–1193.

Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52.

Victoria Yaneva, Peter Baldwin, Janet Mee, et al. 2019. Predicting the difficulty of multiple choice questions in a high-stakes medical exam. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 11–20.

Victoria Yaneva, Kai North, Peter Baldwin, An Ha Le, Saed Rezayi, Yiyun Zhou, Sagnik Ray Choudhury, Polina Harik, and Brian Clauser. 2024. Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple Choice Questions. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico. Association for Computational Linguistics.

Chien-Lin Yang, Thomas R. O'Neill, and Gene A. Kramer. 2002. Examining item difficulty and response time on perceptual ability test items. Journal of Applied Measurement, 3(3):282–299.

Ya Zhou and Can Tao. 2020. Multi-task BERT for problem difficulty prediction. In 2020 International Conference on Communications, Information System and Computer Engineering (CISCE), pages 213–216. IEEE.