Item Difficulty and Response Time Prediction with Large Language Models: An Empirical Analysis of USMLE Items
Okan Bulut, Guher Gorgun, Bin Tan
Measurement, Evaluation, and Data Science
Faculty of Education, University of Alberta, Canada
{bulut, gorgun, btan4}@ualberta.ca
Abstract

This paper summarizes our methodology and results for the BEA 2024 Shared Task. This competition focused on predicting item difficulty and response time for retired multiple-choice items from the United States Medical Licensing Examination® (USMLE®). We extracted linguistic features from the item stem and response options using multiple methods, including the BiomedBERT model, FastText embeddings, and Coh-Metrix. The extracted features were combined with additional features available in item metadata (e.g., item type) to predict item difficulty and average response time. The results showed that the BiomedBERT model was the most effective in predicting item difficulty, while the fine-tuned model based on FastText word embeddings was the best model for predicting response time.

1 Introduction

In standardized exams, the examination of item characteristics is highly crucial for ensuring the fairness and validity of test results. For example, the difficulty of items pertains to the likelihood of an examinee answering the items correctly. Incorporating a broad range of item difficulty levels in a standardized exam can help reduce measurement error and thereby improve the accuracy of the measurement process (Kubiszyn and Borich, 2024). In addition, while response time is often linked to item difficulty (i.e., more difficult items require more time to answer) (Yang et al., 2002), this variable itself can also offer new insights into examinees' test completion processes, such as their testing engagement and cognitive processes, thereby supporting the validity of test results. Furthermore, understanding item characteristics can also be advantageous for modern test administration methods, including applications in automated item assembly, computerized adaptive testing, and personalized assessments (Baylari and Montazer, 2009; Wauters et al., 2012).

The difficulty of items and the average response time required to answer them are typically estimated based on empirical data collected during test pretesting. However, pretesting and obtaining robust results often require a large sample of examinees, which can incur substantial test administration costs. As a result, researchers have explored various methods to predict item characteristics without an actual test administration. For instance, researchers have sought estimates of item difficulty from domain experts and test development professionals. However, this approach has not consistently produced satisfactory or reliable estimations (Bejar, 1983; Attali et al., 2014; Wauters et al., 2012; Impara and Plake, 1998). Another line of research seeks to predict item characteristics based only on item texts, such as the passages in source-based items, the item stem, and the response options (Yaneva et al., 2019; Hsu et al., 2018). This approach employs text-mining techniques to extract surface features (e.g., the number of words in the texts) and complex features (e.g., semantic similarities of sentences) from item texts to make predictions using advanced statistical models.

Building on the second line of research in predicting item characteristics based on item texts, the National Board of Medical Examiners (NBME) initiated the BEA 2024 Shared Task (https://sig-edu.org/sharedtask/2024) for the automated prediction of item difficulty and item response time. The released dataset contained 667 previously used and now retired items from the United States Medical Licensing Examination® (USMLE®). The USMLE is a series of high-stakes examinations (also known as Steps; https://www.usmle.org/step-exams) that support medical licensure decisions in the United States. The items from USMLE Steps 1, 2 Clinical Knowledge (CK), and 3 focus on a wide range of topics relevant to the practice of medicine.
In the BEA 2024 Shared Task, research teams were invited to utilize natural language processing (NLP) methods for extracting linguistic features of the items and using them to predict the difficulty and response time of the items. Our team employed state-of-the-art large language models (LLMs) to extract the features and build predictive models for item difficulty and response time. This paper documents the methods and results of our best-performing models for predicting item difficulty and response time separately.

2 Related work

The interest and effort in predicting item difficulty based on item texts date back decades in the measurement literature. Early work in item difficulty prediction primarily focused on identifying how item difficulty is influenced by a set of readily available, easily extracted, or manually coded item-level features. For example, Drum et al. (1981) predicted the difficulty of 210 reading comprehension items using various surface structure variables and word frequency measures for the text, such as the number of words, content words, or content-function words. Freedle and Kostin (1993) predicted the difficulty of 213 reading comprehension items using 12 categories of sentential and discourse variables, such as vocabulary, length of texts, and syntactic structures (e.g., the number of negations). Perkins et al. (1995) employed artificial neural networks to predict the item difficulty of 29 items in a reading comprehension test. They coded the items to extract three types of features: text structure (e.g., the number of words, lines, paragraphs, sentences, and content words), propositional analysis of passages and stems (e.g., the number of arguments, modifiers, and predicates), and cognitive process (e.g., identify, recognize, verify, infer, generalize, or problem-solving).

Research focused on the prediction of item characteristics such as difficulty and response time has been significantly influenced by the availability and application of emerging techniques in NLP and machine learning (AlKhuzaey et al., 2023). For example, Yaneva et al. (2019) employed NLP methods to extract syntactic features to predict item difficulty, which were identified as crucial predictors. Another application of NLP methods involves assessing the linguistic complexity or readability of item texts to predict item difficulty. Benedetto et al. (2020a), for instance, calculated readability indices for item texts and combined them with other features to predict item difficulty. However, readability indices did not perform well as predictors of item difficulty, a finding consistent with Susanti et al. (2017), who noted that readability indices were among the least important predictors of item difficulty.

NLP methods can also be used to extract Term Frequency-Inverse Document Frequency (TF-IDF) features. TF-IDF measures the frequency of words or word sequences in a document and adjusts this count based on their frequency across a collection of documents. This approach emphasizes the importance of specific words to a particular document, with higher values indicating greater potential importance (Salton, 1983). In a relatively recent study predicting item difficulty for newly generated multiple-choice questions, Benedetto et al. (2020b) extracted TF-IDF features and achieved a root mean square error of 0.753.

An important application of NLP techniques is the extraction of semantic features from item texts. Word embedding is a technique that converts texts into numerical values in vector space, capturing the meanings of words across different dimensions (Mikolov et al., 2013). Pre-trained NLP models such as Word2Vec and GloVe allow researchers to extract word embedding features from item texts (e.g., Firoozi et al., 2022). For example, Hsu et al. (2018) transformed item texts into semantic vectors and then used cosine similarity to measure the semantic similarity between different pairs of items. Additionally, Yaneva et al. (2019) extracted word embedding features from multiple-choice items in high-stakes medical exams. Along with other linguistic and psycholinguistic features in predicting item difficulty, they found that word embedding features contributed most to the predictive power.

More recently, a significant breakthrough in the NLP field has been the development of LLMs such as BERT (Devlin et al., 2018) and its variants, which were trained using different mechanisms or training datasets. For example, Zhou and Tao (2020) utilized a BERT-variant model to predict the difficulty of programming problems. Their results showed that, compared with BERT, DistilBERT, a small version of the BERT base model, was the best-performing model when the only available data for fine-tuning was the text of the items. Benedetto et al. (2021) also compared the performance of BERT and DistilBERT in predicting the difficulty of multiple-choice questions and found that the BERT-based models significantly outperformed the two baseline models.
Unlike the prediction of item difficulty, the prediction of response time has not been widely investigated in the literature. This is mainly due to the limited availability of response time data. However, with the increasing use of digital assessments, such as computer-based and computerized-adaptive tests, in operational testing, the collection of response data has become easier, which has motivated researchers to employ predictive models to predict the average response time required to solve the items (e.g., Baldwin et al., 2021; Hecht et al., 2017; Yaneva et al., 2019).

3 Methodology

3.1 Dataset

As mentioned earlier, this study utilized an empirical dataset released by NBME, which included 667 multiple-choice items previously administered in the USMLE series. Due to the requirements of the BEA 2024 Shared Task, the data was released in two stages. Initially, 466 multiple-choice items were provided for extracting linguistic features from the items and building predictive models based on the extracted features. For each item, the dataset encompasses the source texts (typically a clinical case followed by a question) and the texts for each response option. The number of response options varies and can include up to 10 options, each represented in a separate column. When an item had fewer than 10 response options, the remaining columns were left empty.

Additionally, the dataset contained metadata with four additional variables: item type (text-only items versus items containing pictures), exam step (Steps 1, 2, or 3 in the USMLE series), item difficulty, and average response time. Subsequently, the predictive models trained in the first stage were applied to make predictions for the remaining 201 items in the second stage, which served as the testing set for evaluating the performance of the predictive models for item difficulty and response time. The structure of the second dataset mirrors that of the first, with the exception that the item difficulty and response time variables were not immediately available. These variables were released after the submission deadline for the BEA 2024 Shared Task, allowing for the identification of the best-performing trials among the participating teams.
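To make this wide layout concrete, the sketch below shows how such a file can be flattened into a single text per item, mirroring the concatenation step described in Section 3.2.1. The file and column names (items.csv, ItemText, Answer_A through Answer_J) are illustrative assumptions, not the actual NBME field names.

```python
import pandas as pd

# Illustrative column names; the actual NBME release may label these differently.
OPTION_COLS = [f"Answer_{letter}" for letter in "ABCDEFGHIJ"]  # up to 10 options

items = pd.read_csv("items.csv")  # hypothetical file name

def concatenate_item(row: pd.Series) -> str:
    """Join the stem with every non-empty response option into one text."""
    parts = [str(row["ItemText"])]
    parts += [str(row[col]) for col in OPTION_COLS if col in row and pd.notna(row[col])]
    return " ".join(parts)

items["full_text"] = items.apply(concatenate_item, axis=1)
```

A flattened text column of this kind is the input to the feature extraction described next.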
3.2 Our Best Model for Difficulty Prediction

Here, we describe the details of our best-performing model for predicting item difficulty (RMSE = .318), which performed slightly worse than a baseline dummy regressor (RMSE = .311) and ranked at the 20th place out of 43 submissions in the difficulty prediction leaderboard.

3.2.1 Feature Extraction

We extracted linguistic features from item stems and response alternatives (i.e., the answer key and the incorrect response options) by leveraging both pre-trained large language models and more interpretable text representations such as connectivity, cohesion, and text length. We started the feature extraction process by concatenating the stem, key, and alternatives of each item in a single data frame column and separated each item into individual data files to extract Coh-Metrix features (McNamara et al., 2014; Graesser et al., 2011). Concatenating item stems and alternatives served two purposes: (1) to adequately represent item length in terms of stem and alternatives, and (2) to control for the differential number of alternatives that each item includes. Coh-Metrix includes 108 features and analyzes a text on multiple measures of language and discourse (Graesser et al., 2011).

Coh-Metrix focuses on six theoretical levels of text representation: words, syntax, the explicit textbase, the referential situation model, the discourse genre and rhetorical structure, and the pragmatic communication level (Graesser et al., 2014). It generates indices of text, including paragraph count, sentence count, word count, narrativity, syntactic simplicity, referential cohesion, deep cohesion, noun overlap, stem overlap, latent semantic analysis, lexical diversity, syntactic complexity, syntactic pattern density, and readability. We removed four features from the Coh-Metrix indices due to no variability: paragraph count (i.e., the number of paragraphs), the standard deviation of paragraph length, the mean Latent Semantic Analysis (LSA) overlap in adjacent paragraphs, and the standard deviation of LSA overlap in adjacent paragraphs.
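A zero-variance filter of this kind can be expressed generically with scikit-learn, as in the sketch below; the variable cohmetrix_df stands in for the extracted Coh-Metrix feature table and is an assumed name.

```python
from sklearn.feature_selection import VarianceThreshold

# Drop Coh-Metrix columns with no variability across items; paragraph-level
# indices are constant here because each concatenated item is one paragraph.
selector = VarianceThreshold(threshold=0.0)
selector.fit(cohmetrix_df)
cohmetrix_pruned = cohmetrix_df.loc[:, selector.get_support()]
```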
In the next step, we utilized the BiomedBERT model (Gu et al., 2020) to extract new features. This model, which was previously named PubMedBERT, is a pretrained LLM based on abstracts from PubMed and full-text articles from PubMed Central. We chose this particular model because it is known to achieve state-of-the-art performance on many biomedical NLP tasks. By using BiomedBERT, we obtained sentence embeddings for the item stems and alternatives and then computed the cosine similarity between the item stem embeddings and the alternative embeddings. Cosine similarity, which is commonly used to quantify the degree of similarity between two sets of information, was computed as the cosine of the angle between the embedding vectors of the item stem and alternatives. Cosine similarity ranges between 0 and 1; values closer to 1 indicate greater resemblance between the embedding vectors obtained from the item stem and the alternatives.

In the final step, we also extracted word embeddings for the concatenated text of stems and alternatives by tokenizing the text with the BiomedBERT model (Gu et al., 2020). BiomedBERT produces 768-dimensional embeddings with a maximum input length of 512 tokens, and we extracted the last hidden layer of embeddings. We then created a new data frame composed of the three sets of extracted features (i.e., Coh-Metrix features from the stem, key, and alternatives of each item; the cosine similarity between the stem and alternatives; and word embeddings of the stem, key, and alternatives) and the ground truth of item difficulty. The final data frame comprises 882 features and the target variable of item difficulty.
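As an illustration of this step, the sketch below derives sentence embeddings and their cosine similarity with the Hugging Face transformers library. The checkpoint identifier, the mean pooling over the last hidden layer, and the example texts are assumptions of the sketch rather than a verbatim excerpt of our code (available at https://osf.io/dwe4n/).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hugging Face checkpoint assumed for illustration; BiomedBERT was
# previously published under the PubMedBERT name.
CHECKPOINT = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)

@torch.no_grad()
def sentence_embedding(text: str) -> torch.Tensor:
    # Mean-pool the last hidden layer into a single 768-dimensional vector
    # (the pooling strategy is an assumption of this sketch).
    encoded = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    hidden = model(**encoded).last_hidden_state   # shape: (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)          # shape: (768,)

stem_vec = sentence_embedding("A 45-year-old man presents with chest pain ...")
alts_vec = sentence_embedding("Aspirin. Atenolol. Atorvastatin. Alteplase.")
similarity = torch.nn.functional.cosine_similarity(stem_vec, alts_vec, dim=0)
print(round(similarity.item(), 3))
```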
3.2.2 Model Training

To identify the best model with the lowest RMSE value, we used 85% of the data as our training set and 15% as our holdout test set. Because the sample size was small (N = 466 items shared in total) and we had a very large set of features (N = 882), we first applied a dimension reduction technique, Principal Component Analysis (PCA) (Wold et al., 1987). A PCA model with 30 components explained 99% of the variability in the dataset, and thus the final feature set included the 30 components extracted through the PCA analysis. We used lasso regression (Tibshirani, 1996) with repeated 5-fold cross-validation to select the best hyperparameter (i.e., alpha). Alpha in lasso regression is the model penalty that determines the amount of shrinkage in the model. An advantage of lasso regression is its regularization algorithm, which controls for irrelevant features by shrinking their contribution in the model to zero. An alpha value of .01 yielded the best model during the cross-validation stage.
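A compact scikit-learn version of this training setup is sketched below; the standardization step, the alpha grid, the number of repeats, and the random seeds are illustrative assumptions, with X holding the 882 features and y the empirical item difficulties.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 85% / 15% split of the shared training items.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.15, random_state=42)

pipeline = make_pipeline(
    StandardScaler(),       # scaling before PCA is an assumption of this sketch
    PCA(n_components=30),   # 30 components explained ~99% of the variance
    Lasso(max_iter=10_000),
)
search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": np.logspace(-3, 0, 20)},  # grid assumed
    cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)  # our actual tuning selected alpha = .01
print(search.best_params_, -search.score(X_hold, y_hold))
```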
3.2.3 Results

With our pseudo-test set held out from the shared training set, we obtained a Mean Squared Error (MSE) value of .064, a Root Mean Squared Error (RMSE) value of .253, a Mean Absolute Error (MAE) value of .190, and a Pearson's correlation coefficient of .555.

3.3 Our Best Model for Response Time Prediction

Our solution that achieved the best performance in predicting response time differed from the one that was best at predicting item difficulty. This solution is briefly documented below.

3.3.1 Feature Extraction

First, FastText word embeddings were generated for each item stem and response option. We employed the pre-trained FastText embeddings (wiki-news-300d-1M.vec.zip, obtained from https://fasttext.cc/docs/en/english-vectors.html) to map each word in the text to its corresponding 300-dimensional vector representation. FastText is a modified version of word2vec; the difference is that it treats each word as composed of n-grams rather than as a single unit, as in Word2Vec (Mikolov et al., 2017). For each text option, the embeddings of the first 60 words were concatenated to form a feature vector, resulting in a dimension of 18,000 (60 words × 300 dimensions) for each option. If the text had fewer than 60 words, the corresponding vector was padded with zeros.

Similar to the approach taken for the item difficulty predictions, cosine similarity scores between each pair of alternatives (i.e., response options) were calculated using the embeddings from the BiomedBERT model. For each pair, the cosine similarity between their embeddings was computed to capture the semantic differences between response options. The extracted features were then combined with the dummy-coded item development information (e.g., text-based items vs. items including pictures; administration step in the USMLE series) to form the final feature set. Unlike in the item difficulty prediction, we did not extract any other linguistic features for response time prediction.
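The fixed-length construction can be sketched with gensim as follows; lower-casing and whitespace tokenization are assumptions of the sketch.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained vectors downloaded from fasttext.cc (unzipped .vec file).
vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

MAX_WORDS, DIM = 60, 300

def option_features(text: str) -> np.ndarray:
    """Concatenate the first 60 word vectors (18,000 values), zero-padded."""
    feats = np.zeros(MAX_WORDS * DIM, dtype=np.float32)
    # Simple whitespace tokenization is an assumption of this sketch.
    for i, token in enumerate(text.lower().split()[:MAX_WORDS]):
        if token in vectors:
            feats[i * DIM:(i + 1) * DIM] = vectors[token]
    return feats
```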
3.3.2 Model Training

Considering the extremely high dimensionality of the features, we performed feature selection and dimension reduction. First, using the available information on response time in the training set (N = 466), we eliminated the feature columns that had an absolute correlation coefficient smaller than 0.1. Then, we performed PCA to extract principal components until they captured 95% of the information presented in the original feature set. As a result, we obtained a final feature set of 339 features for training.

As before, the model training involved the use of lasso regression due to its ability to perform feature selection and handle multicollinearity in high-dimensional data. The training process was performed using 10-fold cross-validation to optimize the hyperparameter (i.e., alpha) and evaluate the model's performance. In terms of the hyperparameter search space, the regularization strength (alpha) was tuned using a randomized search over a logarithmic scale from 1e-4 to 1e-0.05, with 1000 candidate values. An alpha value of .44 yielded the best model during the cross-validation stage. Additionally, the fit_intercept parameter was tested with both True and False values, while the selection parameter was tested with 'cyclic' and 'random' options.¹

¹Our code for predicting item difficulty and response time is available at https://osf.io/dwe4n/.
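Putting the selection and tuning steps above together, a scikit-learn sketch might look as follows; the random seed, the handling of constant columns, and the reading of "1e-0.05" as the upper bound 10^-0.05 are assumptions of the illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

# Keep only features whose absolute correlation with response time is >= 0.1;
# nan_to_num guards against constant columns (an assumption of this sketch).
corrs = np.nan_to_num([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
X_selected = X[:, np.abs(corrs) >= 0.1]

# Retain principal components up to 95% explained variance (339 features here).
X_reduced = PCA(n_components=0.95).fit_transform(X_selected)

search = RandomizedSearchCV(
    Lasso(max_iter=10_000),
    param_distributions={
        "alpha": loguniform(1e-4, 10 ** -0.05),  # log scale from 1e-4 to 1e-0.05
        "fit_intercept": [True, False],
        "selection": ["cyclic", "random"],
    },
    n_iter=1000,
    cv=10,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_reduced, y)  # our actual tuning selected alpha = .44
```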
3.3.3 Results

Upon comparing our predicted response times with the released response times from the BEA 2024 Shared Task, we found that this solution (RMSE = 31.48; MSE = 990.98; MAE = 23.54; r = 0.209) was slightly better than the baseline dummy regressor (RMSE = 31.68) and ranked 24th among the 34 submissions.

4 Discussion and Conclusion

The competition results for the BEA 2024 Shared Task indicated that it is difficult to predict item characteristics such as difficulty using linguistic features (Yaneva et al., 2024). Only 15 teams out of 43 managed to perform better than a baseline dummy regressor when it comes to predicting item difficulty using textual features extracted from the items. These results suggest that linguistic features may not be sufficient to capture the complex interplay between item features and item difficulty.

Unlike predicting item difficulty, predicting the average response time using linguistic features appears to be a more promising task. Out of 34 submissions, 24 teams performed better than a baseline dummy regressor in predicting the average response time. This finding is not necessarily surprising because the average reading time required for the items is likely to be correlated with the linguistic features extracted from the items.

Overall, the results for the BEA 2024 Shared Task indicate that predicting item characteristics such as difficulty remains challenging and requires factors beyond linguistic or textual features.

References

Samah AlKhuzaey, Floriana Grasso, Terry R. Payne, and Valentina Tamma. 2023. Text-based question difficulty prediction: A systematic review of automatic approaches. International Journal of Artificial Intelligence in Education, pages 1–53.

Yigal Attali, Luis Saldivia, Carol Jackson, Fred Schuppan, and Wilbur Wanamaker. 2014. Estimating item difficulty with comparative judgments. ETS Research Report Series, 2014(2):1–8.

Peter Baldwin, Victoria Yaneva, Janet Mee, Brian E. Clauser, and Le An Ha. 2021. Using natural language processing to predict item response times and improve test construction. Journal of Educational Measurement, 58(1):4–30.

Ahmad Baylari and Gh. A. Montazer. 2009. Design a personalized e-learning system based on item response theory and artificial neural network approach. Expert Systems with Applications, 36(4):8013–8021.

Isaac I. Bejar. 1983. Subject matter experts' assessment of item statistics. Applied Psychological Measurement, 7(3):303–310.

Luca Benedetto, Giovanni Aradelli, Paolo Cremonesi, Andrea Cappelli, Andrea Giussani, and Roberto Turrin. 2021. On the application of transformers for estimating the difficulty of multiple-choice questions from text. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 147–157.

Luca Benedetto, Andrea Cappelli, Roberto Turrin, and Paolo Cremonesi. 2020a. Introducing a framework to assess newly created questions with natural language processing. In International Conference on Artificial Intelligence in Education, pages 43–54. Springer.

Luca Benedetto, Andrea Cappelli, Roberto Turrin, and Paolo Cremonesi. 2020b. R2DE: An NLP approach to estimating IRT parameters of newly generated questions. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, pages 412–421.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Priscilla A. Drum, Robert C. Calfee, and Linda K. Cook. 1981. The effects of surface structure variables on performance in reading comprehension tests. Reading Research Quarterly, pages 486–514.

T. Firoozi, O. Bulut, C. Demmans Epp, A. N. Abadi, and D. Barbosa. 2022. The effect of fine-tuned word embedding techniques on the accuracy of automated essay scoring systems using neural networks. Journal of Applied Testing Technology, 23:21–29.

Roy Freedle and Irene Kostin. 1993. The prediction of TOEFL reading item difficulty: Implications for construct validity. Language Testing, 10(2):133–170.

Arthur C. Graesser, Danielle S. McNamara, Zhiqang Cai, Mark Conley, Haiying Li, and James Pennebaker. 2014. Coh-Metrix measures text characteristics at multiple levels of language and discourse. The Elementary School Journal, 115(2):210–229.

Arthur C. Graesser, Danielle S. McNamara, and Jonna M. Kulikowich. 2011. Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40(5):223–234.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing.

Martin Hecht, Thilo Siegle, and Sebastian Weirich. 2017. A model for the estimation of testlet response time to optimize test assembly in paper-and-pencil large-scale assessments. Journal for Educational Research Online, 9(1):32–51.

Fu-Yuan Hsu, Hahn-Ming Lee, Tao-Hsing Chang, and Yao-Ting Sung. 2018. Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques. Information Processing & Management, 54(6):969–984.

James C. Impara and Barbara S. Plake. 1998. Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1):69–81.

Tom Kubiszyn and Gary D. Borich. 2024. Educational testing and measurement. John Wiley & Sons.

Danielle S. McNamara, Arthur C. Graesser, Philip M. McCarthy, and Zhiqiang Cai. 2014. Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2017. Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.

Kyle Perkins, Lalit Gupta, and Ravi Tammana. 1995. Predicting item difficulty in a reading comprehension test with an artificial neural network. Language Testing, 12(1):34–53.

Gerard Salton. 1983. Introduction to modern information retrieval. McGraw-Hill.

Yuni Susanti, Takenobu Tokunaga, Hitoshi Nishikawa, and Hiroyuki Obari. 2017. Controlling item difficulty for automatic vocabulary question generation. Research and Practice in Technology Enhanced Learning, 12:1–16.

Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.

Kelly Wauters, Piet Desmet, and Wim Van Den Noortgate. 2012. Item difficulty estimation: An auspicious collaboration between data and judgment. Computers & Education, 58(4):1183–1193.

Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52.

Victoria Yaneva, Peter Baldwin, Janet Mee, et al. 2019. Predicting the difficulty of multiple choice questions in a high-stakes medical exam. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 11–20.

Victoria Yaneva, Kai North, Peter Baldwin, An Ha Le, Saed Rezayi, Yiyun Zhou, Sagnik Ray Choudhury, Polina Harik, and Brian Clauser. 2024. Findings from the first shared task on automated prediction of difficulty and response time for multiple choice questions. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico. Association for Computational Linguistics.

Chien-Lin Yang, Thomas R. O'Neill, and Gene A. Kramer. 2002. Examining item difficulty and response time on perceptual ability test items. Journal of Applied Measurement, 3(3):282–299.

Ya Zhou and Can Tao. 2020. Multi-task BERT for problem difficulty prediction. In 2020 International Conference on Communications, Information System and Computer Engineering (CISCE), pages 213–216. IEEE.
You might also like

Headway Pre Intermediate PDF (Document, 145 pages)
Western Aphasia Battery Test WAB Sample 2 (Document, 14 pages)
Question Summation and Sentence Similarity Using BERT For Key Information Extraction (Document, 6 pages)
Automated Grading Model With Adjusted Level of Lenience For Short Answer Questions Using Natural Language Processing (Document, 8 pages)
An Empirical Study of Information Retrieval and Machine Reading Comprehension Algorithms For An Online Education Platform (Document, 10 pages)
Learning To Grade Short Answer Questions Using Semantic Similarity Measures and Dependency Graph Alignments (Document, 11 pages)
Text-To-Text Semantic Similarity For Automatic Short Answer Grading (Document, 9 pages)
Key2Vec Automatic Ranked Keyphrase Extraction From Scientific Articles Using Phrase Embeddings (Document, 6 pages)
Application of The B Method For Evaluating Free-Text Answers in An E-Learning Environment (Document, 4 pages)
Real-time Evaluation of Descriptive Answer Using NLP and Machine Learning (Document, 7 pages)
The Ability of Word Embeddings To Capture Word Similarities (Document, 18 pages)
Automatic Generation of Cloze Items For Repeated Testing To Improve Reading Comprehension (Document, 12 pages)
Comparing Logistic Regression and Extreme Gradient Boosting On Student Arguments (Document, 10 pages)
Developing A Computer-Based Assessment of Complex Problem Solving in Chem (Document, 15 pages)
Bailey IwC07 (Document, 16 pages)
IJCRT24A4501 (Document, 4 pages)
Miyaoka Et Al 2023 Emergent Coding and Topic Modeling A Comparison of Two Qualitative Analysis Methods On Teacher Focus (Document, 9 pages)
Ackerman, R. D. (2014) - The Effect of Concrete Supplements On Metacognitive Regulation (Document, 21 pages)
On Task Completion (Document, 9 pages)
Semantic Textual Similarity With Siamese Neural Networks: Tharindu Ranasinghe, Constantin Orăsan and Ruslan Mitkov (Document, 8 pages)
Learning Outcome and Blooms Taxonomy (Document, 13 pages)
LDA Topic Model With Soft Assignment of Descriptors To Words (Document, 9 pages)
Algoritma Genetika Untuk Penjadwalan Matakuliah (Document, 14 pages)
On The Dimensionality of Word Embedding (Document, 18 pages)
(IJCST-V9I5P10): Alok Kumar, Aditi Kharadi, Deepika Singh, Mala Kumari (Document, 9 pages)
Information Processing and Management: Danushka Bollegala, Naoaki Okazaki, Mitsuru Ishizuka (Document, 21 pages)
Simple Unsupervised Keyphrase Extraction Using Sentence Embeddings (Document, 9 pages)
139 Zeinabaghahadi (Document, 6 pages)
ECS260 Project Progress Report (Document, 8 pages)
Learning Improved Class Vector For Multi-Class Question Type Classification (Document, 9 pages)
Bilingual Terminology Extraction Using Multi-Level Termhood: Chengzhi ZHANG, Dan WU (Document, 15 pages)
Cognitive Diagnostic Assessment (Document, 43 pages)
2021 Paclic-1 79 (Document, 10 pages)
Language Testing and Validation (Pages 17-21) (2024-07-02) (Document, 6 pages)
A Weighted Word Embedding Based Approach For Extractive (Document, 11 pages)
Subjective Answer Evaluator (Document, 7 pages)
111 1460444112 - 12-04-2016 PDF (Document, 7 pages)
Automatic MCQ Generation (Document, 4 pages)
Writing Objective Test Article by Nelly&Fina (Document, 8 pages)
03-Improving The Measurement of Cognitive Skills Through Automated Conversations (Document, 16 pages)
Arabic Text Categorization Algorithm Using Vector Evaluation Method (Document, 10 pages)
Rossi 2021 (Document, 7 pages)
Evaluating of Efficacy Semantic Similarity Methods (Document, 8 pages)
BERT-SE: A Pre-trained Language Representation Model for Software Engineering (Document, 17 pages)
Literature Review (Document, 4 pages)
High School Timetabling Using Tabu Search and Partial Feasibility Preserving Genetic Algorithm (Document, 8 pages)
Regularized Phrase-Based Topic Model For Automatic Question Classification With Domain-Agnostic Class Labels (Document, 13 pages)
Software Process Modeling Languages A Systematic Literature Review (Document, 5 pages)
02 - Klyuev - A Query Expansion Technique Using The EWC Se (Document, 6 pages)
On Term Frequency Factor in Supervised Term Weighting Schemes For Text Classification (Document, 16 pages)
2026 (Document, 5 pages)
Computer Literature Review (Document, 8 pages)
Query Reformulation Based On Word Embeddings (Document, 12 pages)
1225-Article Text-2315-1-10-20220818 (Document, 6 pages)
The Impact of Rule-Based Text Generation On The Quality of Abstractive Summaries (Document, 10 pages)
Biomedical Semantic Embeddings Using Hybrid Sentences To Construct Biomedical Word Embeddings and Its Applications (Document, 9 pages)
Automatic Grading of Answer Sheets Using Machine L (Document, 10 pages)
Garbacea 22 A (Document, 17 pages)
Seminar (Document, 9 pages)
Seminars in Orthodontics (Document, 12 pages)
Deep Learning with Python. Part 2 (From Everand)
The Essence of Language in Samuel Beckett's PDF (Document, 5 pages)
1-5 New (Document, 46 pages)
Mnemonic S (Document, 4 pages)
Standard Styles in Review of Related Literature, Citation, or Reference (Document, 15 pages)
Research Methods in Psychology (Document, 21 pages)
16 Skills - 7103021071 - Feby Caroline Wijaya (Document, 14 pages)
The Effects of Code-Mixing Among Béchar University Students in Learning EFL Case of Study - First Year Master Students of EFL (Document, 18 pages)
Teachinglec1 PDF (Document, 21 pages)
HCI (Document, 22 pages)
SPO Triage (Document, 4 pages)
English Orthography - Wikipedia (Document, 25 pages)
L Theory L Objects (Document, 15 pages)
The Indigenous Psychology (Document, 11 pages)
Bahasa Inggris Industri Kelas C (Document, 5 pages)
Scientific Method (Document, 41 pages)
Reading Year 8 Lesson Plan (Document, 7 pages)
Does Emotional Words Take Longer To Resp (Document, 10 pages)
Pragmatics - Report (Document, 12 pages)
(Essay) Olimpiade Sains Nasional Global Youth Action #3 Mata Pelajaran Bahasa Inggris Tingkat SMA - SMK - MA - Sederajat (Document, 2 pages)
Logic and Lang Structure Syllabus (Document, 2 pages)
Assessment and Evaluation of Learning Part 2 (Document, 16 pages)
Lesson Exemplar in UCSP Quarter 1 Week 6 (Document, 6 pages)
Rizal (Document, 9 pages)
Informed Consent Formlauradoerr1 (Document, 1 page)
BM M2 Merged (Document, 120 pages)
Untitled Document (Document, 3 pages)
Emily Alascio - Final Reflection4 (Document, 3 pages)
SOFT SKILLS - HR Finance - Gadjian (Document, 11 pages)