Natural Learning
INTRODUCTION
The rapidly growing popularity of large language models (LLMs) has taken the
AI community and general public by storm. This attention can lead people to
believe LLMs are the right solution for every problem. In reality, the question of
the usefulness of LLMs and how to adapt them to real-life tasks is an open one.
In education, it has been suggested that they can be exploited to boost learning
across different groups, such as elementary school children, middle and high
school students, and university students [8]. This is in line with a long-standing goal of
AI: developing conversational agents that can support teachers in guiding
children through reading material such as storybooks [9] [10].
Normally, when reading a text such as a children’s storybook, a teacher is expected to
guide the children through the text and periodically gauge their understanding
by posing questions from the text. The key concept in guided reading is the
ability to use questions to gauge understanding and encourage deeper thinking
about the material being read.
Such questions are commonly divided into low and high cognitive challenge
questions [11]. Low cognitive challenge questions are constrained to short answers,
while high cognitive challenge questions require explanations, evaluation or
extension of the text [11]. The use of questions to foster
understanding and learning from text is well established across a range of age
groups and learning contexts [11].
An emerging paradigm for text generation is to prompt (or ‘ask’) LLMs for a
desired output [5]. This works by feeding an input prompt or ‘query’ (with a
series of examples in a one- or few-shot setting) to an LLM. This paradigm has
inspired a new research direction called prompt engineering. One of the most
common approaches to prompt engineering involves prepending a string to the
context given to an LLM for generation [4]. For controllable text generation
(CTG), such a prefix must contain a control element, such as a keyword that
will guide the generation [5].
Questions are one of the most basic methods teachers use to educate. Because
questioning is such a broad method, many organizational taxonomies have been
proposed that divide questions into groups in different ways. One popular example
is Bloom’s taxonomy [3], which divides educational material into categories
based on students’ learning goals. Another example is a difficulty-level
taxonomy, which usually divides questions into three categories: easy, medium,
and hard [7]. By combining CTG and these question taxonomies, we open doors
for question generation by prompting LLMs to meet specifications of the
educational domain.
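As a concrete illustration, the following minimal sketch shows how a control element (here, a difficulty level from the difficulty taxonomy above) can be prepended to the context before prompting an LLM. The prompt wording and the call_llm placeholder are illustrative assumptions, not part of any cited system.

```python
# Minimal sketch of difficulty-controlled question generation via prompting.
# `call_llm` is a hypothetical placeholder for whatever LLM client is available.

DIFFICULTY_LEVELS = ("easy", "medium", "hard")  # difficulty-level taxonomy [7]

def build_prompt(story: str, difficulty: str, n_questions: int = 3) -> str:
    """Prepend the control element (a difficulty keyword) to the context."""
    if difficulty not in DIFFICULTY_LEVELS:
        raise ValueError(f"unknown difficulty: {difficulty}")
    return (
        f"Difficulty: {difficulty}\n"
        f"Story:\n{story}\n\n"
        f"Generate {n_questions} {difficulty} comprehension questions about "
        f"the story, each with a short answer."
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API or a local model here")

if __name__ == "__main__":
    story = "Once upon a time, a lonely king searched for a wise queen..."
    print(build_prompt(story, "hard"))
```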
For LLMs to be viewed as potential support agents for teachers, or even as stand-
alone tools that can help in guided reading, they must be able to: generate
meaningful questions and answers from the text, generate questions that are
diverse both in content coverage and in difficulty, and identify students'
support needs. This study investigates the use of ChatGPT in the generation of
questions for students.
LITERATURE REVIEW
LLMs are part of the family of generative models (Ingraham et al., 2019), which
means they can generate new text based on the patterns and structures learned
from the data used to train them. LLMs have many applications, including NLP,
conversational AI, text generation, machine translation, sentiment analysis, and
content creation. They can be used in various industries, such as healthcare,
finance, customer service, marketing, and entertainment, to automate tasks,
provide insights, and improve user experiences. One of the key features of
LLMs is their ability to understand text and generate it conversationally (Kasneci
et al., 2023). They can engage in interactive and dynamic conversations with
users, respond to queries, provide information, and generate relevant and
coherent responses. LLMs are trained to understand context, tone, and style,
making them capable of generating text that closely resembles human
conversation. The development of LLMs can be considered among the greatest
scientific advancements or breakthroughs in AI.
Currently, LLMs are mainly built upon the Transformer architecture [22], where
multi-head attention layers are stacked in a very deep neural network. Existing
LLMs adopt similar Transformer architectures and pre-training objectives (e.g.,
language modeling) as small language models. However, LLMs significantly
extend the model size, data size, and total compute (by orders of magnitude).
Extensive research has shown that scaling can largely improve the model
capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantitative
approach to characterizing the scaling effect. Next, we introduce two
representative scaling laws for Transformer language models [30, 34].
KM scaling law: In 2020, Kaplan et al. [30] (the OpenAI team) first proposed
to model the power-law relationship of model performance with respect to
three major factors, namely model size (N), dataset size (D), and the amount of
training compute (C), for neural language models. Given a compute budget c,
they empirically presented three basic formulas for the scaling law:
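In their standard form, with the fitted constants reported by Kaplan et al. [30], the three power laws can be written as

L(N) = (N_c / N)^(α_N), with α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13 (non-embedding parameters),
L(D) = (D_c / D)^(α_D), with α_D ≈ 0.095 and D_c ≈ 5.4 × 10^13 (tokens),
L(C) = (C_c / C)^(α_C), with α_C ≈ 0.050 and C_c ≈ 3.1 × 10^8 (PF-days),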
where L(·) denotes the cross entropy loss in nats. The three laws were derived
by fitting the model performance with varied data sizes (22M to 23B tokens),
model sizes (768M to 1.5B non-embedding parameters) and training compute,
under some assumptions (e.g., the analysis of one factor should not be
bottlenecked by the other two factors). They showed that model
performance has a strong dependence on the three factors.
Chinchilla scaling law: Hoffmann et al. [34] (the DeepMind team) proposed an
alternative form of the scaling law for compute-optimal training,
L(N, D) = E + A/N^α + B/D^β,
where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and β = 0.28. By optimizing the
loss L(N,D) under the constraint C ≈ 6ND, they showed that the optimal
allocation of compute budget to model size and data size can be derived as
follows:
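Following [34], the optimal allocation under a compute budget C takes the form

N_opt(C) = G · (C/6)^a,    D_opt(C) = G^(−1) · (C/6)^b,

where a = α/(α + β), b = β/(α + β), and G is a scaling coefficient computed from A, B, α and β.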
In the literature [31], emergent abilities of LLMs are formally defined as “the
abilities that are not present in small models but arise in large models”, which is
one of the most prominent features that distinguish LLMs from previous
pre-trained language models (PLMs). The literature further notes a characteristic
pattern in when emergent abilities occur [31]:
performance rises significantly above random when the scale reaches a certain
level. By analogy, such an emergent pattern has close connections with the
phenomenon of phase transition in physics [31, 58]. In principle, emergent
abilities can be defined in relation to some complex tasks [31, 59], while we are
more concerned with general abilities that can be applied to solve a variety of
tasks. Here, we briefly introduce three typical emergent abilities for LLMs and
representative models that possess such an ability.
benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller
size might suffice for some specific tasks (e.g., MMLU).
LLMs have come a long way to reach their current state as general and capable
learners. In this development process, a number of important techniques have been
proposed that largely improve the capacity of LLMs. Here, we briefly list
several important techniques that (potentially) contributed to the success of LLMs, as
follows.
Scaling: As discussed above, scaling model size, data size, and compute largely
improves model capacity, and scaling laws can be used to conduct a
more compute-efficient allocation of the compute resources. For example,
Chinchilla (with more training tokens) outperforms its counterpart model
Gopher (with a larger model size) by increasing the data scale under the same
compute budget [34]. In addition, data scaling should be accompanied by a careful
cleaning process, since the quality of pre-training data plays a key role in the model
capacity.
Training: Due to the huge model size, it is very challenging to successfully
train a capable LLM. Distributed training algorithms are needed to learn the
network parameters of LLMs, in which various parallel strategies are often
jointly utilized. To support distributed training, several optimization frameworks
have been released to facilitate the implementation and deployment of parallel
algorithms, such as DeepSpeed [65] and Megatron-LM [66–68]. Optimization
tricks are also important for training stability and model performance, e.g.,
restarting training to overcome loss spikes [56] and mixed-precision training [69].
More recently, the GPT-4 report [46] proposed developing special infrastructure
and optimization methods that reliably predict the performance of large models
from much smaller models.
Alignment tuning: Since LLMs are trained to capture the data characteristics of
pre-training corpora (including both high-quality and low-quality data), they are
likely to generate toxic, biased, or even harmful content for humans. It is
necessary to align LLMs with human values, e.g., helpful, honest, and harmless.
For this purpose, InstructGPT [61] designs an effective tuning approach that
enables LLMs to follow the expected instructions, utilizing the technique
of reinforcement learning with human feedback [61, 70]. It incorporates humans
into the training loop with elaborately designed labeling strategies. ChatGPT is
developed with a similar technique to InstructGPT and shows a strong
alignment capacity, producing high-quality, harmless responses, e.g.,
declining to answer insulting questions.
In addition, many other factors (e.g., the upgrade of hardware) also contribute to
the success of LLMs. Here, we limit our discussion to the major technical
approaches and key findings for developing LLMs.
Test Purpose Determination/Specification
Test purpose determination is the foremost step in the test development cycle. It
involves identifying relevant educational issues that need to be addressed or key
areas that require producing new knowledge or modifying existing ones. To
Joshua (2012), some of the main purposes of testing revolve around evaluating
teachers’ effectiveness and students’ motivation, judging students’ learning
proficiency, their acquisition of essential skills and knowledge, diagnosing
students’ learning difficulties, ranking students’ learning achievement, and
measuring their growth over time. Since the purpose of a test is derived from the
course content or subject (Joshua, 2012), LLMs can be useful in the test
specification of purpose by analyzing the course content and identifying key
topics or concepts that need to be assessed. For instance, an LLM can analyze a
large amount of text data related to a specific course or subject, identify main
themes and concepts, and suggest appropriate test items that accurately measure
students’ understanding of those concepts.
Moreover, LLMs can be used to create adaptive tests that adjust the difficulty
level of questions based on students’ responses. This can ensure that students
are challenged appropriately and that the test accurately measures their
knowledge and skills. LLMs can also generate test items that align with specific
learning objectives and outcomes. For example, an LLM can analyze the course
content, identify the key skills or knowledge students are expected to acquire,
and generate test items that align with those objectives. LLMs can also help
determine the most appropriate testing method to address these purposes and
generate test items that align with each purpose. For example, LLMs can
analyze large amounts of text data related to students’ learning difficulties and
suggest test items that can diagnose those difficulties.
Similarly, LLMs can analyze student growth data over time and suggest test
items that accurately measure that growth. Furthermore, LLMs can help to
ensure that test items are valid, reliable, and relevant to the educational issues
being addressed. By analyzing large amounts of data related to student learning
and educational issues, LLMs can suggest appropriate test items that accurately
measure students’ knowledge and skills. Additionally, LLMs can help to ensure
that test items are fair and unbiased, which is essential in ensuring that the test
results accurately reflect students’ knowledge and skills.
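As a rough illustration of the adaptive-testing idea described above, the sketch below adjusts the difficulty of the next question from the correctness of previous responses. The three levels and the simple up/down rule are illustrative assumptions, not a prescribed adaptive-testing algorithm.

```python
# Minimal sketch of difficulty adaptation based on a student's responses.
# The levels and the simple up/down rule are illustrative assumptions.

LEVELS = ["easy", "medium", "hard"]

def next_difficulty(current: str, last_answer_correct: bool) -> str:
    """Move one level up after a correct answer, one level down otherwise."""
    i = LEVELS.index(current)
    i = min(i + 1, len(LEVELS) - 1) if last_answer_correct else max(i - 1, 0)
    return LEVELS[i]

# Example: a student answers correctly twice, then incorrectly once.
level = "easy"
for correct in (True, True, False):
    level = next_difficulty(level, correct)
print(level)  # -> "medium"
```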
LLMs can also help determine the appropriate weightage or distribution of these items in the test.
This can help ensure the test is aligned with the objectives and intended
construct.
Test Item Development
Test item development involves translating the course contents into test items or
questions that will stimulate the learners and elicit the required behaviour
specified in the instructional objectives of the course (Joshua, 2012). Test items
can be broadly classified into two categories:
(1) objective items (highly structured items that have a clear and specific correct
answer, often in the form of multiple-choice, true/false, or matching questions)
and
(2) essay items (open-ended question that requires the test-taker to provide a
written response that demonstrates their understanding of a topic, their ability to
articulate ideas clearly and coherently, and often their ability to analyze,
synthesize, and evaluate information).
During the test item generation phase, a large pool of items is expected to be
gathered from relevant sources, more than the number of items required for the
test. The initial item pools can be reviewed with the support of domain experts
or peers to identify relevant, clear, specific and unambiguous test items for
selection. Those that do not meet the criteria for selection can either be
strengthened or dropped.
LLMs can assist in creating relevant, clear, specific, and unambiguous test
items. These models can analyse course content and other relevant sources to
generate a large pool of potential test items. Domain experts or peers can then
review the generated items to identify suitable items for selection. LLMs can
also assist in developing objective test items such as multiple-choice, true/false,
or matching questions. These highly structured items have a clear and specific
correct answer that can be generated using LLMs. LLMs trained on relevant
texts can generate items that assess specific skills or knowledge areas. LLMs
can also generate distractors or incorrect options for multiple-choice questions,
ensuring they are plausible but incorrect. This can help in the creation of a
diverse and balanced item pool. These models can analyse the course content
and generate options that align with the instructional objectives. Moreover,
LLMs can also be useful in developing essay test items. These open-ended
questions require the test-taker to provide a written response demonstrating their
understanding of a topic, their ability to articulate ideas clearly and coherently,
and often their ability to analyse, synthesize, and evaluate information. LLMs
can assist in generating essay prompts that align with the instructional
objectives and are relevant to the course content.
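To make the item-generation step more concrete, the sketch below asks an LLM for multiple-choice items with plausible distractors in a structured format. The JSON schema, the prompt wording, and the validation of the reply are illustrative assumptions rather than a prescribed workflow.

```python
# Minimal sketch of prompting an LLM for multiple-choice items with distractors.
# The JSON schema and prompt wording are illustrative assumptions.
import json

def mcq_prompt(passage: str, objective: str, n_items: int = 2) -> str:
    return (
        f"Learning objective: {objective}\n"
        f"Passage:\n{passage}\n\n"
        f"Write {n_items} multiple-choice questions assessing this objective. "
        "Return JSON: a list of objects with keys 'question', 'correct_answer', "
        "and 'distractors' (three plausible but incorrect options each)."
    )

def parse_items(llm_output: str) -> list[dict]:
    """Parse the model's JSON reply and check that each item is complete."""
    items = json.loads(llm_output)
    required = {"question", "correct_answer", "distractors"}
    for item in items:
        missing = required - set(item)
        if missing:
            raise ValueError(f"item missing keys: {missing}")
    return items
```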
For objectivity in testing, simple, specific and clear test instructions must be
developed to guide the test administrator and the respondents. According to
Joshua (2012), the instructions for the testing procedures should explain why
they are necessary. They should also contain information about how to organize
the testing environment, distribute and collect test materials, manage time,
procedures for recording answers and deal with anticipated and unforeseen
inquiries. For the test takers, the instructions should include the test’s purpose,
the time allowed for the test, the basis for answering, expected ethical behaviors
(dos and don'ts), and the discipline to be applied for any breach of such rules.
LLMs can be extremely useful in the preparation of test instructions. These
models are designed to understand and process natural language, which makes
them ideal for tasks that require human-like language understanding and
processing. Regarding testing instructions, LLMs can help ensure that the
instructions are unambiguous. They can also help identify confusion or
misunderstanding caused by specific words or phrasing. Additionally, LLMs can
suggest alternative phrasing or wording that may be clearer or more easily
understood. Moreover, LLMs can help ensure the instructions are culturally
appropriate and sensitive to different audiences. They can identify potentially
offensive or insensitive language and suggest more appropriate alternatives.
LLMs can also help with the localization of test instructions. For example, if the
test is being translated into a different language, an LLM can help ensure that
the translated instructions accurately convey the intended meaning of the
original instructions.
Item Assembly/Selection
LLMs can be very useful in the test assembly process. Test assembly involves
selecting and organizing test items or questions to create a test that accurately
measures a specific construct or skill. By analysing relevant texts, LLMs can
assist in identifying the most relevant and appropriate items from the item pool
based on the test blueprint and objectives. This can help ensure that the test is
well-balanced, covers the intended construct, and is appropriate for the target
population. They can assist in this process by analysing the content of test items
and identifying potential issues with wording, phrasing, or cultural biases. For
example, an LLM can identify items with difficult vocabulary or syntax that
may be confusing to test-takers or items that use colloquial language that may
not be familiar to all test-takers. LLMs can analyse the coherence and
consistency of test items to ensure that they assess the intended construct or
skill fairly and validly. They can identify potential redundancies or overlaps in
the content of test items or identify items irrelevant to the intended construct or
skill. Furthermore, they can assist with translating test items into different
languages, ensuring that the translated items accurately convey the intended
meaning of the original items. LLMs can also assist in selecting and organizing
test items by using NLP techniques to identify relationships between items. For
example, an LLM can analyze the content of items and group them based on
shared themes or concepts. All of these can be done in just a matter of seconds.
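As a small sketch of grouping items by shared themes with NLP techniques, the snippet below embeds candidate items and clusters them. It assumes the sentence-transformers and scikit-learn packages; the model name, the example items, and the number of clusters are illustrative choices.

```python
# Minimal sketch of grouping test items by shared themes using text embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

items = [
    "What is the main idea of the first chapter?",
    "Define photosynthesis in your own words.",
    "Which organelle carries out photosynthesis?",
    "Summarize the author's argument in chapter one.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
embeddings = model.encode(items)

# Cluster the items so that related themes end up in the same group.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, item in sorted(zip(labels, items)):
    print(label, item)
```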
Test Administration
During test administration, LLM-based proctoring tools can be combined with
biometric technology to verify the test-takers' identity, eye-tracking technology to detect
unusual eye movements, and keystroke analysis to detect unusual typing
patterns. Furthermore, LLMs can help ensure the security of tests by providing
features such as password protection, encryption, and monitoring tools to
prevent cheating and unauthorized access to test content.
Test Scoring
Test scoring typically involves assigning a numerical value to a test-taker's
performance, usually based on the number of correct answers to objective
questions. However, it may also consider other factors such as partial credit,
essay responses, or subjective evaluations by the test scorer or teacher. Test
scoring is a common practice in education used to measure student knowledge,
understanding, and skill levels and provide feedback and guidance for further
learning.
LLMs can be useful for test scoring in several ways. They can be used to
automate the process of grading and scoring tests, which can save a significant
amount of time and effort for teachers and instructors. This is particularly useful
in cases where large-scale tests need to be graded quickly, such as in
standardized testing or online assessments. LLMs can provide more accurate
and consistent scoring of tests than human graders, as they are not subject to
biases or errors arising from fatigue, distraction, or personal preferences. They
can also be programmed to recognize and account for common mistakes or
misconceptions made by students, which can help to identify areas where
further teaching or support may be needed. LLMs can provide more detailed
feedback and analysis of test scores than traditional scoring methods. They can
be programmed to provide explanations or examples of correct answers and
highlight areas where a student may need to improve or focus more attention.
This can guide further learning and development and provide more personalized
and targeted support for individual students. LLMs trained on a large corpus of
text can be fine-tuned to provide scores for open-ended questions or essays
based on various criteria, such as content, language use, and organization.
Automated scoring with LLMs can provide quick and consistent results,
particularly for large-scale assessments.
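The sketch below illustrates one way such criterion-based scoring of an open-ended answer could be prompted. The rubric, the 0–4 scale, and the call_llm placeholder are illustrative assumptions, not a validated scoring scheme.

```python
# Minimal sketch of rubric-based scoring of an essay answer with an LLM.
# The rubric and the 0-4 scale are illustrative assumptions.

RUBRIC = {
    "content": "accuracy and completeness of the answer",
    "language use": "grammar, vocabulary and clarity",
    "organization": "logical structure and coherence",
}

def scoring_prompt(question: str, answer: str) -> str:
    criteria = "\n".join(f"- {name}: {desc} (score 0-4)"
                         for name, desc in RUBRIC.items())
    return (
        f"Question: {question}\n"
        f"Student answer: {answer}\n\n"
        "Score the answer on each criterion below and briefly justify each score:\n"
        f"{criteria}"
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API or a local model here")
```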
Interpreting test results means analysing and making sense of the scores or
outcomes obtained from a test or assessment. The interpretation of test results is
an important part of the testing process. It allows for meaningful conclusions to
be drawn about a student’s knowledge, skills, and abilities and can inform
decisions about teaching, learning, and further assessment. The interpretation of
test results typically involves comparing the scores obtained by an individual or
group of individuals to established norms or standards, such as the scores of
other students in the same grade or subject area, or to pre-determined criteria for
proficiency or mastery. This comparison can help identify areas of strength and
weakness and provide insight into the overall level of knowledge or
achievement of the test taker(s). In addition to comparing scores to established
norms or standards, interpreting test results may involve examining patterns or
trends in the scores over time or comparing scores on different tests or subtests
within a larger assessment. This can help identify areas where further learning
or intervention may be needed and track progress or improvement.
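A small worked sketch of this kind of norm-referenced comparison is shown below; the scores are made-up illustrative numbers, and the z-score and percentile rank are the two standard quantities being computed.

```python
# Minimal sketch of norm-referenced interpretation: comparing one student's
# score to the distribution of a reference group (all numbers are illustrative).
from statistics import mean, pstdev

group_scores = [55, 62, 70, 71, 74, 78, 80, 83, 88, 91]
student_score = 83

mu, sigma = mean(group_scores), pstdev(group_scores)
z = (student_score - mu) / sigma
percentile = 100 * sum(s <= student_score for s in group_scores) / len(group_scores)

print(f"z-score: {z:.2f}, percentile rank: {percentile:.0f}%")
```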
LLMs can aid in interpreting assessment results by providing insights into the
meaning and implications of the data. LLMs can be used to analyse and
interpret test scores, performance levels, and other assessment outcomes in the
context of the test objectives, standards, and criteria. This can help educators
and policymakers make informed decisions about the performance of
individuals or groups of test takers. Apart from their ability to automate the
interpretation of a large volume of test results in record time, LLMs can provide
a more detailed and nuanced analysis of test results than traditional methods.
They can be programmed to recognize patterns or trends in the scores that may
not be immediately apparent to human graders or analysts. They can also
identify relationships between test items or subtests relevant to further teaching
or assessment. LLMs can provide personalized feedback and recommendations
based on individual test results, which can help to guide further learning and
development. For example, an LLM could analyse a student’s test results and
19
provide recommendations for specific areas, where they may need to focus
more, or practice based on their strengths and weaknesses.
Test Analysis/Appraisal
Test analysis refers to the process of examining the results of a test in order to
extract meaningful information about the performance of the test-takers. Test
analysis aims to identify areas of strength and weakness and provide insights
into the test-takers' overall level of knowledge or achievement. The test analysis
process typically involves gathering the scores and other relevant information
from the test and organizing them for easy analysis; computing descriptive
statistics such as the mean, standard deviation, and frequency distributions to
summarize the test scores and provide an overview of their distribution;
examining item-level performance by analysing individual test items to identify
items that were particularly difficult or easy for test-takers, or that may have
been ambiguous or unclear; and identifying patterns or trends in the test scores
across different test-taker subgroups (e.g., gender, ethnicity, or age) and over
time (e.g., comparing scores from different test administrations). Based on the
findings of the test analysis, conclusions can
be drawn about the performance of the test-takers, and recommendations
for further teaching or assessment can be made. Various stakeholders, including test
developers, educators, and policymakers, may conduct test appraisals. The
process typically involves using established criteria or standards to evaluate the
test, such as those outlined by the American Psychological Association or the National
Council on Measurement in Education. The results of a test appraisal may be
used to inform decisions about test selection, interpretation, and use, as well as
to guide improvements in test development and administration processes.
LLMs have the potential to be valuable tools in test appraisals. LLMs can be
employed to analyse assessment data, including item statistics, item difficulty,
discrimination indices, and other performance metrics. LLMs can help identify
patterns, trends, and anomalies in the data and provide insights into the overall
performance of test takers and the quality of test items. LLMs can be useful in
providing detailed feedback to students, highlighting areas where they need
improvement or providing explanations for correct answers. The analytics
generated by LLMs can provide insights into student strengths and weaknesses,
highlight areas where additional instruction may be needed, and help teachers
and administrators make informed decisions about instruction and resource
allocation. LLMs can also identify potential sources of confusion or
misunderstanding in test questions, such as words or phrases with multiple meanings.
This information can be used to revise questions or provide additional
clarification to ensure that all students have a fair and accurate understanding of
what is being asked. LLMs can also be used in item analysis. Item analysis
involves analyzing the performance of individual test items to identify areas
where they may be flawed or ineffective. LLMs can use student responses to
specific items to highlight areas where the item may be too difficult, too easy,
or poorly worded. LLMs can also identify patterns in student responses to
specific types of items, such as multiple-choice or essay questions. This
information can inform decisions about the design and format of future
assessments and ensure that assessments are as effective and fair as possible.
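For concreteness, the sketch below computes two classical item statistics mentioned above, item difficulty and a point-biserial discrimination index, from a small made-up 0/1 response matrix (rows are students, columns are items).

```python
# Minimal sketch of classical item analysis on dichotomously scored items.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

difficulty = responses.mean(axis=0)   # proportion correct per item (p-value)
total = responses.sum(axis=1)         # each student's total score

# Point-biserial discrimination: correlation of each item with the total score.
discrimination = np.array([
    np.corrcoef(responses[:, j], total)[0, 1] for j in range(responses.shape[1])
])

for j, (p, r) in enumerate(zip(difficulty, discrimination)):
    print(f"item {j}: difficulty={p:.2f}, discrimination={r:.2f}")
```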
Reporting
Reporting involves communicating test results to students, parents, teachers, and
other stakeholders, for example as scores or levels of proficiency in a particular
skill or subject area. Reporting may also include
information about how students performed on different test items, or their
performance compared to other students in their class or school. Reporting may
also involve providing feedback to students and teachers about areas where
students performed well and where they may need additional support or
instruction. This feedback can inform instruction and help students improve
their performance on future assessments. In addition to communicating the
results of a test, reporting may also include information about the validity and
reliability of the test. This information can be used to evaluate the quality of the
test and ensure that it provides accurate and useful information about student
performance.
LLMs can assist in the generation of test reports. By analysing the assessment
data, LLMs can generate comprehensive reports summarizing the test results,
including descriptive statistics, performance profiles, and graphical
representations. LLMs can also generate interpretive reports that provide
insights and recommendations based on the test results. These reports can be
used by educators, policymakers, and other stakeholders for decision-making
and planning purposes.
Related work
Question generation tools take an input text and generate meaningful questions
from it. Existing question generation tools can be categorised into two groups,
i.e., rule-based tools and neural-based tools [12]. Rule-based tools such as [13]
and [14] exploit manually crafted rules to extract questions from text.
Neural-based techniques implement an end-to-end architecture that follows an
attention-based sequence-to-sequence framework [15].
The sequence-to-sequence frameworks are mainly composed of two key parts,
i.e., the encoder, which learns a joint representation of the input text, and the
decoder, which generates the questions [12]. Currently, both the joint
representation learning and the question generation are implemented with
attention-based frameworks [16].
The idea of utilizing AI systems for educational tasks is not a recent one.
Discussions of algorithmically generated learning materials date back to the
1970s [11]. However, the rapid growth of generative AI in the fields of natural
language processing (NLP) and computer vision has opened a wide array of
uses, both as a tool for in-class and for supplementary instruction, accelerating
the discourse in recent years. An emerging body of research has investigated the
effectiveness of such tools in the classroom and demonstrates their enormous
potential for automatic question generation and direct interaction with LLMs.
Prior reviews primarily focused on outlining potential applications of LLMs in
education and highlighting the need for additional literacy among both students
and educators to better understand the technology, such as Kasneci et al. (2023)
[25]. The authors highlight future concerns such as the potential for student
over-reliance on models to erode critical-thinking and problem-solving skills.
These important considerations should indeed guide specific implementations
of LLMs and algorithmically generated content in learning materials.
While much of this research is still in relative infancy with limited empirical
study [22], we give a brief outline of work done to date on applying AI and NLP
systems in education, as well as directions of continuing research.
The growing body of work in this field has found generally positive results in
the ability of LLMs to produce useful learning materials and serve as fruitful
conversational agents with learners [37] [23]. A significant virtue of delivering
instruction via interaction is that such tools bring elements of personalized
interaction to otherwise remote learning activities.
This allows for striking what Vie et al. (2017) describe as “a better balance
between giving learners what they need to learn (i.e. adaptivity) and giving
them what they want to learn (i.e. adaptability).” [57] In short, NLP tools like
GPT-3 and its relatives help to alleviate the top-down nature of traditional
approaches to remote student work.
Incorporating open-ended conversations and responses to prompts generated by
chatbots is one such application toward this end that has received substantial
study. Steuer et al. (2021) found automatically generated questions to be relevant
to their intended topics, free of language errors, and to contain natural and
easily-comprehensible language in a variety of domains using their
autoregressive language model [52]. Additionally, their generated questions
successfully addressed central concepts of their training texts and topics, which
the authors describe as pedagogical “coreness”. This suggests that the produced
tasks were indeed pedagogically useful within their subjects and contexts.
Though useful questions are essential, assessing how students respond to and
interact with them is also needed. Abdelghani et al. (2023) compared question-
asking behavior among primary school students after utilizing the prompt-
based learning of GPT-3 to directly automate elements of course content [1].
This was a particularly encouraging result, since it featured a more open-ended
interaction structure and a greater focus on student responses than Steuer et al.
(2021), giving some indication of how prior results might generalize to LLMs
applied to an even wider range of possible tasks. Overall, their results suggest
that such automated prompts generally elicited positive responses from students
and show potential for increasing curiosity and feelings of agency in their
learning.
While these results present highly encouraging paths forward, it is important to
consider the limited scope of much of the research conducted to date. Though
the studies discussed above involved some degree of open-ended interaction,
they were largely limited to providing prompts or keywords within a narrow
task framework. Fully open-ended, chatbot-style conversation for pedagogical
use has yet to receive specific attention.
This work suggests a strong potential for generative AI to enhance not only the
instructional content being delivered, but also the mode of delivery itself, in ways
that promote motivated and curious engagement from students across age and
ability spectra. This is a particularly intriguing area of research, since these
early results align well with the findings of We et al. (2020) that AI interaction
reduces some of the prominent downsides of loneliness in online learning
environments for students. Future work should seek to combine LLM
interactions with GAN-created animations, allowing learning content
to be enjoyable and highly interactive for younger students as well.
Metrics
Each annotator was trained to assess the generated candidates on two of four
quality metrics, as well as a usefulness metric. This division was done to reduce
the cognitive load on an individual annotator. The quality metrics are: relevance
(a binary variable indicating whether the question is related to the context),
adherence (a binary variable indicating whether the question is an instance of
the desired question taxonomy level), grammar (a binary variable indicating
whether the question is grammatically correct), and answerability (a binary
variable indicating whether there is a text span from the context that is an
answer or leads to one). The
relevance, grammar, answerability, and adherence metrics are binary as they
are objective measures, often seen in QG literature to assess typical failures of
LLMs such as hallucinations or malformed outputs [5]. The subjective metric
assessed, the usefulness metric, is rated on a scale because it is more nuanced.
This is defined by a teacher’s answer to the question: “Assume you wanted to
teach about context X. Do you think candidate Y would be useful in a lesson,
homework, quiz, etc.?” This ordinal metric has the following four categories:
not useful, useful with major edits (taking more than a minute), useful with
minor edits (taking less than a minute), and useful with no edits. If a teacher
rates a question as not useful or useful with major edits, we also ask them to
select from a list of reasons why (or write their own).
LLMs must demonstrate an in-depth understanding of a given input text before
they can be deployed to generate questions from it. One indicator of
comprehension of the input text is the ability to generate meaningful questions
and answers from it. Perhaps one of the greatest strengths of LLMs such as
ChatGPT is the ability to respond to questions posed to them on the fly.
However, it is unclear to what extent they can connect different aspects of the
input text to generate both low cognitive challenge questions and questions that
require inference to be answered (i.e., high cognitive challenge questions).
Moreover, how will they exploit the vast amount of knowledge acquired during
training to boost their question-asking ability? Furthermore, will they be able to
generate answers from the input text without being “confused” by their internal
knowledge? Evaluating the ability of LLMs to ask meaningful questions that can
be answered solely from the input story is therefore of interest, as is evaluating
how accurately LLMs answer the questions when relying solely on their
understanding of the input text. To do this, a given LLM is prompted to
generate questions based on an input text. The generated questions
are automatically evaluated by comparing their semantic similarity to a set of
baseline questions.
Specifically, we concatenate the generated questions into one sentence and
compare it with similarly concatenated reference questions (see [10], which uses
a similar approach).
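A minimal sketch of this comparison, using the bert-score package as an illustrative choice of semantic-similarity measure and two made-up question sets, is:

```python
# Minimal sketch: concatenate generated and reference questions, then compare
# the two strings with BERTScore. The example questions are illustrative.
from bert_score import score

generated = ["Who was the king?", "Why was the queen sad?"]
reference = ["Who ruled the kingdom?", "What made the queen unhappy?"]

cand = [" ".join(generated)]
ref = [" ".join(reference)]

P, R, F1 = score(cand, ref, lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.3f}")
```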
Question diversity
To test the student’s understanding of a given text being read, questions must be
generated that cover nearly all the sections of the story. We are interested in
evaluating the ability of ChatGPT and Bard to generate questions that are not
biased towards a given section of the text. Concretely, we are seeking to
quantify the variation of the questions being generated by LLMs. We
hypothesize that the more diverse the questions are, the more exhaustively they
cover the different topics in the input text. This will give us an idea of the
suitability of LLMs for generating questions that cover the whole content being
read. In machine learning, several entropy-based techniques have been proposed
to evaluate the diversity of a dataset. The work in [26] proposes the Inception
Score (IS) to evaluate the synthetic samples generated by a generative
adversarial network (GAN), G [27]. The intuition behind IS is that the
conditional label distribution p(y|x) of images generated by the GAN should
have low entropy (i.e., each generated image clearly belongs to a few classes),
while the marginal distribution ∫ p(y|x = G(z)) dz should have high entropy
(i.e., the generated images are spread across many classes). To capture this
intuition, they suggest the metric in equation 1.
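In its standard form, the metric in equation 1 is the Inception Score

IS(G) = exp( E_{x∼p_g} [ D_KL( p(y|x) ‖ p(y) ) ] ),                    (1)

where p_g denotes the generator’s distribution and p(y) = ∫ p(y|x = G(z)) dz is the marginal label distribution.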
A related metric, the Fréchet Inception Distance (FID), assumes that the data
embeddings follow a multi-dimensional Gaussian, which has the maximum-entropy
distribution for a given mean and covariance. If an embedding model f(x)
produces embeddings with mean and covariance (m, C) for the synthetic data
p(·) and (m_w, C_w) for the real data p_w(·), then FID is defined as:
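In its standard form, this distance between the two Gaussian embedding distributions is

FID = ‖m − m_w‖² + Tr( C + C_w − 2 (C C_w)^(1/2) ).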
Other metrics that have been proposed to evaluate diversity include precision
and recall metrics [29] [30]. One major problem with these metrics is that they
assume the existence of reference samples to which the generated samples can be
compared [30]. In our case, we seek to evaluate the diversity of the questions
generated without comparing to any reference questions. To achieve this, we use
Vendi score (VS), a metric proposed for diversity evaluation in the absence of
reference data. For a set of samples X = {x_1, ..., x_n} with an n × n similarity matrix K, VS is defined as

VS(X) = exp( −∑_{i=1}^{n} λ_i log λ_i ),

where λ_1, ..., λ_n are the eigenvalues of K/n. To compute the VS of the generated questions, we proceed as follows:
1. Prompt the LLM to generate a set of questions Q given an input text.
3. Pass the sets Q1 and Q2 through BERTScore to extract the cosine
similarity matrix K (see the sketch below).
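A minimal sketch of this computation from a similarity matrix K (here a small illustrative matrix standing in for the BERTScore cosine similarity matrix of the generated questions) is:

```python
# Minimal sketch of computing the Vendi Score (VS) from a pairwise similarity
# matrix K of generated questions. The 3x3 matrix below is an illustrative
# stand-in for the BERTScore cosine similarity matrix.
import numpy as np

def vendi_score(K: np.ndarray) -> float:
    """VS(X) = exp(-sum_i lambda_i log lambda_i), lambda_i eigenvalues of K/n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]          # drop numerically zero values
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

K = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
print(f"Vendi Score: {vendi_score(K):.3f}")     # ranges from 1 (identical) to n
```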
On top of generating questions that cover the whole content, it is desirable for
LLMs to generate a wide range of questions, from low to high cognitive challenge
questions. This will require students to answer questions that address all the
cognitive domains. Low cognitive challenge questions mostly require short answers
involving affirmation or negation (e.g., “Did the queen have golden hair?”).
Conversely, high cognitive challenge questions require explanations, evaluations
and some speculation extending the text [11] (e.g., “Why did the king’s advisors fail
to find the right wife for him?”). The purpose of low cognitive challenge
questions is to evaluate the basic understanding of the text by the students and
ease them into the study interaction [11]. However, they have the potential of
promoting over-dominance of teachers. On the other hand, high cognitive
questions foster greater engagement of the students, generate inferential
responses and promote better comprehension of the reading material [11]. In
[11], the two categories of questions are differentiated by exploiting the
syntactic structure of the question. The high cognitive challenge questions are
signalled by wh-words, i.e., questions that use words such as what, why, how, or
when. These questions themselves span a continuum from high to low challenge.
Specifically, wh-pronoun and wh-determiner questions that start with who, whom,
whoever, what, which, whose, whichever and whatever require low-challenge
literal responses. However, wh-adverb questions such as how, why, where, and
when are more challenging since they require more
abstract and inferential responses involving explanation of causation and
evaluation (e.g., “Why was the king sad?”; “How did the king’s daughter
receive the news of her marriage?”). The low cognitive challenge questions are
generally non wh-word questions. It has been suggested in [21][32] that
teachers should generally seek to ask high challenge questions as opposed to
low challenge questions whenever possible. In our case, we seek to establish the
types of questions preferred by the two LLMs based on their level of challenge.
To evaluate this, we adopt three categories of questions, i.e., confirmative,
explicit and implicit. Explicit questions are non-confirmative questions that
require low-challenge literal responses, i.e., answers that can be retrieved from
the text without much inference, while implicit questions require inferential
responses.
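A rough heuristic sketch of this syntactic categorization is shown below; the keyword lists follow the wh-word cues described above, but the exact rules are illustrative assumptions rather than a validated classifier.

```python
# Rough heuristic sketch: label questions as confirmative, explicit, or implicit
# from their leading word. The keyword lists are illustrative assumptions.

EXPLICIT_STARTS = {"who", "whom", "whoever", "what", "which",
                   "whose", "whichever", "whatever"}
IMPLICIT_STARTS = {"how", "why", "where", "when"}

def question_type(question: str) -> str:
    first = question.strip().split()[0].lower()
    if first in IMPLICIT_STARTS:
        return "implicit"      # requires abstract / inferential responses
    if first in EXPLICIT_STARTS:
        return "explicit"      # low-challenge literal responses
    return "confirmative"      # e.g. yes/no questions such as "Did ... ?"

for q in ["Why was the king sad?",
          "Who was the king's advisor?",
          "Did the queen have golden hair?"]:
    print(question_type(q), "-", q)
```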
Based on students’ responses to the teacher’s questions, the teacher can detect
their weaknesses and needs [33]. Since it was difficult to design an evaluation of
whether LLMs can uncover students’ weaknesses based on their responses, we
instead evaluate their ability to recommend the part of the text that the student
needs to re-read, based on the responses provided to the questions. Basically, we
evaluate the ability of an LLM to detect the part of the text that the student did
not understand. This, we believe, plays some part in diagnosing a student’s needs.
Generating questions from a given text is a challenging task in natural language
processing. There are several approaches and models that can be used for
question generation. Here are a few notable ones:
1. Rule-Based Approaches:
4. Transformer Models: Models like GPT-3 and BERT can be fine-tuned
for question generation by training them on a specific dataset.
5. Hybrid Models:
When choosing a model for question generation, factors such as the available
dataset, computational resources, and the specific requirements of the task
should be considered. The choice may also depend on whether the focus is on
extractive or abstractive question generation, where extractive involves
selecting portions of the input text as questions, and abstractive involves
generating questions in a more creative, paraphrased manner.
Model Training/Fine-Tuning
Hyper-parameter Tuning