
GENERATION OF EDUCATIONAL QUIZZES WITH LLMS

INTRODUCTION

The rapidly growing popularity of large language models (LLMs) has taken the
AI community and general public by storm. This attention can lead people to
believe LLMs are the right solution for every problem. In reality, the question of
the usefulness of LLMs and how to adapt them to real-life tasks is an open one.

Recent advances in natural language processing are exemplified by large language models (LLMs) such as GPT-3 [1], PaLM [2], Galactica [3] and LLaMA [4]. These models have been trained on large amounts of text data and are able to answer questions, generate coherent text and complete most language-related tasks. LLMs have been touted to have impact in domains such as climate science [5], health [6] and education [7].

In education, it has been suggested that LLMs can be exploited to boost learning across different groups, such as elementary school children, middle and high school students, and university students [8]. This is in line with a long-standing goal of AI: to develop conversational agents that can support teachers in guiding children through reading material such as storybooks [9] [10]. Normally, when reading a text such as a children's storybook, a teacher is expected to guide the children through the text and periodically gauge their understanding by posing questions about the text. The key concept in guided reading is the ability to use questions to gauge understanding and encourage deeper thinking about the material being read.

In teacher-led guided reading, apart from gauging understanding, questions can be used to identify children's support needs and enable the teacher to direct attention to the critical content. To achieve the full benefits of guided reading, the teacher is expected to ask a wide variety of questions, ranging from low to high cognitive challenge questions [11]. Low cognitive challenge questions are constrained to short answers, while high cognitive challenge questions require explanations, evaluation or extension of the text [11]. The use of questions to foster understanding and learning from text is well established across a range of age groups and learning contexts [11].

An emerging paradigm for text generation is to prompt (or 'ask') LLMs for a desired output [5]. This works by feeding an input prompt or 'query' (with a series of examples in a one- or few-shot setting) to an LLM. This paradigm has inspired a new research direction called prompt engineering. One of the most common approaches to prompt engineering involves prepending a string to the context given to an LLM for generation [4]. For controllable text generation (CTG), such a prefix must contain a control element, such as a keyword that will guide the generation [5].

A robust question generation (QG) system has the potential to empower teachers by decreasing their cognitive load while creating teaching material. It could allow them to easily generate personalized content that fills the needs of different students by adapting questions to Bloom's taxonomy levels (i.e., learning goals) or difficulty levels. Already, interested teachers report large efficiency gains from using LLMs to generate questions [1, 8].

Questions are one of the most basic tools teachers use to educate. Because questioning is so broad, many organizational taxonomies have been proposed that divide questions into groups in different ways. One popular example is Bloom's taxonomy [3], which divides educational material into categories based on students' learning goals. Another example is a difficulty-level taxonomy, which usually divides questions into three categories: easy, medium, and hard [7]. By combining CTG with these question taxonomies, we open doors for question generation by prompting LLMs to meet the specifications of the educational domain.

It is therefore of interest to explore the use and effectiveness of LLMs in performing the tasks involved in the generation of questions.

Definition of the Main Goal of the Project

For LLMs to be viewed as potential support agents for teachers, or even as stand-alone tools that can help in guided reading, they must be able to: generate meaningful questions and answers from the text, generate questions that are diverse both in content coverage and in difficulty, and identify the support needs of the students. This study will investigate the use of ChatGPT 3 in the creation of a model for the generation of questions for students.

LITERATURE REVIEW

Large Language Models (LLMs)

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters and are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). LLMs are a specialized form of AI that is purpose-built for comprehending, generating, and manipulating human language. By leveraging NLP and machine learning principles, LLMs are designed to process and interpret vast quantities of text data (Dergaa et al., 2023). The "large" in their name refers to the massive datasets they are trained on and the numerous parameters they possess, which enable them to grasp the subtle nuances and intricacies of human language. LLMs can generate human-like text and are designed to understand and generate text in a contextually relevant and coherent way.

LLMs are part of the family of generative models (Ingraham et al., 2019), which
means they can generate new text based on the patterns and structures learned
from the data used to train them. LLMs have many applications, including NLP,
conversational AI, text generation, machine translation, sentiment analysis, and
content creation. They can be used in various industries, such as healthcare,
finance, customer service, marketing, and entertainment, to automate tasks,
provide insights, and improve user experiences. One of the key features of
LLMs is their ability to understand and conversationally generate text (Kasneci
et al., 2023). They can engage in interactive and dynamic conversations with
users, respond to queries, provide information, and generate relevant and
coherent responses. LLMs are trained to understand context, tone, and style,
making them capable of generating text that closely mimics human-like
conversation. The development of LLMs can be considered among the greatest
scientific advancements or breakthroughs in AI.

Examples of LLMs include BARD AI (Google), BERT (Google), ChatGPT (OpenAI), DistilBERT (Hugging Face), ELECTRA (Google), MarianMT (Microsoft Translator), Megatron (NVIDIA), RoBERTa (Facebook), T5 (Google/DeepMind), UniLM (Microsoft Research), and XLNet (Carnegie Mellon University/Google). LLMs can play a significant role in various stages of the educational assessment process, such as test planning, item generation, preparation of test instructions, item assembly/selection, test administration, test scoring, test analysis, interpretation, appraisal, reporting, and follow-up. Further elaborations on how LLMs can serve useful purposes in educational measurement are documented below.

Scaling Laws for LLMs

Currently, LLMs are mainly built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network. Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models. However, LLMs significantly extend the model size, data size, and total compute (by orders of magnitude). Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34].

KM scaling law: In 2020, Kaplan et al. [30] (the OpenAI team) first proposed to model the power-law relationship of model performance with respect to three major factors, namely model size (N), dataset size (D), and the amount of training compute (C), for neural language models. Given a compute budget c, they empirically presented three basic formulas for the scaling law:

L(N) = (N_c / N)^{α_N},   L(D) = (D_c / D)^{α_D},   L(C) = (C_c / C)^{α_C}    (1)

where L(·) denotes the cross-entropy loss in nats, and N_c, D_c, C_c, α_N, α_D and α_C are fitted constants. The three laws were derived by fitting the model performance with varied data sizes (22M to 23B tokens), model sizes (768M to 1.5B non-embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should not be bottlenecked by the other two factors). They showed that the model performance has a strong dependence on the three factors.

Chinchilla scaling law: As another representative study, Hoffmann et al. [34] (the Google DeepMind team) proposed an alternative form of scaling law to instruct the compute-optimal training of LLMs. They conducted rigorous experiments by varying a larger range of model sizes (70M to 16B) and data sizes (5B to 500B tokens), and fitted a similar scaling law, yet with different coefficients, as below [34]:

L(N, D) = E + A / N^{α} + B / D^{β}    (2)

where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and β = 0.28. By optimizing the loss L(N, D) under the constraint C ≈ 6ND, they showed that the optimal allocation of the compute budget to model size and data size can be derived as follows:

N_opt(C) = G (C/6)^{a},   D_opt(C) = G^{-1} (C/6)^{b}    (3)

where a = α/(α+β), b = β/(α+β), and G is a scaling coefficient that can be computed from A, B, α and β. As analyzed in [34], given an increase in compute budget, the KM scaling law favors a larger budget allocation to model size than to data size, while the Chinchilla scaling law argues that the two sizes should be increased at equal scales, i.e., having similar values for a and b in Equation (3).
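As an illustration of how such a scaling law can be used in practice, the following Python sketch numerically searches for a compute-optimal split of a budget C between model size N and data size D under the loss form and coefficients quoted above. The compute budget and grid bounds are arbitrary illustrative values, not figures taken from [34].

```python
# Minimal sketch: numerically finding a compute-optimal (N, D) split under the
# Chinchilla-style loss L(N, D) = E + A/N**alpha + B/D**beta with C ~= 6*N*D,
# using the coefficients quoted above.
import numpy as np

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_split(C, num=4000):
    # Search model sizes on a log grid; D follows from the constraint C = 6*N*D.
    N_grid = np.logspace(7, 13, num)          # 10M .. 10T parameters (illustrative bounds)
    D_grid = C / (6.0 * N_grid)
    losses = loss(N_grid, D_grid)
    i = int(np.argmin(losses))
    return N_grid[i], D_grid[i], losses[i]

C = 1e23                                      # illustrative compute budget (FLOPs)
N_opt, D_opt, L_opt = optimal_split(C)
print(f"N_opt ~ {N_opt:.3g} params, D_opt ~ {D_opt:.3g} tokens, loss ~ {L_opt:.3f}")
```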

Though based on some restrictive assumptions, these scaling laws provide an intuitive understanding of the scaling effect, making it feasible to predict the performance of LLMs during training [46]. However, some abilities (e.g., in-context learning [55]) are unpredictable according to the scaling law and can be observed only when the model size exceeds a certain level (as discussed below).

Emergent Abilities of LLMs

In the literature [31], emergent abilities of LLMs are formally defined as "the abilities that are not present in small models but arise in large models", which is one of the most prominent features that distinguish LLMs from previous PLMs. The literature further introduces a notable characteristic of emergent abilities [31]: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 58]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 59], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three typical emergent abilities of LLMs and representative models that possess them.

In-context learning: The in-context learning (ICL) ability was formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of the input text, without requiring additional training or gradient updates. Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. Such an ability also depends on the specific downstream task. For example, the ICL ability can emerge on arithmetic tasks (e.g., 3-digit addition and subtraction) for the 13B GPT-3, whereas even the 175B GPT-3 does not work well on the Persian QA task [31].

Instruction following: By fine-tuning on a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 61, 62]. With instruction tuning, LLMs are enabled to follow task instructions for new tasks without using explicit examples, thus having improved generalization ability. According to the experiments in [62], instruction-tuned LaMDA-PT [63] started to significantly outperform the untuned model on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [64] found that a model size of at least 62B is required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU).

Step-by-step reasoning: For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems. In contrast, with the chain-of-thought (CoT) prompting strategy [33], LLMs can solve such tasks by utilizing a prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study [33] has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over standard prompting becomes more evident when the model size exceeds 100B. Furthermore, the performance improvement with CoT prompting also appears to vary across tasks, e.g., GSM8K > MAWPS > SVAMP for PaLM [33].
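The contrast between standard prompting and CoT prompting can be illustrated with the toy Python snippet below; the word problems and the worked exemplar are invented for illustration and are not drawn from the cited benchmarks.

```python
# Minimal sketch contrasting standard prompting with chain-of-thought (CoT)
# prompting for an arithmetic word problem.
standard_prompt = (
    "Q: A class library has 48 books and receives 3 boxes of 12 books each. "
    "How many books does it have now?\n"
    "A:"
)

cot_prompt = (
    "Q: A farmer has 15 apples and gives away 6, then buys 4 more. "
    "How many apples does he have?\n"
    "A: He starts with 15 apples. 15 - 6 = 9. 9 + 4 = 13. The answer is 13.\n\n"
    "Q: A class library has 48 books and receives 3 boxes of 12 books each. "
    "How many books does it have now?\n"
    "A: Let's think step by step."
)

# Either string can be sent to an LLM; the CoT version demonstrates the
# intermediate reasoning steps before the final answer.
print(cot_prompt)
```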

Key Techniques for LLMs

LLMs have come a long way to evolve into their current state: general and capable learners. In the development process, a number of important techniques have been proposed which largely improve the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs, as follows.

Scaling: As discussed in previous parts, there is an evident scaling effect in Transformer language models: larger model/data sizes and more training compute typically lead to improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively. Since the compute budget is usually limited, scaling laws can be further employed to conduct a more compute-efficient allocation of compute resources. For example, Chinchilla (with more training tokens) outperforms its counterpart model Gopher (with a larger model size) by increasing the data scale with the same compute budget [34]. In addition, data scaling should be accompanied by a careful cleaning process, since the quality of pre-training data plays a key role in the model capacity.

Training: Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [65] and Megatron-LM [66–68]. Optimization tricks are also important for training stability and model performance, e.g., restarting to overcome training loss spikes [56] and mixed-precision training [69]. More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models from much smaller models.

Ability eliciting: After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers. These abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful for solving complex reasoning tasks by including intermediate reasoning steps. Furthermore, we can perform instruction tuning on LLMs with task descriptions expressed in natural language, to improve the generalizability of LLMs on unseen tasks. These eliciting techniques mainly correspond to the emergent abilities of LLMs, which may not show the same effect on small language models.

Alignment tuning: Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless. For this purpose, InstructGPT [61] designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the technique of reinforcement learning with human feedback [61, 70]. It incorporates humans in the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed on a similar technique to InstructGPT, and shows a strong alignment capacity in producing high-quality, harmless responses, e.g., refusing to answer insulting questions.

Tools manipulation: In essence, LLMs are trained as text generators over massive plain text corpora, and thus perform less well on tasks that are not best expressed in the form of text (e.g., numerical computation). In addition, their capacities are also limited by the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [71, 72]. For example, LLMs can utilize a calculator for accurate computation [71] and employ search engines to retrieve unknown information [72]. More recently, ChatGPT has enabled a mechanism for using external plugins (existing or newly created apps), which are, by analogy, the "eyes and ears" of LLMs. Such a mechanism can broadly expand the scope of capacities of LLMs.

In addition, many other factors (e.g., the upgrade of hardware) also contribute to
the success of LLMs. Currently, we limit our discussion to the major technical
approaches and key findings for developing LLMs.

Uses of LLMs in Education

Test Purpose Determination/Specification

Test purpose determination is the foremost step in the test development cycle. It
involves identifying relevant educational issues that need to be addressed or key
areas that require producing new knowledge or modifying existing ones. According to Joshua (2012), some of the main purposes of testing revolve around evaluating
teachers’ effectiveness and students’ motivation, judging students’ learning
proficiency, their acquisition of essential skills and knowledge, diagnosing
students’ learning difficulties, ranking students’ learning achievement, and
measuring their growth over time. Since the purpose of a test is built from the course content or subject (Joshua, 2012), LLMs can be useful in specifying the test purpose by analyzing the course content and identifying key
topics or concepts that need to be assessed. For instance, an LLM can analyze a
large amount of text data related to a specific course or subject, identify main
themes and concepts, and suggest appropriate test items that accurately measure
students’ understanding of those concepts.

Moreover, LLMs can be used to create adaptive tests that adjust the difficulty
level of questions based on students’ responses. This can ensure that students
are challenged appropriately and that the test accurately measures their
knowledge and skills. LLMs can also generate test items that align with specific
learning objectives and outcomes. For example, an LLM can analyze the course
content, identify the key skills or knowledge students are expected to acquire,
and generate test items that align with those objectives. LLMs can also help
determine the most appropriate testing method to address these purposes and
generate test items that align with each purpose. For example, LLMs can
analyze large amounts of text data related to students’ learning difficulties and
suggest test items that can diagnose those difficulties.

Similarly, LLMs can analyze student growth data over time and suggest test
items that accurately measure that growth. Furthermore, LLMs can help to
ensure that test items are valid, reliable, and relevant to the educational issues
being addressed. By analyzing large amounts of data related to student learning
and educational issues, LLMs can suggest appropriate test items that accurately
measure students’ knowledge and skills. Additionally, LLMs can help to ensure
that test items are fair and unbiased, which is essential in ensuring that the test
results accurately reflect students’ knowledge and skills.

Developing Test Blueprint

The test blueprint, also known as the table of specifications, is a two-dimensional table relating the levels of instructional objectives of Bloom's taxonomy in the cognitive domain to carefully outlined content areas. Usually, the levels of instructional objectives in the cognitive domain are organized horizontally across the top row in ascending order of complexity, while the content areas are organized vertically in the leftmost column of the table. The test blueprint specifies the number of items required in each cell formed by the intersection or cross-tabulation of each cognitive instructional objective level and each area of the course contents. A test blueprint is essential to ensure that a test accurately measures the intended learning outcomes. It helps ensure that the test covers all critical content areas and that the questions appropriately align with the instructional objectives. LLMs can be extremely useful in creating test blueprints or tables of specifications. They can assist in developing test blueprints by analysing the content areas and the levels of instructional objectives in the cognitive domain. They can help educators determine which instructional objectives are the most critical for a particular test and ensure that the test covers all necessary content areas. LLMs can aid in developing a test blueprint by electronically analysing relevant texts to identify the key concepts, skills, and knowledge areas that need to be assessed, and can help determine the appropriate weightage or distribution of these items in the test. This can help ensure the test is aligned with the objectives and the intended construct.

Test Item Generation/Development

Test item development involves translating the course contents into test items or
questions that will stimulate the learners and elicit the required behaviour
specified in the instructional objectives of the course (Joshua, 2012). Test items
can be broadly classified into two categories:

(1) objective items (highly structured items that have a clear and specific correct
answer, often in the form of multiple-choice, true/false, or matching questions)
and

(2) essay items (open-ended questions that require the test-taker to provide a
written response that demonstrates their understanding of a topic, their ability to
articulate ideas clearly and coherently, and often their ability to analyze,
synthesize, and evaluate information).

During the test item generation phase, a large pool of items is expected to be
gathered from relevant sources, more than the number of items required for the
test. The initial item pools can be reviewed with the support of domain experts
or peers to identify relevant, clear, specific and unambiguous test items for
selection. Those that do not meet the criteria for selection can either be
strengthened or dropped.

LLMs can assist in creating relevant, clear, specific, and unambiguous test
items. These models can analyse course content and other relevant sources to
generate a large pool of potential test items. Domain experts or peers can then
review the generated items to identify suitable items for selection. LLMs can
also assist in developing objective test items such as multiple-choice, true/false, or matching questions. These highly structured items have a clear and specific correct answer that can be generated using LLMs. LLMs trained on relevant texts can generate items that assess specific skills or knowledge areas. LLMs
can also generate distractors or incorrect options for multiple-choice questions,
ensuring they are plausible but incorrect. This can help in the creation of a
diverse and balanced item pool. These models can analyse the course content
and generate options that align with the instructional objectives. Moreover,
LLMs can also be useful in developing essay test items. These open-ended
questions require the test-taker to provide a written response demonstrating their
understanding of a topic, their ability to articulate ideas clearly and coherently,
and often their ability to analyse, synthesize, and evaluate information. LLMs
can assist in generating essay prompts that align with the instructional
objectives and are relevant to the course content.
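As a hedged illustration of this kind of item drafting, the sketch below prompts an LLM for a single multiple-choice item with three plausible distractors and returns it as JSON for expert review. The OpenAI client, model name, prompt wording and JSON schema are assumptions made for the example, not part of the cited studies.

```python
# Illustrative sketch: asking an LLM to draft a multiple-choice item with
# plausible distractors for expert review.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = """Write one multiple-choice question assessing the learning objective:
"{objective}"
Base it only on this course excerpt:
{excerpt}
Return JSON with keys: "stem", "correct", "distractors" (exactly 3 plausible
but incorrect options), and "rationale"."""

def draft_mcq(objective: str, excerpt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model name
        messages=[{"role": "user",
                   "content": PROMPT.format(objective=objective, excerpt=excerpt)}],
        temperature=0.5,
    )
    # NOTE: real use may need post-processing if the model adds text around the JSON.
    return json.loads(response.choices[0].message.content)

# The returned draft should still be screened by a domain expert before use.
```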

In an experiment, a study found that questions generated by AI were satisfactorily clear, acceptable and favourable to students, and relevant to the subject matter (Nasution, 2023).

Preparation of Test Instruction

For objectivity in testing, simple, specific and clear test instructions must be
developed to guide the test administrator and the respondents. According to
Joshua (2012), the instructions for the testing procedures should explain why
they are necessary. They should also contain information about how to organize
the testing environment, distribute and collect test materials, manage time,
procedures for recording answers and deal with anticipated and unforeseen
inquiries. For the test takers, the instructions should include the test’s purpose,
the time allowed for the test, the basis for answering, expected ethical behaviors (dos and don'ts), and the discipline to be accorded for any breach of such rules.

LLMs can be extremely useful in the preparation of test instructions. These
models are designed to understand and process natural language, which makes
them ideal for tasks that require human-like language understanding and
processing. Regarding testing instructions, LLMs can help ensure that the
instructions are unambiguous. They can also help identify confusion or
misunderstanding using specific words or phrasing. Additionally, LLMs can
suggest alternative phrasing or wording that may be clearer or more easily
understood. Moreover, LLMs can help ensure the instructions are culturally
appropriate and sensitive to different audiences. They can identify potentially
offensive or insensitive language and suggest more appropriate alternatives.
LLMs can also help with the localization of test instructions. For example, if the
test is being translated into a different language, an LLM can help ensure that
the translated instructions accurately convey the intended meaning of the
original instructions.

Item Assembly/Selection

LLMs can be very useful in the test assembly process. Test assembly involves
selecting and organizing test items or questions to create a test that accurately
measures a specific construct or skill. By analysing relevant texts, LLMs can
assist in identifying the most relevant and appropriate items from the item pool
based on the test blueprint and objectives. This can help ensure that the test is
well-balanced, covers the intended construct, and is appropriate for the target
population. They can assist in this process by analysing the content of test items
and identifying potential issues with wording, phrasing, or cultural biases. For
example, an LLM can identify items with difficult vocabulary or syntax that
may be confusing to test-takers or items that use colloquial language that may
not be familiar to all test-takers. LLMs can analyse the coherence and
consistency of test items to ensure that they assess the intended construct or
skill fairly and validly. They can identify potential redundancies or overlaps in the content of test items or identify items irrelevant to the intended construct or
skill. Furthermore, they can assist with translating test items into different
languages, ensuring that the translated items accurately convey the intended
meaning of the original items. LLMs can also assist in selecting and organizing
test items by using NLP techniques to identify relationships between items. For
example, an LLM can analyze the content of items and group them based on
shared themes or concepts. All of these can be done in just a matter of seconds.

Test Administration

According to Joshua (2012), all students must be allowed to manifest the desired behaviour being measured during test administration. During the test
administration, the examiners must announce the test in advance, telling the
examinees what, when, where, and how the test will be administered. The
examiners must also assure the examinees that the test conditions will be
satisfactory. There is also a need to minimize cheating using diverse approaches,
such as adjusting the sitting arrangements for physically taken tests (Owan et
al., 2023) and proctoring electronically taken tests (Owan, 2020; Owan et al.,
2019). LLMs can be useful in various ways during test administration to ensure
that all students have a fair chance to manifest the desired behaviour being
measured and to minimize cheating. LLMs can generate clear and accessible
test instructions that students can understand easily. This can help ensure that all
students have an equal opportunity to demonstrate their knowledge and skills,
regardless of their language proficiency or other factors affecting their ability to
understand the instructions. LLMs can monitor test-taking behaviour during the
test administration to detect unusual patterns that might suggest cheating or
other misconduct. LLMs can be trained on data from previous test
administrations to identify common cheating patterns and help examiners
identify potential misconduct cases. LLMs can be used to support remote
proctoring for electronically taken tests. This can include facial recognition technology to verify the test-takers' identity, eye-tracking technology to detect
unusual eye movements, and keystroke analysis to detect unusual typing
patterns. Furthermore, LLMs can help ensure the security of tests by providing
features such as password protection, encryption, and monitoring tools to
prevent cheating and unauthorized access to test content.

LLMs can be used in test administration to facilitate the delivery and management of assessments. For instance, LLMs can be employed in CBT
environments to deliver test items, record student responses, and monitor test
progress. LLMs can also provide accommodations for students with disabilities, such as text-to-speech or speech-to-text capabilities. Using LLMs in test administration can improve the quality of the testing experience and increase the accuracy and reliability of test results. One of the most significant benefits of LLMs in test administration is
their ability to provide immediate feedback to test-takers. This can help students
or respondents see their progress and performance in real time, increasing
motivation and engagement. Immediate feedback can also help to identify areas where the test taker needs to improve and provide opportunities for remediation.
LLMs can also assist in delivering tests to many test-takers simultaneously,
which can be particularly useful in high-stakes testing situations such as college
entrance exams or professional certifications. Additionally, LLMs can provide a
more engaging and interactive testing experience. They can be programmed to
provide multimedia content, such as images, videos, and audio recordings,
enhancing the learning experience and helping test-takers better understand
complex concepts.

Test Scoring

Test scoring refers to evaluating and assigning a numerical score or grade to a test or assessment taken by a student or group of students. The score is typically
based on the number of correct answers the student(s) gave on the test questions. However, it may also consider other factors such as partial credit,
essay responses, or subjective evaluations by the test scorer or teacher. Test
scoring is a common practice in education used to measure student knowledge,
understanding, and skill levels and provide feedback and guidance for further
learning.

LLMs can be useful for test scoring in several ways. They can be used to automate the process of grading and scoring tests, which can save a significant amount of time and effort for teachers and instructors. This is particularly useful in cases where large-scale tests need to be graded quickly, such as in standardized testing or online assessments. LLMs can provide more accurate and consistent scoring of tests than human graders, as they are not subject to biases or errors arising from fatigue, distraction, or personal preferences. They can also be programmed to recognize and account for common mistakes or misconceptions made by students, which can help to identify areas where further teaching or support may be needed. LLMs can provide more detailed feedback and analysis of test scores than traditional scoring methods. They can be programmed to provide explanations or examples of correct answers and highlight areas where a student may need to improve or focus more attention. This can guide further learning and development and provide more personalized and targeted support for individual students. LLMs trained on a large corpus of text can be fine-tuned to provide scores for open-ended questions or essays based on various criteria, such as content, language use, and organization. Automated scoring with LLMs can provide quick and consistent results, particularly for large-scale assessments.
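A minimal sketch of such rubric-based scoring is given below; the rubric wording, 0-5 scale and model name are assumptions for illustration, and any scores produced this way would still require human moderation.

```python
# Minimal sketch of LLM-assisted essay scoring against the criteria named
# above (content, language use, organization).
from openai import OpenAI

client = OpenAI()

def score_essay(question: str, essay: str) -> str:
    rubric = (
        "Score the essay from 0-5 on each criterion: content, language use, "
        "organization. Give a one-sentence justification per criterion."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model name
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\n\nEssay:\n{essay}"},
        ],
        temperature=0.0,         # low temperature for more consistent scoring
    )
    return response.choices[0].message.content
```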

Interpretation of Test Results

Interpreting test results means analysing and making sense of the scores or
outcomes obtained from a test or assessment. The interpretation of test results is an important part of the testing process. It allows for meaningful conclusions to
be drawn about a student’s knowledge, skills, and abilities and can inform
decisions about teaching, learning, and further assessment. The interpretation of
test results typically involves comparing the scores obtained by an individual or
group of individuals to established norms or standards, such as the scores of
other students in the same grade or subject area, or to pre-determined criteria for
proficiency or mastery. This comparison can help identify areas of strength and
weakness and provide insight into the overall level of knowledge or
achievement of the test taker(s). In addition to comparing scores to established
norms or standards, interpreting test results may involve examining patterns or
trends in the scores over time or comparing scores on different tests or subtests
within a larger assessment. This can help identify areas where further learning
or intervention may be needed and track progress or improvement.

LLMs can aid in interpreting assessment results by providing insights into the
meaning and implications of the data. LLMs can be used to analyse and
interpret test scores, performance levels, and other assessment outcomes in the
context of the test objectives, standards, and criteria. This can help educators
and policymakers make informed decisions about the performance of
individuals or groups of test takers. Apart from their ability to automate the
interpretation of a large volume of test results in record time, LLMs can provide
a more detailed and nuanced analysis of test results than traditional methods.
They can be programmed to recognize patterns or trends in the scores that may
not be immediately apparent to human graders or analysts. They can also
identify relationships between test items or subtests relevant to further teaching
or assessment. LLMs can provide personalized feedback and recommendations
based on individual test results, which can help to guide further learning and
development. For example, an LLM could analyse a student's test results and provide recommendations for specific areas where they may need to focus more or practice, based on their strengths and weaknesses.

Test Analysis/Appraisal

Test analysis refers to the process of examining the results of a test in order to extract meaningful information about the performance of the test-takers. Test analysis aims to identify areas of strength and weakness and provide insights into the test-takers' overall level of knowledge or achievement. The test analysis process typically involves gathering the scores and other relevant information from the test and organizing it to allow easy analysis; computing descriptive statistics such as the mean, standard deviation, and frequency distributions to summarize the test scores and provide an overview of the distribution of scores; examining item-level performance by analysing the performance of individual test items to identify items that were particularly difficult or easy for test-takers and items that may have been ambiguous or unclear; and identifying patterns or trends in the test scores across different test-taker subgroups (e.g., gender, ethnicity, or age) and over time (e.g., comparing scores from different test administrations). Based on the findings of the test analysis, conclusions can be drawn about the performance of the test-takers, and recommendations for further teaching or assessment can be made. Various stakeholders, including test developers, educators, and policymakers, may conduct test appraisals. The process typically involves using established criteria or standards to evaluate the test, such as those outlined by the American Psychological Association or the National Council on Measurement in Education. The results of a test appraisal may be used to inform decisions about test selection, interpretation, and use, as well as to guide improvements in test development and administration processes.

LLMs have the potential to be valuable tools in test appraisals. LLMs can be
employed to analyse assessment data, including item statistics, item difficulty, discrimination indices, and other performance metrics. LLMs can help identify
patterns, trends, and anomalies in the data and provide insights into the overall
performance of test takers and the quality of test items. LLMs can be useful in
providing detailed feedback to students, highlighting areas where they need
improvement or providing explanations for correct answers. The analytics
generated by LLMs can provide insights into student strengths and weaknesses,
highlight areas where additional instruction may be needed, and help teachers
and administrators make informed decisions about instruction and resource
allocation. LLMs can also identify potential sources of confusion or
misunderstanding in test questions, such as multiple word or phrase meanings.
This information can be used to revise questions or provide additional
clarification to ensure that all students have a fair and accurate understanding of
what is being asked. LLMs can also be used in item analysis. Item analysis
involves analyzing the performance of individual test items to identify areas where they may be flawed or ineffective. LLMs can use student responses to specific items to highlight areas where the item may be too difficult, too easy,
or poorly worded. LLMs can also identify patterns in student responses to
specific types of items, such as multiple-choice or essay questions. This
information can inform decisions about the design and format of future
assessments and ensure that assessments are as effective and fair as possible.

Reporting

Reporting refers to communicating the results of a test or assessment to relevant stakeholders, such as teachers, students, parents, and administrators. It typically
involves summarizing the test results clearly and concisely and providing
information about how they can inform instruction and decision-making. In
educational testing, reporting often involves providing scores or grades that
indicate how well students performed on the test. For example, a test may be
scored on a scale from 0-100, with scores above a certain threshold indicating proficiency in a particular skill or subject area. Reporting may also include
information about how students performed on different test items, or their
performance compared to other students in their class or school. Reporting may
also involve providing feedback to students and teachers about areas where
students performed well and where they may need additional support or
instruction. This feedback can inform instruction and help students improve
their performance on future assessments. In addition to communicating the
results of a test, reporting may also include information about the validity and
reliability of the test. This information can be used to evaluate the quality of the
test and ensure that it provides accurate and useful information about student
performance.

LLMs can assist in the generation of test reports. By analysing the assessment
data, LLMs can generate comprehensive reports summarizing the test results,
including descriptive statistics, performance profiles, and graphical
representations. LLMs can also generate interpretive reports that provide
insights and recommendations based on the test results. These reports can be
used by educators, policymakers, and other stakeholders for decision-making
and planning purposes.

Related work

Question generation tools seek to accept input text and generate meaningful questions that are extracted from the text. Existing question generation tools can be categorised into two groups, i.e., rule-based tools and neural-based tools [12]. Rule-based tools such as [13] and [14] exploit manually crafted rules to extract questions from text. Neural-based techniques implement an end-to-end architecture that follows an attention-based sequence-to-sequence framework [15]. Sequence-to-sequence frameworks are mainly composed of two key parts, i.e., the encoder, which learns a joint representation of the input text, and the decoder, which generates the questions [12]. Currently, both the joint representation learning and the question generation are implemented by attention-based frameworks [16].

Work in [17] introduced the attention-based sequence-to-sequence architecture to generate questions from an input sentence. The encoder was implemented via an RNN to accept input text and learn its representation. Its output is fed into the decoder, which generates a related question. The decoder exploits an attention mechanism to assign more weight to the most relevant parts of the text. Other question-asking tools that implement the attention-based sequence-to-sequence framework include [18], [19] and [20]. Although these neural-based tools have a common high-level encoder-decoder structure, their low-level implementations of different aspects of question generation differ significantly [12]. While these tools are generally developed for question generation, there are tools such as [9] and [10] which target question generation for educational purposes. The use of questions in guided reading has been widely studied [21] [22] [23]. Questions are used during guided reading to evaluate understanding, encourage deeper thinking about the text and scaffold understanding of challenging text [11]. It has been suggested in [8] that large language models can play a significant role in boosting the education of children at different levels. It is therefore important to evaluate whether they are indeed fit for the purpose for which they are being deployed. In this work we evaluate their ability to participate in guided reading and, specifically, evaluate their question generation ability. Are they able to follow the general trend of a human teacher in asking questions during comprehension reading?


The idea of utilizing AI systems for educational tasks is not a recent one. Discussions of algorithmically generated learning materials date back to the 1970s [11]. However, the rapid growth of generative AI in the fields of natural language processing (NLP) and computer vision has opened a wide array of uses, both as a tool for in-class and for supplementary instruction, accelerating the discourse in recent years. An emerging body of research has investigated the effectiveness of such tools in the classroom and demonstrates their enormous potential for automatic question generation and direct interaction with LLMs. Prior reviews primarily focused on outlining potential applications of LLMs in education and highlighting the need for additional literacy among both students and educators to better understand the technology, such as Kasneci et al. (2023) [25]. The authors highlight future concerns such as the potential for student over-reliance on models to erode critical-thinking and problem-solving skills. These are important considerations that should indeed guide specific implementations of LLMs and algorithmically generated content in learning materials.

While much of this research is still in relative infancy with limited empirical
study [22], we give a brief outline of work done to date on applying AI and NLP
systems in education, as well as directions of continuing research.

The growing body of work in this field has found generally positive results for the ability of LLMs to produce useful learning materials and serve as fruitful conversational agents with learners [37] [23]. A significant virtue of delivering instruction via interaction is that such tools bring elements of personalized interaction to otherwise remote learning activities. This allows for striking what Vie et al. (2017) describe as "a better balance between giving learners what they need to learn (i.e. adaptivity) and giving them what they want to learn (i.e. adaptability)" [57]. In short, NLP tools like GPT-3 and its relatives help to alleviate the top-down nature of traditional approaches to remote student work.

Incorporating open-ended conversations and responses to prompts generated by
chatbots is one such application toward this end that has received substantial
study. Steuer et al (2021) found automatically generated questions to be relevant
to their intended topics, free of language errors, and to contain natural and
easily-comprehensible language in a variety of domains using their
autoregressive language model [52]. Additionally, their generated questions
successfully addressed central concepts of their training texts and topics, which
the authors describe as pedagogical “coreness”. This suggests that the produced
tasks were indeed pedagogically useful within their subjects and contexts.

Though useful questions are essential, assessing how students respond to and interact with them is also needed. Abdelghani et al. (2023) compared question-asking behavior among primary school students after utilizing the prompt-based learning of GPT-3 to directly automate elements of course content [1]. This was a particularly encouraging result, since it featured a more open-ended interaction structure and a greater focus on student responses than Steuer et al. (2021), giving some indication of how prior results might generalize to LLMs applied to an even wider range of possible tasks. Overall, their results suggest that such automated prompts generally elicited positive responses from students and show potential for increasing curiosity and feelings of agency in their learning.

Additionally, Wu et al. (2020) found that interaction with a chatbot in e-learning environments alleviates feelings of isolation and detachment that often accompany the use of such platforms [60]. As more learning content has shifted online in the wake of the COVID-19 pandemic [4], being able to provide access to high-quality instruction regardless of time and place is relevant for both the present and future impact on pedagogy.

While these results present highly encouraging paths forward, it is important to consider the limited scope of much of the research conducted to date. Though the studies discussed above involved some degree of open-ended interaction, they were largely limited to providing prompts or keywords within a narrow task framework. Fully open-ended chatbot-style conversations for pedagogical uses have yet to receive specific attention.

A natural question in this setting is the degree to which the information and reasoning provided by AI agents are reliable. Jiang et al. (2021) investigated the calibration of LLMs on trivia-style question-answering tasks across a number of disciplines [23]. They found that while the models tested (GPT-2, T5, BART) performed well, they were generally poorly calibrated, tending to be over-confident in their predictions. The authors show that model fine-tuning procedures substantially mitigate this issue. While education is less critical than fields like medical diagnosis, where safety and proper confidence calibration are essential, properly calibrated degrees of confidence are highly relevant for student feedback and interactions. Future work should examine the degree to which domain- and class-specific fine-tuning can improve LLM reliability at question-answering tasks within a student-AI interaction setting.

In addition to creating course content and engaging students in discussions, developments in generative adversarial network (GAN) architectures [18] and AI-generated media allow for systems that produce synthetic interfaces with which students can interact. Pataranutaporn et al. (2021) discuss potential use cases of AI-generated animated characters utilizing GAN architectures for interaction in learning environments [45]. Prior research demonstrates that learning materials incorporating interaction with fictional characters positively impact student experiences, improving motivation and attitudes [30].

This work suggests a strong potential for generative AI to enhance not only the instructional content being delivered but also the mode of delivery itself, in ways that promote motivated and curious engagement from students across age and ability spectra. This is a particularly intriguing area of research, since these early results align well with the findings of Wu et al. (2020) that AI interaction reduces some of the prominent downsides of loneliness in online learning environments. Future work should seek to combine LLM interactions with GAN-created animations, allowing interactive learning content to be enjoyable and highly engaging for younger students as well.

Definition of Metrics of Project Success

Metrics

Each annotator was trained to assess the generated candidates on two of four quality metrics, as well as a usefulness metric. This division was done to reduce the cognitive load on an individual annotator. The quality metrics are: relevance (a binary variable representing whether the question is related to the context), adherence (a binary variable representing whether the question is an instance of the desired question taxonomy level), grammar (a binary variable representing whether the question is grammatically correct), and answerability (a binary variable representing whether there is a text span from the context that is an answer or leads to one). The relevance, grammar, answerability, and adherence metrics are binary as they are objective measures, often seen in the QG literature to assess typical failures of LLMs such as hallucinations or malformed outputs [5]. The subjective metric assessed, the usefulness metric, is rated on a scale because it is more nuanced. It is defined by a teacher's answer to the question: "Assume you wanted to teach about context X. Do you think candidate Y would be useful in a lesson, homework, quiz, etc.?" This ordinal metric has the following four categories: not useful, useful with major edits (taking more than a minute), useful with minor edits (taking less than a minute), and useful with no edits. If a teacher rates a question as not useful or useful with major edits, we also ask them to select from a list of reasons why (or write their own).
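For illustration, one possible way to represent a single annotation record covering these metrics is sketched below in Python; the field names and structure are assumptions rather than the annotation tool actually used.

```python
# A small sketch of how one annotation record for the metrics above could be
# represented. Field names are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional

USEFULNESS_LEVELS = ("not_useful", "major_edits", "minor_edits", "no_edits")

@dataclass
class QuestionAnnotation:
    question: str
    context_id: str
    relevance: bool        # related to the context?
    adherence: bool        # matches the requested taxonomy level?
    grammar: bool          # grammatically correct?
    answerability: bool    # answerable from a span of the context?
    usefulness: str        # one of USEFULNESS_LEVELS
    reason: Optional[str] = None  # collected when usefulness is rated low

    def __post_init__(self):
        assert self.usefulness in USEFULNESS_LEVELS
```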

Ability to generate meaningful questions

LLMs must demonstrate that they are capable of an in-depth understanding of a given input text before they can be deployed for the generation of questions. One indicator of comprehension of the input text is the ability to generate meaningful questions and answers from it. Perhaps one of the greatest strengths of LLMs such as ChatGPT is the ability to respond to questions posed to them on the fly. However, it is unclear to what extent they can connect different aspects of the input text to generate both low cognitive demand questions and questions that require inference to be answered (i.e., high cognitive demand questions). Moreover, how will an LLM exploit the vast amount of knowledge it acquired during training to boost its question-asking ability? Furthermore, will it be able to generate answers from the input text without being "confused" by its internal knowledge? Evaluating the ability of LLMs to ask meaningful questions that can be answered solely from the input story is therefore of interest, as is evaluating how accurately LLMs answer those questions when relying solely on their understanding of the input text. To do this, a given LLM is prompted to generate questions based on an input text. The generated questions are automatically evaluated by comparing their semantic similarity to a set of baseline questions.

To evaluate the performance of a given LLM on question generation, metrics popularly used in the question generation task, including ROUGE-L [24] and BERTScore [25], were explored to compare the semantic similarity of the generated questions with the baseline questions. The similarity between LLM-generated questions and reference questions is evaluated by concatenating the generated questions into one sequence and comparing it with the similarly concatenated reference questions (see [10], which uses a similar approach).
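A small Python sketch of this concatenate-and-compare procedure is given below, assuming the publicly available rouge-score and bert-score packages; the example question sets are invented.

```python
# Sketch of the concatenate-and-compare evaluation described above, assuming
# the `rouge-score` and `bert-score` packages (pip install rouge-score bert-score).
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = ["Who hid the crown?", "Why did the queen go to the forest?"]
reference = ["Who hid the golden crown?", "Why was the queen in the forest?"]

# Concatenate each question set into a single sequence before comparison.
gen_concat = " ".join(generated)
ref_concat = " ".join(reference)

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(ref_concat, gen_concat)["rougeL"].fmeasure

P, R, F1 = bert_score([gen_concat], [ref_concat], lang="en")
print(f"ROUGE-L F1: {rouge_l:.3f}, BERTScore F1: {F1.item():.3f}")
```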

Question diversity

To test a student's understanding of a given text, questions must be generated that cover nearly all the sections of the story. We are interested in evaluating the ability of ChatGPT and Bard to generate questions that are not biased towards a given section of the text. Concretely, we seek to quantify the variation of the questions generated by the LLMs. We hypothesize that the more diverse the questions are, the more exhaustively they cover the different topics in the input text. This gives an indication of the suitability of LLMs to generate questions that cover the whole content being read. In machine learning, several entropy-based techniques have been proposed to evaluate the diversity of a dataset. Research in [26] proposes the Inception Score (IS) to evaluate synthetic samples generated by a Generative Adversarial Network (GAN) G [27]. The intuition behind IS is that the conditional label distribution p(y|x) of images generated by the GAN should have low entropy, i.e., each generated image clearly belongs to a single class, while the marginal distribution p(y) = ∫ p(y|x = G(z)) dz should have high entropy, i.e., the generated images span many classes. To capture this intuition, they suggest the metric in Equation 1.

IS(G) = exp( E_x [ KL( p(y|x) || p(y) ) ] )    (1)
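
For concreteness, Equation 1 can be computed directly once a matrix of predicted class probabilities p(y|x) is available for the generated samples. A minimal numpy sketch; the classifier outputs below are toy placeholder values, not results from this study:

import numpy as np

def inception_score(p_yx, eps=1e-12):
    # Equation 1: exp of the mean KL divergence between p(y|x) and p(y).
    # p_yx has shape (num_samples, num_classes); each row sums to 1.
    p_y = p_yx.mean(axis=0, keepdims=True)                        # marginal label distribution
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Toy example: three "samples" over four classes
p_yx = np.array([[0.90, 0.05, 0.03, 0.02],
                 [0.10, 0.80, 0.05, 0.05],
                 [0.05, 0.05, 0.85, 0.05]])
print(inception_score(p_yx))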

Another popular metric that provides a measure of diversity of synthetically generated samples is the Frechet Inception Distance (FID) score [28]. This metric considers the location and ordering of the data along the function space. Taking the activations of the penultimate layer of a given model to represent the features of a given dataset x, and considering only the first two moments, i.e., the mean and covariance, FID assumes that the coding units of the model f(x) follow a multi-dimensional Gaussian, the maximum entropy distribution for a given mean and covariance. If the model f(x) produces embeddings with mean and covariance (m, C) for the synthetic data distribution p(.) and (m_w, C_w) for the real data distribution p_w(.), then FID is defined as:

d^2( (m, C), (m_w, C_w) ) = ||m - m_w||_2^2 + Tr( C + C_w - 2 (C C_w)^(1/2) )    (2)
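
Equation 2 can likewise be evaluated directly from sample embeddings. A minimal numpy/scipy sketch, assuming the embeddings of the synthetic and real sets are already extracted; the random arrays at the end are placeholders only:

import numpy as np
from scipy.linalg import sqrtm

def fid(emb_synthetic, emb_real):
    # Equation 2: Frechet distance between two Gaussians fitted to embeddings
    # of shape (num_samples, embedding_dim).
    m, m_w = emb_synthetic.mean(axis=0), emb_real.mean(axis=0)
    C = np.cov(emb_synthetic, rowvar=False)
    C_w = np.cov(emb_real, rowvar=False)
    covmean = sqrtm(C @ C_w)
    if np.iscomplexobj(covmean):           # drop tiny imaginary parts introduced by sqrtm
        covmean = covmean.real
    return float(np.sum((m - m_w) ** 2) + np.trace(C + C_w - 2 * covmean))

# Toy example with random 64-dimensional embeddings
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(100, 64)), rng.normal(loc=0.5, size=(100, 64))))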

Other metrics that have been proposed to evaluate diversity include precision and recall metrics [29] [30]. One major problem with these metrics is that they assume the existence of reference samples against which the generated samples can be compared [30]. In our case, we seek to evaluate the diversity of the generated questions without comparing them to any reference questions. To achieve this, we use the Vendi Score (VS), a metric proposed for diversity evaluation in the absence of reference data. VS is defined as:

VS(X) = exp( - Σ_{i=1}^{n} λ_i log λ_i )    (3)

Here X = {x_1, ..., x_n} is the input data whose diversity is to be evaluated, and λ_1, ..., λ_n are the eigenvalues of the positive semidefinite matrix K/n, whose entries are K_ij = k(x_i, x_j), where k is a positive semidefinite similarity function with k(x, x) = 1 for all x. This metric, which is similar to the effective rank [31], seeks to quantify the geometric transformation induced by a linear mapping of a vector x from a vector space R^n to R^m by a matrix A, i.e., Ax. Normally, the number of dimensions retained by the linear transformation Ax is captured by the rank of the matrix A. The rank, however, is silent on the shape induced by the transformation. The effective rank introduced by [31] seeks to capture the shape that results from the linear mapping. It can therefore be used to capture the spread of the data, which makes it well suited to measuring diversity. To compute the diversity of the questions generated by the two LLMs, we execute the following steps (a minimal code sketch of this pipeline is given after the list):

1. Prompt the LLM to generate a set of questions Q given an input text.

2. Replicate the set Q1 = {q_1, ..., q_n} to get a copy of the questions Q2 = {q_1, ..., q_n}. Designate Q1 as the reference set and Q2 as the candidate set.

3. Pass the sets Q1 and Q2 through BERTScore to extract the cosine similarity matrix K.

4. Use the VS package to compute the VS diversity value.

5. Compare the VS diversity value to the human-generated diversity score.
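
A minimal sketch of steps 2 to 4, assuming the bert-score package is installed; rather than calling the VS package named in step 4, the score here is computed directly from the eigenvalues of K/n as in Equation 3, and the question strings are illustrative only:

import numpy as np
from bert_score import BERTScorer

questions = ["Who was the king?", "Why was the king sad?", "Where did the princess live?"]

# Step 2: reference and candidate sets are identical copies of the generated questions
Q1, Q2 = list(questions), list(questions)

# Step 3: pairwise BERTScore F1 as the similarity matrix K (identical strings score 1)
scorer = BERTScorer(lang="en")
n = len(Q1)
K = np.zeros((n, n))
for i in range(n):
    _, _, F1 = scorer.score([Q1[i]] * n, Q2)   # similarity of question i to every question
    K[i] = F1.numpy()
K = (K + K.T) / 2                              # symmetrise numerical noise
np.fill_diagonal(K, 1.0)

# Step 4: Vendi Score from the eigenvalues of K / n (Equation 3)
lam = np.clip(np.linalg.eigvalsh(K / n), 1e-12, None)
vendi = float(np.exp(-(lam * np.log(lam)).sum()))
print(vendi)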

Ability to generate questions that differ in difficulty

On top of generating questions that cover the whole content, it is desirable for LLMs to generate a wide range of questions, from low to high cognitive challenge. This allows students to answer questions that address all the cognitive domains. Low cognitive challenge questions mostly require short answers that affirm or deny something (e.g., "Did the queen have golden hair?"). Conversely, high cognitive challenge questions require explanations, evaluations and some speculation on extensions of the text [11] (e.g., "Why did the king's advisors fail to find the right wife for him?"). The purpose of low cognitive challenge questions is to evaluate the students' basic understanding of the text and ease them into the study interaction [11]. However, they have the potential of promoting over-dominance of teachers. On the other hand, high cognitive challenge questions foster greater engagement of the students, generate inferential responses and promote better comprehension of the reading material [11]. In [11], the two categories of questions are differentiated by exploiting the syntactic structure of the question. High cognitive challenge questions are signalled by a wh-word, i.e., questions that use words such as what, why, how, or when. These questions themselves span a continuum from high to low challenge. Specifically, wh-pronoun and wh-determiner questions that start with who, whom, whoever, what, which, whose, whichever and whatever require low challenge literal responses. However, wh-adverb questions such as how, why, where, and when are more challenging since they require more abstract and inferential responses involving explanations of causation and evaluation (e.g., "Why was the king sad?"; "How did the king's daughter receive the news of her marriage?"). Low cognitive challenge questions are generally non-wh-word questions. It has been suggested in [21][32] that teachers should generally seek to ask high challenge questions as opposed to low challenge questions whenever possible. In our case we seek to establish the types of questions preferred by the two LLMs based on their level of challenge. To evaluate this, we adopt three categories of questions, i.e., confirmative, explicit and implicit. Explicit questions are non-confirmative questions that require low challenge literal responses, i.e., where answers can be retrieved from the text without much inference, while implicit questions require inferential responses.
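
To make the three-way categorisation concrete, here is a minimal sketch that approximates it from the leading word of each question. The exact rules in [11] are more refined; this keyword heuristic is only an illustration:

# Illustrative approximation of the confirmative / explicit / implicit split
# based on the leading word of the question; not the exact rules of [11].
WH_LITERAL = {"who", "whom", "whoever", "what", "which", "whose", "whichever", "whatever"}
WH_INFERENTIAL = {"how", "why", "where", "when"}

def categorise(question: str) -> str:
    first = question.strip().lower().split()[0]
    if first in WH_INFERENTIAL:
        return "implicit"      # requires an abstract / inferential response
    if first in WH_LITERAL:
        return "explicit"      # literal answer retrievable from the text
    return "confirmative"      # typically a yes/no question

for q in ["Did the queen have golden hair?",
          "Who was the king's daughter?",
          "Why was the king sad?"]:
    print(q, "->", categorise(q))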

Ability to recommend section of text

Based on the responses to the teacher's questions, the teacher can detect students' weaknesses and their needs [33]. While it was difficult to design an evaluation that uncovers students' weaknesses from their responses alone, we resorted to evaluating the LLMs' ability to recommend the part of the text that the student needs to re-read, based on the responses provided to the questions. Basically, we evaluate the ability of an LLM to detect the part of the text that the student did not understand. This, we believe, plays some part in diagnosing a student's needs.
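
One simple way to set up this evaluation is to place the story, the question and the student's answer into a single prompt and ask the model to point at the passage to re-read. A minimal, model-agnostic sketch of the prompt construction; the template wording is our own illustration, not a prompt taken from this study:

# Illustrative prompt template; the wording is a placeholder, not the exact
# prompt used in the study. The resulting string can be sent to any chat LLM.
TEMPLATE = (
    "Here is a story:\n{story}\n\n"
    "The student was asked: {question}\n"
    "The student answered: {answer}\n\n"
    "If the answer is wrong or incomplete, quote the part of the story the "
    "student should re-read. Otherwise reply 'No re-reading needed.'"
)

def build_rereading_prompt(story: str, question: str, answer: str) -> str:
    return TEMPLATE.format(story=story, question=question, answer=answer)

print(build_rereading_prompt(
    story="The king was sad because he could not find a wife...",
    question="Why was the king sad?",
    answer="Because he lost his crown."))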

Potential Model(s) Exploration and Selection

Generating questions from a given text is a challenging task in natural language
processing. There are several approaches and models that can be used for
question generation. Here are a few notable ones:

1. Rule-Based Approaches:

Template-Based Systems: These systems use predefined templates to generate questions. For example, a template might be "What is the [entity] in [text]?" Templates are filled with relevant information extracted from the input text (a minimal template-filling sketch is given after this list).

2. Machine Learning-Based Approaches

Supervised Learning: A supervised learning approach involves training a model on a labeled dataset where each input text is paired with its corresponding questions. Features are extracted from the text, and the model learns to map these features to questions.

Support Vector Machines (SVM): SVMs can be used for question generation by training on a labeled dataset with input-output pairs.

Random Forests: Random Forests can be employed for feature extraction and learning the mapping from text to questions.

Seq2Seq Models: Sequence-to-Sequence models, particularly those based on recurrent neural networks (RNNs) or transformers, have been successful in various natural language processing tasks, including question generation.

Long Short-Term Memory (LSTM) Networks: LSTMs are a type of RNN that can be used for sequence-to-sequence tasks.

Transformer Models: Models like GPT-3 and BERT can be fine-tuned for question generation by training them on a specific dataset.

3. Reinforcement Learning-Based Approaches

Policy Gradient Methods: Reinforcement learning techniques can be used to train models to generate questions by rewarding those that generate relevant and meaningful questions.

Deep Q-Networks (DQN): DQN, a type of reinforcement learning model, could be adapted for question generation tasks.

4. Pretrained Language Models

BERT (Bidirectional Encoder Representations from Transformers): BERT can be fine-tuned for question generation tasks by training it on a dataset with input text and corresponding questions.

GPT (Generative Pre-trained Transformer): GPT models can be used for question generation by conditioning the model on input text and generating questions as output.

5. Hybrid Models:

Combination of Rule-Based and ML Models: Some systems use a combination of rule-based approaches and machine learning models to benefit from the strengths of both approaches.
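
As promised under the rule-based approach above, here is a minimal template-filling sketch. The "entity" extraction is a deliberately naive capitalised-word heuristic, not a production-quality extractor; a real system would use a named entity recogniser:

import re

def extract_entities(text: str):
    # Very naive placeholder: capitalised words that are not the first word of the text.
    words = re.findall(r"[A-Za-z]+", text)
    entities = [w for w in words[1:] if w[0].isupper()]
    return list(dict.fromkeys(entities))      # de-duplicate, keep order

TEMPLATES = [
    "Who is {entity} in the story?",
    "What does the story say about {entity}?",
]

def template_questions(text: str):
    # Fill every template with every extracted entity
    return [t.format(entity=e) for e in extract_entities(text) for t in TEMPLATES]

story = "The king asked Elsa to visit Avalon."
for q in template_questions(story):
    print(q)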

When choosing a model for question generation, factors such as the available
dataset, computational resources, and the specific requirements of the task
should be considered. The choice may also depend on whether the focus is on
extractive or abstractive question generation, where extractive involves selecting portions of the input text as questions, and abstractive involves
generating questions in a more creative, paraphrased manner.
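
As an illustration of the abstractive, transformer-based route, here is a minimal generation sketch. It assumes a sequence-to-sequence checkpoint already fine-tuned for question generation is available at the placeholder path "qg-t5-finetuned" and accepts a "generate questions:" prefix; both the path and the prefix are assumptions for illustration, not details from this document:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path to a fine-tuned seq2seq checkpoint (assumption)
model_name = "qg-t5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

passage = "The king had a daughter with golden hair."
inputs = tokenizer("generate questions: " + passage, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32,
                         num_return_sequences=3, do_sample=True)
for q in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(q)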

Determination of data to collect related to the quizzes

Collection of needed Data

Preparation of Data for Model training

Model Training/Fine-Tuning

Evaluation of Model’s Performance

Hyper-parameter Tuning

Report and Presentation
