Computers and Education: Artificial Intelligence 3 (2022) 100075


Assessment in the age of artificial intelligence


Zachary Swiecki a,*, Hassan Khosravi b, Guanliang Chen a, Roberto Martinez-Maldonado a, Jason M. Lodge b, Sandra Milligan c, Neil Selwyn a, Dragan Gašević a

a Monash University, Australia
b The University of Queensland, Australia
c The University of Melbourne, Australia

* Corresponding author. E-mail address: [email protected] (Z. Swiecki).

https://doi.org/10.1016/j.caeai.2022.100075
Received 9 August 2021; Received in revised form 4 March 2022; Accepted 21 April 2022; Available online 9 May 2022
2666-920X/© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract

In this paper, we argue that a particular set of issues mars traditional assessment practices. They may be difficult for educators to design and implement; only provide
discrete snapshots of performance rather than nuanced views of learning; be unadapted to the particular knowledge, skills, and backgrounds of participants; be
tailored to the culture of schooling rather than the cultures schooling is designed to prepare students to enter; and assess skills that humans routinely use computers to
perform. We review extant artificial intelligence approaches that–at least partially–address these issues and critically discuss whether these approaches present
additional challenges for assessment practice.

1. Introduction

Well-designed assessments are essential for determining whether students have learned (Almond, Steinberg, & Mislevy, 2002; Mislevy, Steinberg, & Almond, 2003). Traditional assessment practices, such as multiple-choice questions, essays, and short answer questions, have been widely used to infer student knowledge and learning (see, for example, Kaipa, 2021). In this paper, we argue that these traditional practices have several issues. First, they can be onerous for educators to design and implement. Second, they may only provide discrete snapshots of performance rather than nuanced views of learning. Third, they may be uniform and thus unadapted to the particular knowledge, skills, and backgrounds of participants. Fourth, they may be inauthentic, adhering to the culture of schooling rather than the cultures schooling is designed to prepare students to enter. And finally, they may be antiquated, assessing skills that humans routinely use machines to perform. After outlining these arguments, we describe several applications of artificial intelligence (AI) that have come to–at least partially–address these issues. However, we also acknowledge that traditional assessment practices were developed for a reason and, to some extent, have been successful and valuable for understanding and improving student learning. As such, we conclude with a discussion of the unique challenges that AI may introduce to assessment practice to point to opportunities for continued research and development.

2. Background

2.1. The standard assessment paradigm

Mislevy and colleagues argue that educational assessment is often framed within the standard assessment paradigm (SAP) (Mislevy, Behrens, Dicerbo, & Levy, 2012). A predefined set of items (e.g., problems or questions) is used to infer claims about students' proficiency in one or more traits. The data used for these inferences are typically sparse, and student learning may not be the focus of the assessment. Instances of the SAP include widely used assessment techniques such as multiple-choice questions, essays, and short answer questions (Kaipa, 2021). While methods like these are widely used, they have several potential problems.

The first problem is a practical one. Assessments in the standard paradigm can be onerous. Assessment design requires carefully crafted items and techniques for translating student responses into evaluations of performance or learning—things like rubrics, answer keys, and, increasingly, sophisticated statistical models (Mislevy et al., 2012). Assessment is only one part of an educator's practice in classroom contexts. They also plan and lead learning activities, provide feedback, and, more generally, manage the classroom culture. Depending on the number of students, the other responsibilities of the educator, and how much help they have, manually designing assessments and making inferences from them can be burdensome and potentially error-prone (Suto, Nádas, & Bell, 2011).¹


Second, these assessments may be discrete, providing only snapshots of what students can do at a single point in time. While these snapshots may tell us something about what students do and do not know at a given time, they may tell us nothing about learning. As others have argued, one goal of assessment practices is to foster learning (see, for example, Wiliam, 2011). As understood in the learning sciences, learning is defined by change: for example, a change in mental representations (Perret-Clermont, 1980), a change from what you can do with help to what you can do alone (Vygotsky & Cole, 1978), or a process of acclimating to a new culture (Lave & Wenger, 1991). Without comparing snapshots across time, we have no sense of change and thus no sense of learning. This logic underlies many basic analyses of learning that control for prior knowledge. Just as we would be dubious of results that report only post-tests and claim that learning was observed, we should be wary of assessments that do the same.

Relatedly, there has been a shift in the literature on learning, particularly in the learning sciences and computer-supported collaborative learning, that argues that learning processes, in addition to learning outcomes, are worthy objects of study (Puntambekar et al., 2011). Increasingly, it is becoming evident that understanding learning processes over time is critical to both student progress and fundamental questions of how learning happens (Lodge, 2018). The capacity for students to engage in effective self-regulation of their learning (e.g., Panadero, 2017), to make sound judgements about their progress (e.g., Boud, Ajjawi, Dawson, & Tai, 2018), and to change strategies when needed (e.g., Alter, Oppenheimer, Epley, & Eyre, 2007) is vital, not only for the task at hand but for the longer-term learning and development of the learner. Moreover, understanding processes that are indicative or predictive of learning can help to inform feedback, interventions, and other pedagogical moves that might positively affect learning (Puntambekar et al., 2011).

Third, assessments in the SAP may be uniform in the sense that the same tasks or items are given to each student regardless of their prior knowledge, abilities, experiences, and cultural backgrounds. This issue is related to the first. If the assessment practice is not calibrated to the students' current state, then it speaks only to performance at the moment and not learning as we have come to define it. Moreover, viewing assessments as one-size-fits-all may introduce bias to the assessment in the sense that all students may not have equal opportunities to demonstrate their learning (Gipps & Stobart, 2009).

Fourth, assessments in the SAP are often inauthentic. Take essay-based assessments as an example. People for whom writing is a part of their profession write with help. They research and use the ideas of others, share drafts, get feedback, and revise; they use tools like word processors that correct their spelling, grammar, and usage, and sometimes suggest text. In contrast, writing for assessments may look quite different. Graduate study admissions tests such as the Graduate Record Examinations (GRE) ask people to write in isolation and without access to tools that are now a standard part of writing practice (ETS, 2022). This misalignment between authentic practice and classroom culture bears on assessment more broadly. As Brown, Collins, and Duguid argue:

When authentic activities are transferred to the classroom, their context is inevitably transmuted: they become classroom tasks and part of the school culture. Classroom procedures, as a result, are then applied to what have become classroom tasks. The system of learning and using (and, of course, testing) thereafter remains hermetically sealed within the self-confirming culture of the school. Consequently, contrary to the aim of schooling, success within this culture often has little bearing on performance elsewhere (Brown, Collins, & Duguid, 1989, pg. 36).

Finally, assessments in the SAP are often antiquated because they assess skills that are becoming increasingly obsolete. As Shaffer and Kaput (1998) argue, computational media like computers make it possible to externalise information processing much like written records make it possible to externalise information storage. This change distributes some cognitive tasks onto the computational media—for example, calculations in the case of doing mathematics with a calculator and editing in the case of writing with a word processor—and frees humans up to do other kinds of tasks. These other tasks might include understanding the problem, representing the problem in a variety of external processing systems, and using the results of these systems in meaningful ways, rather than doing the actual processes themselves. Consequently, they argue that, in many cases, pedagogy and–we contend–assessment should focus on the new kinds of tasks and skills afforded by external processing systems.

Despite often being onerous, discrete, uniform, inauthentic, and antiquated, assessments in the SAP remain persistent in the culture of education. However, new advances in technology and artificial intelligence (AI) have come to permeate many aspects of human life—from how we work, to the products we buy, to how we spend our free time. Some classrooms, too, have come to use AI as part of their everyday practice (Hwang, Xie, Wah, & Gašević, 2020). This includes relatively established technologies such as automated essay grading software (Ke & Ng, 2019) and adaptive testing (van der Linden & Glas, 2010), alongside the more recent development of continuous data-driven assessment of students' online engagements with learning materials (Shute & Rahimi, 2021). There is also increasing interest in how AI-driven monitoring and manipulation of students' engagements with online learning environments such as games and simulations can support authentic assessment of skills and behaviours exhibited in situ. In short, as Cope and colleagues argue:

Assessment is perhaps the most significant area of opportunity offered by artificial intelligence for transformative change in education. However, this is not assessment in its conventionally understood forms. AI-enabled assessment uses dramatically different artifacts and processes from traditional assessments … Indeed, AI could spell the abandonment and replacement of traditional assessments, and with this a transformation in the processes of education (Cope, Kalantzis, & Searsmith, 2021, pg. 5).

In the following sections, we review some existing AI approaches that may help to address the issues associated with assessment in the SAP.²

¹ Of course, some SAP instances have widely implemented automated methods to make the assessment practice less onerous. These include relatively basic methods, such as the automatic scoring of multiple-choice questions, and more sophisticated techniques for generating and selecting items, scoring ill-formed and open-ended responses, and making inferences from log data. We describe these techniques in relation to artificial intelligence and assessment below.
² Our review here is not meant to be exhaustive, but instead to highlight some exemplar approaches that we argue can address some of the existing issues with traditional assessment practice.

3. Artificial intelligence for assessment

3.1. From onerous to feasible

AI-based techniques have been developed to fully or partially automate parts of traditional assessment practice. AI can generate assessment tasks, find appropriate peers to grade work, and automatically score student work. These techniques offload tasks from humans to AI and help to make assessment practices more feasible to maintain.


3.1.1. Automated assessment construction

One of the critical components of assessment design is the task used to elicit evidence to support claims about learning. In recent years, a handful of studies have proposed applying AI techniques to automate the generation of such assessment tasks, including multiple-choice questions and open-answer questions. Typically, these studies are built upon AI techniques driven by deep neural networks. For instance, Jia, Zhou, Sun, and Wu (2020) proposed to improve the quality of the generated questions in a two-step manner: the representation of the input text is derived by applying a Rough Answer and Key Sentence Tagging scheme, and then the input representation is further used by an Answer-guided Graph Convolutional Network to capture the inter-sentence and intra-sentence relations for question generation.

The success of such approaches often relies on the availability of large-scale and relevant datasets used to train those deep neural network models. When using these datasets to train a question generator, the source document related to each question (e.g., the transcript of a lecture video or a piece of reading material) often contains multiple sentences, and not every sentence is question-worthy. This suggests that the question-worthy sentences in an article should first be identified before we use them as input to the question generator. Driven by these findings, Chen, Yang, and Gasevic (2019) investigated the effectiveness of a total of nine sentence selection strategies in question generation and found that the stochastic graph-based method LexRank gave the most robust performance across multiple datasets.
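To make the graph-based selection step concrete, the sketch below implements a minimal LexRank-style ranker: sentences are vectorised with TF-IDF, a similarity graph is built from their pairwise cosine similarities, and a centrality score surfaces the most question-worthy candidates. This is an illustrative sketch under our own assumptions (the similarity threshold and the toy lecture transcript), not the implementation evaluated by Chen, Yang, and Gasevic (2019).

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_question_worthy(sentences, top_k=3, threshold=0.1):
    """Rank sentences with a LexRank-style graph centrality score."""
    # Represent each sentence as a TF-IDF vector.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    # Build a graph whose edges connect sufficiently similar sentences.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] >= threshold:
                graph.add_edge(i, j, weight=sim[i, j])
    # PageRank centrality approximates LexRank's notion of salience.
    scores = nx.pagerank(graph, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in ranked[:top_k]]

lecture = [
    "Photosynthesis converts light energy into chemical energy.",
    "The process takes place in the chloroplasts of plant cells.",
    "Thanks for joining today's lecture.",
    "Chlorophyll absorbs light most strongly in the blue and red bands.",
]
print(select_question_worthy(lecture, top_k=2))
```

The selected sentences would then be passed to the question generator, which is not shown here.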
While automated question generation can be a powerful tool for making assessment design more feasible for educators, it is not without its limitations. Large-scale datasets are needed to train the models that generate the questions. However, to our knowledge, most of the existing datasets are not of direct relevance to teaching and learning, with the exception of RACE (Lai, Xie, Liu, Yang, & Hovy, 2017) and LearningQ (Chen, Yang, Hauff, & Houben, 2018). And while metrics do exist for evaluating the quality of the generated tasks in terms of their overlap with human-crafted questions—for example, Bleu-N (Papineni, Roukos, Ward, & Zhu, 2002) and Meteor (Denkowski & Lavie, 2014)—these metrics do not guarantee the pedagogical value and appropriateness of the generated questions (Horbach, Aldabe, Bexte, de Lacalle, & Maritxalar, 2020).

3.1.2. AI-assisted peer assessment

The role of high-quality feedback in learner outcomes is well attested in educational research (Carless, in press). However, as class sizes increase, it becomes more challenging for instructors to provide rich and timely feedback. Peer assessment has been recognised as a sustainable and developmental assessment method that can address this challenge. Not only does it scale well to large class sizes, such as those in massive open online courses (MOOCs) (Shnayder & Parkes, 2016), it has also been demonstrated to promote a higher level of learning compared to one-way instructor assessment (Er, Dimitriadis, & Gašević, 2020). A range of educational platforms such as Mechanical TA (Wright, Thornton, & Leyton-Brown, 2015), Dear Beta and Dear Gamma (Glassman, Lin, Cai, & Miller, 2016), Aropä (Purchase & Hamer, 2018), CrowdGrader (De Alfaro & Shavlovsky, 2014), and RiPPLE (Khosravi, Kitto, & Williams, 2019) have been developed to support peer assessment.

Although some prior work has reported on learners' ability to evaluate resources effectively (Abdi, Khosravi, Sadiq, & Demartini, 2021; Whitehill, Aguerrebere, & Hylak, 2019), the judgements of students as experts-in-training cannot wholly be trusted, which compromises the reliability of peer assessment as an assessment instrument. However, some steps can be taken to increase reliability. One common strategy, used in most of the platforms mentioned above, is to rely on the wisdom of a crowd rather than one individual by employing a redundancy-based strategy and assigning the same task to multiple users. This raises a new problem, commonly referred to as the consensus problem: in the absence of ground truth, how can we optimally integrate the decisions made by multiple individuals towards an accurate final decision (Zheng, Li, Li, Shan, & Cheng, 2017)?

A simple approach would be to use summary statistics such as the mean or median. However, summary statistics suffer from the assumption that all students have similar judgemental ability, which has proven incorrect (Abdi et al., 2021). An alternative is to use advanced consensus approaches that incorporate AI models to infer the reliability of each assessor (Darvishi, Khosravi, & Sadiq, 2020, 2021). Using such models allows the system to use a weighted aggregation that emphasises the marks provided by the more reliable students. A related line of research has focused on developing spot-checking methods (Wang, An, & Jiang, 2018) that optimally utilise the minimal availability of instructors to review the most controversial cases (i.e., those with low algorithmic confidence or low inter-rater agreement) and provide explanations of the outcome to learners so that they can receive valuable individualised feedback.
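The sketch below illustrates the general shape of such reliability-aware consensus: each assessor's reliability is estimated from how far their marks sit from the current consensus, and the consensus is then recomputed as a reliability-weighted average, iterating until the two stabilise. It is a simplified stand-in built on our own assumptions, not the models proposed by Darvishi, Khosravi, and Sadiq (2020, 2021).

```python
def weighted_consensus(marks_by_assessor, n_iter=20):
    """marks_by_assessor: {assessor: {item: mark}}. Returns (consensus, reliability)."""
    items = {item for marks in marks_by_assessor.values() for item in marks}
    reliability = {a: 1.0 for a in marks_by_assessor}
    consensus = {}
    for _ in range(n_iter):
        # Weighted mean of the marks for each item, weighted by assessor reliability.
        for item in items:
            num = sum(reliability[a] * m[item]
                      for a, m in marks_by_assessor.items() if item in m)
            den = sum(reliability[a]
                      for a, m in marks_by_assessor.items() if item in m)
            consensus[item] = num / den
        # Down-weight assessors whose marks deviate from the current consensus.
        for a, m in marks_by_assessor.items():
            error = sum(abs(m[i] - consensus[i]) for i in m) / len(m)
            reliability[a] = 1.0 / (1.0 + error)
    return consensus, reliability

marks = {
    "s1": {"essay1": 8, "essay2": 6},
    "s2": {"essay1": 7, "essay2": 6},
    "s3": {"essay1": 2, "essay2": 10},  # an unreliable outlier
}
consensus, reliability = weighted_consensus(marks)
print(consensus, reliability)
```

In this toy run, the outlying assessor ends up with a lower reliability weight and therefore contributes less to the final marks.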
3.1.3. Writing analytics

The automated assessment of student writing has been a rich area of research since at least 1966 (Page, 1966). While both long-form and short-answer responses have been investigated, the most successful approaches have focused on scoring longer student works. For example, several systems have been developed and used in practice for automated essay scoring, among which MI Write (Graham, Hebert, & Harris, 2015) is a representative example.

MI Write offers a web-based interactive system for students to practice and improve their writing skills. For every essay, MI Write provides a student with an overall score and six trait scores (i.e., development of ideas, organisation, style, word choice, sentence fluency, and conventions) so that the student can focus on specific aspects of the essay. Several studies have demonstrated that automated essay scoring tools like MI Write can help students to improve their writing motivation (Wilson & Czik, 2016), writing self-efficacy (Wilson & Roscoe, 2020), and writing skills (Palermo & Thomson, 2018), and can help teachers to facilitate their practices and effectively influence students' writing motivation and independence (Wilson et al., 2021).

A useful survey of automated essay scoring was provided by Ke and Ng (2019), who describe the various types of AI techniques developed and applied to the problem. Typically, these AI techniques tackle the scoring task as (a) a regression task, which aims to directly predict the score of an essay and often employs techniques like linear regression (Crossley, Allen, Snow, & McNamara, 2015) and support vector regression (Klebanov, Madnani, & Burstein, 2013); (b) a classification task, which aims to classify an essay into one of a number of categories (e.g., low quality vs. high quality) and often employs techniques like Bayesian network classification (Rudner & Liang, 2002); or (c) a ranking task, which aims to compare essays according to their quality and often employs techniques like support vector machines (Yannakoudakis & Briscoe, 2012) and LambdaMART (Chen & He, 2013). Other tools focus more on providing feedback to students rather than an overall evaluation. For example, the tool AcaWriter combines natural language processing and pattern matching to identify the presence and absence of certain rhetorical moves and provide relevant feedback (Knight et al., 2020).
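As a minimal illustration of the regression formulation, the sketch below fits a regularised linear model to bag-of-words features of human-scored essays and uses it to predict a score for a new essay. The toy essays, scores, and feature choices are our own assumptions; production systems rely on far richer linguistic features and much larger training sets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# A toy training set: essays previously scored by human raters.
essays = [
    "The experiment shows a clear link between light and plant growth.",
    "plants grow good with sun i think",
    "Controlling for soil and water, light exposure best explains growth.",
    "growth happens sometimes",
]
human_scores = [5, 2, 6, 1]

# Bag-of-words features with a regularised linear model stand in for the
# regression formulation of automated essay scoring.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(essays, human_scores)

# Predict a holistic score for an unseen essay.
print(model.predict(["Light exposure explains the growth we observed."]))
```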
Another research line closely related to automated essay scoring is plagiarism detection software, e.g., Turnitin (Heckler, Rice, & Hobson Bryan, 2013). Different from systems used for automated essay scoring, Turnitin compares a submission from a student against a large collection of relevant documents, which may consist of submissions from other students, online articles, and academic publications. From this comparison, Turnitin generates a report indicating whether any significant chunk of text from the submission matches another source, which instructors can use to determine whether it is a case of plagiarism. A recent systematic literature review (Foltýnek, Meuschke, & Gipp, 2019) showed significant advancement in plagiarism detection with the increased use of AI techniques—specifically, semantic text analysis methods (e.g., latent semantic analysis and word embeddings) and machine learning algorithms.
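The surface-matching idea behind such reports can be illustrated in a few lines of code: the sketch below finds the longest run of words shared by a submission and a candidate source. This is only the lexical baseline—the semantic methods cited above (latent semantic analysis, word embeddings) aim to catch paraphrased reuse that exact matching misses—and the example texts and word threshold are invented for illustration.

```python
from difflib import SequenceMatcher

def longest_shared_passage(submission, source, min_words=8):
    """Return the longest run of words shared by a submission and a source document."""
    sub_words, src_words = submission.lower().split(), source.lower().split()
    match = SequenceMatcher(None, sub_words, src_words).find_longest_match(
        0, len(sub_words), 0, len(src_words))
    if match.size >= min_words:
        return " ".join(sub_words[match.a:match.a + match.size])
    return None

submission = ("Assessment is perhaps the most significant area of opportunity "
              "offered by artificial intelligence for education today.")
source = ("Some argue that assessment is perhaps the most significant area of "
          "opportunity offered by artificial intelligence in schools.")
print(longest_shared_passage(submission, source, min_words=5))
```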


3.2. From discrete to continuous

While traditional assessment practices may take discrete snapshots of performance, several AI techniques have been developed that afford a more continuous view of performance and thus insights into learning. Some of these approaches take traditional assessment practices, such as quizzes and exams, and move them to digital environments, while others apply to quite different assessment tasks and evidence.

3.2.1. Electronic assessment platforms

In recent years, electronic assessment platforms (EAPs) that provide the ability for exams to be administered online or offline have become increasingly popular (Llamas-Nistal, Fernández-Iglesias, González-Tato, & Mikic-Fonte, 2013). Key advantages of EAPs include the ability to deliver questions that would be difficult or impossible to deliver on paper (such as questions incorporating multimedia), to present questions in a predetermined or random order, and to provide learners with rapid and personalised feedback (Dennick, Wilkinson, & Purcell, 2009).

As EAPs have evolved, the data extracted from each exam episode have become more sophisticated, allowing for scrutiny beyond traditional techniques like item analysis. These data may include timestamps for every action and response made by an examinee throughout their exam. Not only can these snapshots be used for exploring software bugs and investigating suspected academic misconduct, but they are increasingly used to better understand learners' behaviour. In particular, previous research has investigated: measuring and classifying test-taking effort (Wise & Gao, 2017); answering and revising behaviour during exams (Pagni et al., 2017); metacognitive regulation of strategy and cognitive processing (Goldhammer et al., 2014); the validation of test score interpretation (Engelhardt & Goldhammer, 2019); detecting rapid-guessing and pre-knowledge behaviours (Toton & Maynes, 2019); modelling examinees' accuracy, speed, and revisits (Bezirhan, von Davier, & Grabovsky, 2021); modelling students in real time while taking a self-assessment (Papamitsiou & Economides, 2017); and understanding students' performance in various contexts such as complex problem solving (Greiff, Stadler, Sonnleitner, Wolff, & Martin, 2015).

3.2.2. Stealth assessment

Relatedly, stealth assessment techniques collect data that go beyond whether students have simply answered questions correctly. The term "stealth assessment" was coined by Shute and Ventura (2013) for an approach in which they used data automatically collected from learners as they played a digital game. They developed measures of conscientiousness, creativity, and physics ability by collecting data generated in a digital physics game commonly used in schools. They built models of the expected trajectory of behaviour evident in the game as students increased in capability, called a construct map (Wilson, 2005). The data were then used to place each learner on this map, generating a dynamic assessment of the increasing capability of the learner as they played.

As it was initially conceived, stealth assessment has four critical components: (a) evidence-centered assessment design (Mislevy et al., 2003), (b) formative assessment and feedback to support learning, (c) the support of pedagogical decisions, and (d) the use of learner models that may include cognitive or non-cognitive information (Shute, 2011). Typically, stealth assessment following Shute's paradigm involves unobtrusively capturing traces of learner behaviour in digital gameplay environments and modelling learners via approaches such as Bayesian networks (Pearl, 1988).

While stealth assessment refers to a specific assessment design approach, elements of it have been widely adopted in the use of digital learning environments more generally. Using similar techniques, Griffin and Care (2015) used log stream data generated from two-player digital games to assess student performance in collaborative problem solving. Wilson and Scalise (2012) used a similar approach with log stream data generated from online tasks undertaken by students to generate measures of student ability to learn in networked digital environments. Each of these studies used custom-built digital tasks to generate the data. Milligan and Griffin (2016) extended this method to use process data derived on open platforms, using data from the log stream of MOOCs to generate assessments of learner agency. Stealth methods are now frequently used in commercial games and platforms for learning (Shute et al., 2021).

3.2.3. Latent knowledge estimation

A key component of both EAPs and stealth assessment is the ability to continuously track student actions and incorporate these actions into models of performance and learning. A widely used AI technique for generating these kinds of models is latent knowledge estimation (Corbett & Anderson, 1994). It is referred to as latent because knowledge cannot be directly observed; what can be observed is whether a learner can apply a knowledge component in some context. Intelligent tutoring systems use this idea to collect data about learners' actions at particular learning opportunities and whether they could correctly apply distinct knowledge components (Desmarais & Baker, 2012). In other words, learners produce a binary data point for each learning opportunity—they were either successful or unsuccessful in applying a knowledge component.

Bayesian knowledge tracing (BKT) is the best-known technique for latent knowledge estimation (Corbett & Anderson, 1994). The technique uses four parameters to estimate whether a learner can apply a knowledge component: (a) the probability that the learner has already mastered the knowledge component, (b) the probability of learning the knowledge component after a learning opportunity, (c) the probability of correctly applying the knowledge component even when the learner has not mastered it (guess), and (d) the probability of incorrectly applying the knowledge component although they know it (slip). While BKT has been widely popular, new knowledge tracing techniques have been proposed recently based on advancements in deep learning (Gervet, Koedinger, Schneider, & Mitchell, 2020), including the use of recurrent neural networks (Piech et al., 2015) and transformers (Shin et al., 2021).
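A minimal sketch of the BKT update using the four parameters described above is shown below; the parameter values are illustrative placeholders rather than estimates fitted to real tutoring data.

```python
def bkt_update(p_mastery, correct, p_transit=0.15, p_guess=0.2, p_slip=0.1):
    """One Bayesian knowledge tracing step for a single knowledge component."""
    if correct:
        # Posterior probability of mastery given a correct response.
        evidence = p_mastery * (1 - p_slip) + (1 - p_mastery) * p_guess
        posterior = p_mastery * (1 - p_slip) / evidence
    else:
        # Posterior probability of mastery given an incorrect response.
        evidence = p_mastery * p_slip + (1 - p_mastery) * (1 - p_guess)
        posterior = p_mastery * p_slip / evidence
    # Account for the chance of learning at this opportunity.
    return posterior + (1 - posterior) * p_transit

# Trace a learner's estimated mastery across a sequence of binary outcomes.
p = 0.3  # prior probability that the component is already mastered (placeholder)
for outcome in [0, 1, 1, 0, 1, 1]:
    p = bkt_update(p, outcome)
    print(f"observed {'correct' if outcome else 'incorrect'} -> P(mastery) = {p:.2f}")
```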
Knowledge tracing has also been used as a foundation for developing a technique—moment-by-moment learning (Baker, Goldstein, & Heffernan, 2011, 2013)—that can infer the exact moment when a learner mastered a particular skill. Not only has this technique been applied to learning about specific subject matter, but it has also been used to estimate how well learners self-regulate their learning (Molenaar, Horvers, & Baker, 2021) and to offer personalised visualisations (Molenaar, 2022).

3.2.4. Learning processes

Traditional assessment practice has tended to focus on judging an artefact produced by the learner, such as an essay, a laboratory report, or a completed examination sheet. The main reason it has been difficult, if not impossible, to track learning processes is that doing so is very time- and resource-intensive. Constant monitoring of progress and the ongoing collection of indicators that allow inferences about cognitive and metacognitive processes are required. These can include self-report, behavioural, psychophysiological, and other data. Collecting and analysing these data has, to date, been arduous, requiring specialised equipment, laboratories, and analysis. Building on approaches such as the stealth assessment discussed previously, AI can be used to better understand trends in learning processes.

Recent developments in multimodal data collection, learning analytics, and AI afford opportunities to improve the assessment of processes. For example, the use of multichannel data such as clickstreams, mouse movements, and eye-tracking (Azevedo & Gašević, 2019; Järvelä, Malmberg, Haataja, Sobocinski, & Kirschner, 2020), along with enhanced instrumentation of learning environments such as the use of highlights or bookmarks (Van Der Graaf et al., 2021; Jovanović, Gašević, Pardo, Dawson, & Whitelock-Wainwright, 2019; Zhou & Winne, 2012), can offer empirical accounts of processes related to motivation, affect, cognition, and metacognition. Promising directions for assessing learning processes are being developed by analysing multichannel data with different AI and machine learning techniques such as deep learning, process mining, and network analysis (Ahmad Uzir, Gašević, Matcha, Jovanović, & Pardo, 2020; Fan, Saint, Singh, Jovanovic, & Gašević, 2021; Saint, Gašević, Matcha, Uzir, & Pardo, 2020).
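As a toy illustration of process-oriented analysis of such log data, the sketch below estimates a first-order transition model over logged learning actions—a much-simplified cousin of the process mining and network analysis techniques cited above. The event labels are invented for the example.

```python
from collections import Counter, defaultdict

def transition_matrix(event_log):
    """Estimate first-order transition probabilities between logged learning actions."""
    counts = defaultdict(Counter)
    for previous, current in zip(event_log, event_log[1:]):
        counts[previous][current] += 1
    return {prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for prev, nxts in counts.items()}

# A toy clickstream from one study session.
log = ["read", "highlight", "read", "quiz", "feedback", "read", "quiz", "feedback"]
for prev, nxts in transition_matrix(log).items():
    print(prev, "->", nxts)
```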


3.3. From uniform to adaptive

Rather than giving the same assessment task to all students, AI techniques have been developed that adjust the task to the student's abilities, giving them tailored assessment experiences.

Computerised adaptive testing systems (CATs) conduct an exam using a sequence of successively administered questions to maximise the precision of the system's current estimate of the student's ability. There are five inter-connected technical components for building a CAT (Thompson, 2007): (1) a pool of items calibrated with pre-testing data; (2) a specific starting point for each examinee; (3) an item selection algorithm to select the next item; (4) a scoring algorithm to estimate the examinee's ability; and (5) a termination criterion for the test.

Item response theory (IRT) (Embretson & Reise, 2013) is a common psychometric technique used in many CATs for calibrating the items. One of the key characteristics of IRT that makes it a good fit for CAT is that it places the ability of examinees and the difficulty level of items on the same metric, which helps the item selection algorithm decide which item should be administered next. Heuristically, an examinee is measured most effectively when test items are neither too difficult nor too easy. Because IRT places exam-takers and items on the same metric, it can identify an item that matches the examinee's current ability. Consequently, if the examinee answers an item correctly, the next item selected should be more difficult; if the answer is incorrect, the next item should be easier.

To make adaptive testing operational, the size of the item pool must be large enough that the selection algorithm can administer a suitable item based on the examinee's current ability. An important factor in a CAT is the start point. If the system has some knowledge about the examinee, it can optimise the starting point to their ability; otherwise, it may assume the examinee is of average ability. Once an item is administered, the CAT updates its estimate of the examinee's ability level. This is commonly done by updating the item response function using either maximum likelihood estimation or Bayesian estimation (Sorrel, Barrada, de la Torre, & Abad, 2020), or rating systems such as Elo ratings (Abdi, Khosravi, Sadiq, & Gasevic, 2019; Verschoor, Berger, Moser, & Kleintjes, 2019). Finally, the exam is usually terminated once the system estimates the student's ability with a confidence level that exceeds a user-specified threshold. CATs have been demonstrated to shorten exams by 50% while maintaining higher reliability in comparison to regular exams (Collares & Cecilio-Fernandes, 2019).
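The core CAT loop can be sketched with a two-parameter logistic IRT model: select the unadministered item with the greatest Fisher information at the current ability estimate, observe the response, and update the estimate. The sketch below follows this logic, but the item parameters and simulated responses are made up, and the simple nudge-style ability update stands in for the maximum likelihood, Bayesian, or Elo-based estimators used in practice.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic IRT: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of an item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

def next_item(theta, items, administered):
    """Pick the unadministered item that is most informative at theta."""
    candidates = [i for i in items if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))

# Item pool: item id -> (discrimination a, difficulty b), assumed pre-calibrated.
items = {"q1": (1.2, -1.0), "q2": (1.0, 0.0), "q3": (1.5, 0.8), "q4": (0.9, 1.5)}

theta, administered = 0.0, set()                  # start from an average-ability assumption
responses = {"q1": 1, "q2": 1, "q3": 1, "q4": 0}  # simulated examinee answers

for _ in range(3):
    item = next_item(theta, items, administered)
    administered.add(item)
    correct = responses[item]
    # Simple gradient-style nudge toward the response (stand-in for MLE/Bayesian updates).
    theta += 0.5 * (correct - p_correct(theta, *items[item]))
    print(f"administered {item}, correct={correct}, theta={theta:.2f}")
```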
3.4. From inauthentic to authentic

Authentic assessments measure learning using tasks that simulate those undertaken by actual members of some community of practice (Reeves & Okey, 1996). AI techniques are now being used to augment simulated tasks and analyse the evidence associated with them.

In both virtual and physical learning environments, AI has come to play an essential role. For example, in virtual simulations called virtual internships, learners intern at a fictional company where they work in teams to design a product (Shaffer, 2006a, 2006b). The goal of virtual internships is to give learners scaffolded experience doing the kinds of things that actual professionals do, such as conducting background research, holding design meetings, reporting to supervisors, and developing and testing prototypes. In offline simulations, such as those used in healthcare, students and practitioners apply critical clinical knowledge in close-to-real-life situations (e.g., addressing an antibiotic reaction, simulating surgery) (Sullivan et al., 2018; Echeverria, Martinez-Maldonado, & Buckingham Shum, 2019). The physical learning spaces closely mimic those spaces that students will experience in the future.

Simulations for learning are designed to help learners do the kinds of things that professionals do. But in the real world it may be too difficult, expensive, or dangerous to let them do so. More importantly, they necessarily lack the expertise to do so—this expertise, after all, is what they are trying to learn. To address this issue, virtual internships use AI to create an environment in which it is possible, safe, and effective for students to act like professionals. This is done via simulated professional tools, automated messages from co-workers and supervisors, and automated feedback on work products. Similarly, prospective nurses and physicians do not work with actual patients in physical healthcare simulations. In some cases, they work with simulated patients who use AI to behave like actual patients—for example, they exhibit specific symptoms at specific times (Echeverria et al., 2019).

In addition to augmenting the assessment tasks and environment, AI may collect, represent, and assess data from authentic assessments. Given that authentic assessments may involve multiple individuals or groups performing complex and ill-defined tasks, it can be challenging for educators to be aware of all that is going on during a simulation and provide detailed feedback, especially to large cohorts (Murphy, Fox, Freeman, & Hughes, 2017). Like stealth assessment and AI-driven assessments of learning processes, AI is one way to address the complexity of these assessment situations via integrated data collection and modelling.

For example, in virtual internships, the online platform automatically logs student chat messages. To relate this evidence to claims about learning, a supervised machine learning algorithm is used to automatically classify the chats as evidence of elements of an epistemic frame, and epistemic network analysis (Shaffer, Collier, & Ruis, 2016) is used to identify relationships among these elements. A dashboard integrates these techniques into live representations of the epistemic networks that educators can use to monitor group interaction and plan interventions in real time (Herder et al., 2018).
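To give a flavour of this pipeline, the sketch below takes chat turns that have already been coded for epistemic frame elements and counts how often pairs of elements co-occur within a sliding window of the conversation—the kind of connection weights that epistemic network analysis models and visualises far more rigorously. The codes and window size are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(coded_turns, window=3):
    """Count how often pairs of codes co-occur within a sliding conversation window."""
    edge_weights = Counter()
    for start in range(len(coded_turns)):
        window_codes = set().union(*coded_turns[start:start + window])
        for pair in combinations(sorted(window_codes), 2):
            edge_weights[pair] += 1
    return edge_weights

# Each chat turn is represented by the set of epistemic frame elements coded in it.
coded_turns = [
    {"data"},
    {"data", "client_needs"},
    {"design_decision"},
    {"client_needs", "justification"},
    {"design_decision", "justification"},
]

for (a, b), weight in cooccurrence_network(coded_turns).most_common():
    print(f"{a} -- {b}: {weight}")
```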
In offline simulations, multimodal learning analytics are being developed to capture millions of data points–including system logs, position coordinates, speech, and physiological traces–in physical spaces and in a relatively short amount of time. AI may be integral to the functioning of these sensors, as in the case of automated transcription tools. To make these data available to educators, one approach that has been adopted is to use data storytelling principles to create interfaces in which stories are extracted from the complex multimodal data to focus on one learning or reflection goal at a time. For example, Echeverria and colleagues (2020) focused on creating data stories related to common errors performed by nursing students based on automated assessments of the sequence and timeliness of their logged actions.

3.5. From antiquated to modern

Computational media like computers, calculators, and software make it possible to externalise information processing in new and powerful ways. While computational media exist in various domains, here we briefly focus on some of those developed for writing tasks as an example.

Digital word processors have been in use since at least the 1970s (Bergin, 2006). In addition to simply recording and storing text, their primary function has been to offload typical writing tasks, such as editing, from humans to computers. Digital word processors commonly include automated techniques for checking spelling, grammar, and usage. As these tools have developed, they have increasingly come to rely on AI to complete more sophisticated tasks.


Today's digital word processors like Microsoft Word and Google Docs include AI techniques that suggest word and sentence completions (Microsoft, 2022). Other commercial tools, like Grammarly (Grammarly, 2022), include AI that infers tone and style. AI-based tools like Sudowrite (Marche, 2021) now exist that generate entirely new sections of text based on a few sample lines. Because these tools may be used by learners and professionals in their everyday practices, assessment designs may incorporate them. Using tools to do increasingly complex and humanlike tasks has important implications for assessment, some of which we discuss below.

4. Challenges for AI and assessment

Thus far, we have highlighted a set of issues with the SAP and reviewed some AI-based approaches that bear on these issues. While the sections above suggest that AI can improve the SAP, we acknowledge that this paradigm has a long and, arguably, successful history. It is worth, then, reflecting on what we might lose–or what other problems we might create–by introducing AI to this paradigm.

4.1. The sidelining of professional expertise

Many researchers seek to develop AI technologies that support and guide teachers' decision-making, freeing teachers from routine, uncontentious tasks and decisions while continuing to defer to teachers' ultimate judgment and oversight (see, for example, Herder et al., 2018). In this sense, it is reassuring to imagine that AI-enabled assessment will retain humans-in-the-loop, with teachers able to oversee and override any automated decision when they see fit.

However, one potential danger of automated decision-making is the sidelining of professional expertise—that is, machine calculations and outputs being deferred to or automatically taken as correct. A hypothetical example of this can be seen with plagiarism software at educational institutions. In the past, teachers made decisions regarding whether student submissions were too similar to one another or to available sources. However, given the volume of possible sources and advances in natural language processing, AI can now handle this task in many contexts. Given the difficulty of this task and the efficacy of existing algorithms, it is possible, and perhaps easy, for educators to take the output as a correct decision rather than a tentative suggestion. It would take a confident and time-rich teacher to regularly challenge these systems' outputs. As such, there are understandable concerns that we face the prospect of teachers' decision-making capacity being 'hollowed out' as automated assessment systems "creat[e] a distance between their decisions and the evidence-gathering processes on which those decisions must rely" (Couldry, 2020, p. 1139).

To prevent such a hollowing-out, researchers have begun to design systems in which the decision-making processes are explainable to the teacher (Rosé, McLaughlin, Liu, & Koedinger, 2019; Khosravi et al., 2022). While this is a promising direction, more work is needed to better understand the balance between AI and teacher decision-making that is best for teaching, learning, and assessment.

4.2. The black-boxing of accountability

While many researchers might argue that it is not their intention to do so (see, for example, Baker, 2016), taking human teachers out of the assessment loop is likely to be an appealing prospect for many key stakeholders involved in school and university education. Educational institutions may welcome the capacity for the reliable, timely production of assessment data at scale—avoiding inconsistencies over mismarking or delays resulting from the marking simply not being done on time.

Similarly, many teachers may be happy to defer responsibility and dodge the awkward task of personally grading students whom they have grown to know—particularly given current tendencies for students to appeal and contest grades, and even initiate legal action over misgrading (see, for example, Griffiths, 2021). Students, too, might welcome not having to subject themselves to the vulnerability of being judged by their teachers, schools, or other social institutions close to home—in other words, the frictions of being assessed by people who actually know them.

Yet, AI-enabled assessment is not a simple case of deferring educational judgements to the dispassionate, objective, reliable gaze of the machine. There is no such thing as neutral, dispassionate non-human assessment (Mayfield et al., 2019; Scheuneman, 1979). Instead, AI-enabled assessment can more accurately be described as handing those decisions over to programmers, learning engineers, instructional designers, software vendors, and other humans who have no direct knowledge of the students being assessed, their local contexts, or even necessarily the educational systems that they are studying within. Thus, as with any form of assessment, AI-enabled assessment is a partial rather than a purely objective process. As Hanesworth and colleagues put it:

No matter the structures and processes put in place, assessments are designed and evaluated by humans, with all their complex socio-cultural backgrounds, educational experiences, and intellectual and personal values (Hanesworth, Bracken, & Elkington, 2019, pg. 99).

In the case of AI-based assessment, the responsibility for the modelling and execution of educational assessment is deferred to distant others (programmers, learning engineers). On the one hand, this can be welcomed as distancing assessment decisions from the biases and assumptions of classroom teachers. Yet, on the other hand, it also raises concerns that need to be taken more seriously in terms of how AI-enabled assessment then exposes the student to the biases, values, and assumptions of those other people who otherwise have no knowledge of or personal investment in those who are being assessed.

At the very least, in practical terms, these concerns raise the pressing need for rigorous oversight of any AI-enabled assessment and the establishment of clear lines of accountability for the decisions that these systems and software produce—as well as clear lines of accountability for how software outputs are then translated into final grades by educational institutions.

4.3. Restricting the pedagogical role of assessment

Amidst the current enthusiasm for AI-enabled assessment, there is little acknowledgement of the pedagogical role of assessment. This relates to the idea that educational assessment is not solely a matter of gauging what a student has (and has not) learnt (Wiliam, 2011). Instead, when considering the consequences of increased use of AI-based assessments, it is important to consider how this might impact the ability of educators to engage with assessment as a pedagogical act.

For example, on a personal level, teachers will often use traditional forms of teacher-graded assessment to motivate, support, and cajole students (Cauley & McMillan, 2010; Harlen, 2012). This might involve showing leniency when the teacher feels that a student will benefit from being encouraged and seen to succeed. Alternatively, it might involve being more punitive where a teacher feels that a student might benefit from an intervention. In both instances, the act of assessment is rooted in the personal relationships and knowledge that a teacher has established with her student.

Many educators also pay close attention to what is learnt from any assessment act. This is implicit in some educators' use of alternate forms of assessment. For example, the rising popularity of peer assessment is rooted primarily in encouraging self-reflection among students on their own work (Cho & Cho, 2011; Topping, 2018). The trend for allowing student-led self-assessment is similarly based on intentions to develop student deliberation on one's own learning practices. Similarly, growing interest in the use of 'assessment for social justice' seeks to support students in engaging with multiple and contested perspectives and in dealing with variation arising from contextual differences, historical aspects, and personal normativities (see McArthur, 2016; Hanesworth et al., 2019). This might entail, for example, allowing students to take a leading role in collectively deciding on the nature and form of how they are assessed. In all cases, the intention is to support students to reflect on educational processes and practices rather than produce an objective 'measure' of learning.


Hanesworth et al., 2019). This might entail, for example, allowing stu­ assessable and algorithmically rewarded. Put another way, “teaching to
dents to take a leading role in collectively deciding on the nature and the test” (Popham, 2001) is not necessarily avoided using AI-enabled
form of how they are assessed. In all cases, the intention is to support assessment.
students to reflect on educational processes and practices rather than At the same time, there is also a need for discussions of AI-enabled
produce an objective ‘measure’ of learning. assessment to better acknowledge the many forms of learning that
Concerns can be raised that some AI-enabled assessments prevent cannot yet be detected, measured, and modelled by non-humans. AI
teachers from using assessment in these alternate ways. Yet, such ex­ software is notoriously limited in detecting meaning in language or
amples also highlight the value-driven nature of how educational images—be it the simple development of a logical argument to nuance
assessment is undertaken—an aspect that has not featured in many and inflection such as irony and sarcasm. For example, natural language
discussions of AI-enabled assessment. The idea of ‘assessment for social processing technology might have a near-infinite capacity to recognise
justice’ certainly conveys a distinct set of values about what education is vocabulary but remains tone-deaf to the subtleties of language—double-
and what education is for. This, in turn raises questions about the im­ meanings, allusions, local vernacular, tone and subtext.
plicit values and ideological underpinnings of AI-enable assessment. Is it Similarly, AI-enabled assessment may remain understandably
fair to argue, as Saltman (2020, p.199) implies, that AI approaches to limited in its capacity to recognise (let alone assess) instances of
education appear to promote ideals of “standardized and improvisation, creativity, poetry, morals, or ethics. There may be little
transmission-oriented approaches to teaching”? Or, that AI-enabled room for recognising (and rewarding) distinctly different, unexpected
assessment corresponds closely with the employment conditions of the and perhaps unique ways of setting about a learning task— where stu­
post-Fordist neoliberal workplace—preparing future workers for con­ dents engage in genuine originality and ‘out of the box’ thinking that a
ditions of continual tracking, monitoring of performance, nudging of good human assessor would be able to appreciate (even if they would
behaviours, and so on. These are concerns that the community that have never thought of it themselves). In short, there exist aspects of
works on AI-enabled assessment need to engage with. If not these ideals learning that remain perceptible to humans but not machines. As such,
and values, then what are the values and ideological underpinnings that discussions of AI-enabled assessment need to be more forthcoming in
are being advanced through the development of AI-enabled assessment? acknowledging what the technology cannot (and may never) be capable
Researchers have also begun to address this issue, at least implicitly. of assessing.
For example, several researchers have called for a more prominent role
for educational and learning theory in the development of AI approaches 4.5. Surveillance pedagogy
(Rogers, Gasevic, & Dawson, 2016, pp. 232–250). These theories take a
stance on what is valued with respect to learning, and they may differ In one sense, AI-enabled assessment builds on some distinct logics of
markedly from transmission-oriented approaches, instead, focusing, for ‘datafication’ in education, such as the idea of continuous, compre­
example, on promoting the ideals of particular communities of practice hensive data generation relating to an individual’s ongoing engagement
(Shaffer, 2006a, 2006b) or the ability to regulate one’s learning (Aze­ with an online learning environment. This evokes promises of contin­
vedo and Gašević, 2019; Molenaar, 2022). uous assessment that are not necessarily recognised by students as
assessment – thereby overcoming issues of ‘test anxiety’ (Colwell, 2013)
4.4. Assessing limited forms of learning and allowing for all aspects of an individual’s learning to be made
visible. However, these promises of continual background data moni­
Extending the idea of AI-enabled assessment as curtailing different toring can be seen to constitute conditions of surveillance. As such, the
forms of teaching are concerns over restricted forms of learning implicit promise of data-driven educational environments “to make visible what
in the use of AI-enabled assessment. Of course, one of the central might otherwise be hidden or missed” (Bayne et al., 2020, p. 185) needs
promises of AI-enabled assessment is the capacity to recognise and to be acknowledged as potentially problematic, as well as potentially
respond to all the forms of learning prevalent in the digital age—to know beneficial.
things about what has been learnt that would otherwise remain un­ For example, there needs to be more acknowledgement in discus­
known. Yet this promise of comprehensive assessment of learning in all sions of AI-enabled assessment regarding how this state of continuous
its forms obscures that any form of assessment demarcates and de­ surveillance also lends itself to processes of control and compliance—for
lineates what is understood by learning in any education system (Mes­ example, monitoring for indications of malpractice and other forms of
sick, 1994). As Taras (2008, p.389) puts it, assessment is “the single most cheating. Of course, most aspects of formal education institutions such
important component” that shapes student learning. as schools and universities are seen to be based traditionally around
In this sense, concerns can be raised that many forms of AI-enabled ‘surveillance pedagogies’—not least the traditional set-up of the class­
assessment perpetuate the orientation of current traditional assess­ room or the examination hall—with seats arranged in rows, facing the
ment regimes toward emphasising skills, rational thinking and behav­ front of the class, teacher supervising student bodies (Luke, 2003;
iours, alongside predominantly white, male, middle-class, Western McLaren & Leonardo, 1998). Nevertheless, online education (and, by
values of objectivity and individualism (Hanesworth et al., 2019). In implication, AI-enabled assessment) extends and amplifies the scope of
other instances, the prominence of technologies such as eye-tracking this surveillance to all times and all spaces of the school day or the
highlights the dangers of AI-enable-assessment acting to reinforce and university experience.
privilege ableist—and especially—neurotypical models of learning and In this sense, it could be argued that AI-enabled assessment consti­
what it means to exhibit learning-related behaviours (Swauger, 2020). tutes an administrative—rather than a pedagogic—gaze, impinging on
All told, strong arguments can be made that AI-enabled assessment may the fragile conditions of trust that most educators see as underpinning
well alter—but not necessarily expand—the forms of learning that are the teacher/student relationship:
being assessed.
… in higher education settings, a culture of surveillance, facilitated
Thus, conversations in the research community need to explore the
and intensified by technology, risks creating conditions that are
contention that AI-enabled assessment is not a neutral site where any
highly risk averse and destructive of the trust basis on which aca­
form of learning will be detected and assessed. For example, as with any
demic and student autonomy and agency rely. Technology archi­
form of assessment, it could be argued that any instance of AI-enabled
tectures introduced to build trust by mapping performance may end
assessment will inevitably codify specific cultural, disciplinary and in­
up directly undermining these very goals” (Bayne et al., 2020,
dividual norms, value systems and knowledge hierarchies. Moreover, it
p.182).
may inculcate these norms, values and knowledge hierarchies within
students. Students will learn to perform in ways that are algorithmically It is also important to better consider the implications of continuous

7
Z. Swiecki et al. Computers and Education: Artificial Intelligence 3 (2022) 100075

surveillance of students. In particular, the SAP also conveys that learning can often best take place where there is no assessment. This contrasts with the purported benefits of continuous and comprehensive monitoring and assessment of students' educational engagement. As such, while by no means perfect, current forms of education are set up in ways that support learning and progression to occur during episodes where there is no assessment. Well-designed teaching offers many moments of rehearsal—recognising the vulnerability of learning and allowing students ample opportunity to learn in private, engage in preparatory work, experiment, make mistakes, and fail. In other words, the absence of assessment is seen as the best condition for learning and progression.

However, the importance placed on the absence of assessment also seems contradictory. How do we know whether it is good for learning if we cannot tell whether learning is occurring? Perhaps one way around this issue is to continue to shift the focus of assessment from evaluation or judgement to development. In this view, continual monitoring of student processes is not a means of determining whether someone is doing something "right" or "wrong" but rather a way of identifying opportunities to provide feedback and improve learning (Wiliam, 2011). Thus, what is needed is not necessarily a shift in how assessments take place but a conceptual shift in what they mean and what they are for.
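To make this reframing concrete, consider a minimal sketch of how the same monitored data could be used to trigger support rather than to render a verdict. The example below is purely illustrative and is not drawn from any system discussed in this paper; the function name suggest_feedback and its thresholds are assumptions.

```python
from typing import List, Optional

# Illustrative only: the same stream of monitored attempts can drive feedback
# rather than a pass/fail judgement. Names and thresholds are hypothetical.

def suggest_feedback(recent_outcomes: List[bool], window: int = 3) -> Optional[str]:
    """Return a feedback prompt when recent attempts suggest an opportunity to intervene."""
    if len(recent_outcomes) < window:
        return None  # too little evidence yet; keep observing rather than judging
    last = recent_outcomes[-window:]
    if not any(last):
        return "Several unsuccessful attempts in a row: offer a hint or a worked example."
    if all(last):
        return "Consistent success: suggest a more challenging task."
    return None  # mixed results; no intervention needed yet

attempts = [True, False, False, False]
print(suggest_feedback(attempts))
# -> Several unsuccessful attempts in a row: offer a hint or a worked example.
```

The point of the sketch is that nothing about the monitoring itself changes; what changes is that its output is a prompt for feedback rather than a judgement of correctness.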
4.6. Distributed assessment models

A final concern has to do with the changes that computational tools imply for assessment. One way to characterise assessment is as an argument from evidence (Messick, 1994). In evidence-centered assessment design, for example, this argument includes a student model that describes the traits, skills, or abilities to be assessed; a task model that describes activities students will do to produce evidence that they have those traits; and an evidence model that describes the variables and techniques that will be used to relate the evidence to the traits. One consequence of AI-based computational tools is that they complicate each of these models.
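As a concrete, deliberately simplified illustration of how these three models fit together, the sketch below represents each one as a small data structure and links them with a naive update rule. It is an assumption-laden toy rather than the evidence-centered design machinery described in the literature: all names (StudentModel, EvidenceRule, the proficiency labels) are hypothetical, and the update is a heuristic stand-in for a calibrated psychometric model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy sketch of an evidence-centered assessment argument.
# All names are hypothetical; the update rule is a heuristic, not a psychometric model.

@dataclass
class StudentModel:
    # Claims about the learner, e.g. proficiency estimates between 0 and 1.
    proficiencies: Dict[str, float] = field(default_factory=dict)

@dataclass
class TaskModel:
    # The activity the learner performs and the observables it is meant to yield.
    task_id: str
    description: str
    observables: List[str]

@dataclass
class EvidenceRule:
    # How one observable from a task bears on one proficiency claim.
    observable: str
    proficiency: str
    scoring_fn: Callable[[float], float]  # raw observation -> evidence value in [0, 1]

def update_student_model(student: StudentModel,
                         observations: Dict[str, float],
                         rules: List[EvidenceRule],
                         weight: float = 0.1) -> None:
    """Nudge each proficiency estimate toward the evidence produced by the rules."""
    for rule in rules:
        if rule.observable in observations:
            evidence = rule.scoring_fn(observations[rule.observable])
            current = student.proficiencies.get(rule.proficiency, 0.5)
            student.proficiencies[rule.proficiency] = (1 - weight) * current + weight * evidence

# Hypothetical usage
student = StudentModel(proficiencies={"argumentation": 0.5})
task = TaskModel(task_id="essay-01",
                 description="Write a short argumentative essay.",
                 observables=["claim_evidence_links"])
rules = [EvidenceRule(observable="claim_evidence_links",
                      proficiency="argumentation",
                      scoring_fn=lambda count: min(count / 5.0, 1.0))]

# The task declares which observables it yields; here we pretend the learner
# produced four claim-evidence links in their essay.
observations = {obs: 4.0 for obs in task.observables}
update_student_model(student, observations, rules)
print(student.proficiencies)  # e.g. {'argumentation': 0.53}
```

In practice the evidence model would typically be a statistical model, such as an item response or Bayesian network model, rather than the weighted average used here; the sketch only aims to show where the student, task, and evidence models sit relative to one another.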
In terms of the student model, the presence of AI suggests that we should adjust the traits, skills, and abilities assessed to be those that require human influence rather than those that AI can accomplish on its own. In terms of the task model, AI suggests that we should allow students to use AI-based computational tools during the assessment. And in terms of the evidence model, the presence of AI suggests that we should account for the fact that a human-AI team can generate assessment evidence. Depending on the sophistication of the AI, this could mean trying to separate the human and AI contributions, accounting for the relationship between these contributions, or treating them as if they came from the same source. While some attempts have been made to integrate assessment design theory with AI (see Mislevy et al., 2012), to date, they have mainly focused on the applications of AI to the evidence model and less so on the task and student models.
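To illustrate the evidence-model side of this problem, the fragment below tags each contribution to a piece of assessed work with its source so that human and AI contributions can later be separated, related, or pooled. It is a hypothetical sketch rather than an approach proposed in the literature; the Contribution fields and the weighting scheme are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical provenance tagging for assessment evidence produced by a human-AI team.

@dataclass
class Contribution:
    source: str      # "human" or "ai"
    content: str     # e.g., a sentence, code snippet, or proof step
    weight: float    # how much this contribution counts toward the final artefact

def contribution_shares(contributions: List[Contribution]) -> Dict[str, float]:
    """Return each source's share of the total weighted contribution."""
    total = sum(c.weight for c in contributions) or 1.0
    shares: Dict[str, float] = {}
    for c in contributions:
        shares[c.source] = shares.get(c.source, 0.0) + c.weight / total
    return shares

work = [
    Contribution("human", "Outlined the argument and selected sources.", 3.0),
    Contribution("ai", "Drafted two paragraphs from the outline.", 2.0),
    Contribution("human", "Revised the draft and added counterarguments.", 3.0),
]

print(contribution_shares(work))  # {'human': 0.75, 'ai': 0.25}
```

Whether such shares are reported separately, modelled jointly, or collapsed into a single score is precisely the design decision that the evidence model now has to make explicit.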
5. Conclusion

We have argued that several issues mar the standard assessment paradigm. First, assessments in this paradigm can be onerous for educators to design and implement. Second, they may only provide discrete snapshots of performance rather than nuanced views of learning. Third, they may be uniform and thus unadapted to the particular knowledge, skills, and backgrounds of participants. Fourth, they may be inauthentic, adhering to the culture of schooling rather than the cultures schooling is designed to prepare students to become members of. And finally, they may be antiquated, assessing skills that humans routinely use machines to perform.

While extant artificial intelligence approaches partially address the issues above, they are not a panacea. As our discussion highlights, these approaches bring with them a new set of challenges that must be considered when designing and implementing assessments. We hope that this paper brings both the issues with the standard assessment paradigm and the challenges associated with AI and assessment into a deeper conversation that will ultimately improve assessment practices more generally.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by funding from the Australian Research Council (DP210100060, DP220101209, CIRES/IC200100022), Economic and Social Research Council of the United Kingdom (ES/S015701/1), and Jacobs Foundation (CELLA 2 CERES, Research Fellowship Program). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors. They do not necessarily reflect the views of the funding agencies, cooperating institutions, or other individuals.

References

Abdi, S., Khosravi, H., Sadiq, S., & Demartini, G. (2021). Evaluating the quality of learning resources: A learner sourcing approach. IEEE Transactions on Learning Technologies, 14(1), 81–92.
Abdi, S., Khosravi, H., Sadiq, S., & Gasevic, D. (2019). A multivariate ELO-based learner model for adaptive educational systems. In Proceedings of the 12th international conference on educational data mining (pp. 462–467).
Ahmad Uzir, N., Gašević, D., Matcha, W., Jovanović, J., & Pardo, A. (2020). Analytics of time management strategies in a flipped classroom. Journal of Computer Assisted Learning, 36(1), 70–88.
Almond, R. G., Steinberg, L. S., & Mislevy, R. J. (2002). Enhancing the design and delivery of assessment systems: A four-process architecture. The Journal of Technology, Learning, and Assessment, 5.
Alter, A. L., Oppenheimer, D. M., Epley, N., & Eyre, R. N. (2007). Overcoming intuition: Metacognitive difficulty activates analytic reasoning. Journal of Experimental Psychology: General, 136, 569–576. https://doi.org/10.1037/0096-3445.136.4.569
Azevedo, R., & Gašević, D. (2019). Analyzing multimodal multichannel data about self-regulated learning with advanced learning technologies: Issues and challenges. Computers in Human Behavior, 96, 207–210.
Baker, R. S. (2016). Stupid tutoring systems, intelligent humans. International Journal of Artificial Intelligence in Education, 26(2), 600–614.
Baker, R. S., Goldstein, A. B., & Heffernan, N. T. (2011). Detecting learning moment-by-moment. International Journal of Artificial Intelligence in Education, 21(1–2), 5–25.
Baker, R. S., Hershkovitz, A., Rossi, L. M., Goldstein, A. B., & Gowda, S. M. (2013). Predicting robust learning with the visual form of the moment-by-moment learning curve. The Journal of the Learning Sciences, 22(4), 639–666.
Bayne, S., Evans, P., Ewins, R., Knox, J., Lamb, J., Macleod, H., et al. (2020). The manifesto for teaching online. MIT Press.
Bergin, T. J. (2006). The origins of word processing software for personal computers: 1976–1985. IEEE Annals of the History of Computing, 28(4), 32–47.
Bezirhan, U., von Davier, M., & Grabovsky, I. (2021). Modeling item revisit behavior: The hierarchical speed–accuracy–revisits model. Educational and Psychological Measurement, 81(2), 363–387.
Boud, D., Ajjawi, R., Dawson, P., & Tai, J. (2018). Developing evaluative judgement in higher education: Assessment for knowing and producing quality work. Abingdon, UK: Routledge.
Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning. Educational Researcher, 18(1), 32–42.
Carless, D. (2022). From teacher transmission of information to student feedback literacy: Activating the learner role in feedback processes. Active Learning in Higher Education. https://doi.org/10.1177/1469787420945845 (in press).
Cauley, K. M., & McMillan, J. H. (2010). Formative assessment techniques to support student motivation and achievement. The Clearing House: A Journal of Educational Strategies, Issues and Ideas, 83(1), 1–6.
Chen, H., & He, B. (2013). Automated essay scoring by maximizing human-machine agreement. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1741–1752).
Chen, G., Yang, J., & Gasevic, D. (2019). A comparative study on question-worthy sentence selection strategies for educational question generation. In Proceedings of the 20th international conference on artificial intelligence in education (pp. 59–70). Cham: Springer.
Chen, G., Yang, J., Hauff, C., & Houben, G. J. (2018). LearningQ: A large-scale dataset for educational question generation. In Proceedings of the 12th international AAAI conference on web and social media (pp. 481–490). AAAI.
Cho, Y. H., & Cho, K. (2011). Peer reviewers learn from giving comments. Instructional Science, 39(5), 629–643.
Collares, C. F., & Cecilio-Fernandes, D. (2019). When I say … computerised adaptive testing. Medical Education, 53(2), 115–116.
Colwell, N. M. (2013). Test anxiety, computer-adaptive testing and the common core. Journal of Education and Training Studies, 1(2), 50–60.
Cope, B., Kalantzis, M., & Searsmith, D. (2021). Artificial intelligence for education: Knowledge and its assessment in AI-enabled learning ecologies. Educational Philosophy and Theory. https://doi.org/10.1080/00131857.2020.1728732
Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.
Couldry, N. (2020). Recovering critique in an age of datafication. New Media & Society, 22(7), 1135–1151.
Crossley, S., Allen, L. K., Snow, E. L., & McNamara, D. S. (2015). Pssst... textual features... there is more to automatic essay scoring than just you. In Proceedings of the fifth international conference on learning analytics and knowledge (pp. 203–207).
Darvishi, A., Khosravi, H., & Sadiq, S. (2020). Utilising learner sourcing to inform design loop adaptivity. In Proceedings of the 14th European conference on technology-enhanced learning (pp. 332–346). Springer.
Darvishi, A., Khosravi, H., & Sadiq, S. (2021). Employing peer review to evaluate the quality of student generated content at scale: A trust propagation approach. In Proceedings of the eighth ACM conference on learning @ scale (pp. 139–150).
De Alfaro, L., & Shavlovsky, M. (2014). Crowdgrader: A tool for crowdsourcing the evaluation of homework assignments. In Proceedings of the 45th ACM technical symposium on computer science education (pp. 415–420).
Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation (pp. 376–380).
Dennick, R., Wilkinson, S., & Purcell, N. (2009). Online eAssessment: AMEE guide no. 39. Medical Teacher, 31(3), 192–206.
Desmarais, M. C., & Baker, R. S. (2012). A review of recent advances in learner and skill modeling in intelligent learning environments. User Modeling and User-Adapted Interaction, 22(1), 9–38.
Echeverria, V., Martinez-Maldonado, R., & Buckingham Shum, S. (2019). Towards collaboration translucence: Giving meaning to multimodal group data. In Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1–16).
Educational Testing Service. (2022, March 1). What to expect during the GRE general test. https://www.ets.org/gre/revised_general/test_day/expect
Embretson, S. E., & Reise, S. P. (2013). Item response theory. Psychology Press.
Engelhardt, L., & Goldhammer, F. (2019). Validating test score interpretations using time information. Frontiers in Psychology, 10, 1131.
Er, E., Dimitriadis, Y., & Gašević, D. (2020). A collaborative learning approach to dialogic peer feedback: A theoretical framework. Assessment & Evaluation in Higher Education, 46(4), 586–600.
Fan, Y., Saint, J., Singh, S., Jovanovic, J., & Gašević, D. (2021, April). A learning analytic approach to unveiling self-regulatory processes in learning tactics. In LAK21: 11th international learning analytics and knowledge conference (pp. 184–195).
Foltýnek, T., Meuschke, N., & Gipp, B. (2019). Academic plagiarism detection: A systematic literature review. ACM Computing Surveys, 52(6), 1–42.
Gervet, T., Koedinger, K., Schneider, J., & Mitchell, T. (2020). When is deep learning the best approach to knowledge tracing? Journal of Educational Data Mining, 12(3), 31–54.
Gipps, C., & Stobart, G. (2009). Fairness in assessment. In Educational assessment in the 21st century (pp. 105–118). Dordrecht: Springer.
Glassman, E. L., Lin, A., Cai, C. J., & Miller, R. C. (2016). Learner sourcing personalized hints. In Proceedings of the 19th ACM conference on computer-supported cooperative work & social computing (pp. 1626–1636).
Goldhammer, F., Naumann, J., Stelter, A., Tóth, K., Rölke, H., & Klieme, E. (2014). The time on task effect in reading and problem solving is moderated by task difficulty and skill: Insights from a computer-based large-scale assessment. Journal of Educational Psychology, 106(3), 608–626. https://doi.org/10.1037/a0034716
Graham, S., Hebert, M., & Harris, K. R. (2015). Formative assessment and writing: A meta-analysis. The Elementary School Journal, 115(4), 523–547.
Grammarly. (2022, March 1). About Grammarly. https://www.grammarly.com/about
Greiff, S., Stadler, M., Sonnleitner, P., Wolff, C., & Martin, R. (2015). Sometimes less is more: Comparing the validity of complex problem solving measures. Intelligence, 50, 100–113.
Griffin, P., & Care, E. (2015). Assessment and teaching of 21st century skills: Methods and approaches (Vol. 2). Dordrecht: Springer.
Griffiths, S. (2021). Families to sue over 'wrong' marks given by teachers. The Times. Retrieved from https://www.thetimes.co.uk/article/families-to-sue-over-wrong-marks-given-by-teachers-g2qjjc8x7
Hanesworth, P., Bracken, S., & Elkington, S. (2019). A typology for a social justice approach to assessment. Teaching in Higher Education, 24(1), 98–114.
Harlen, W. (2012). The role of assessment in developing motivation for learning. In J. Gardner (Ed.), Assessment and learning (2nd ed., pp. 61–80). SAGE Publications.
Heckler, N. C., Rice, M., & Hobson Bryan, C. (2013). Turnitin systems: A deterrent to plagiarism in college classrooms. Journal of Research on Technology in Education, 45(3), 229–248.
Herder, T., Swiecki, Z., Fougt, S. S., Tamborg, A. L., Allsopp, B. B., Shaffer, D. W., et al. (2018). Supporting teachers' intervention in students' virtual collaboration using a network based model. In Proceedings of the 8th international conference on learning analytics and knowledge (pp. 21–25).
Horbach, A., Aldabe, I., Bexte, M., de Lacalle, O. L., & Maritxalar, M. (2020). Linguistic appropriateness and pedagogic usefulness of reading comprehension questions. In Proceedings of the 12th language resources and evaluation conference (pp. 1753–1762).
Hwang, G. J., Xie, H., Wah, B. W., & Gašević, D. (2020). Vision, challenges, roles and research issues of Artificial Intelligence in Education. Computers & Education: Artificial Intelligence, 1, Article 100001.
Järvelä, S., Malmberg, J., Haataja, E., Sobocinski, M., & Kirschner, P. A. (2020). What multimodal data can tell us about the students' regulation of their learning process. Learning and Instruction, 45, Article 100727.
Jia, X., Zhou, W., Sun, X., & Wu, Y. (2020). EQG-RACE: Examination-type question generation. arXiv preprint arXiv:2012.06106.
Jovanović, J., Gašević, D., Pardo, A., Dawson, S., & Whitelock-Wainwright, A. (2019). Introducing meaning to clicks: Towards traced-measures of self-efficacy and cognitive load. In Proceedings of the 9th international conference on learning analytics & knowledge (pp. 511–520). New York: ACM.
Kaipa, R. M. (2021). Multiple choice questions and essay questions in curriculum. Journal of Applied Research in Higher Education, 13(1), 16–32. https://doi.org/10.1108/JARHE-01-2020-0011
Ke, Z., & Ng, V. (2019). Automated essay scoring: A survey of the state of the art. In Proceedings of the 28th international joint conference on artificial intelligence (pp. 6300–6308).
Khosravi, H., Conati, C., Martinez-Maldonado, R., Knight, S., Kay, J., Chen, G., et al. (2022). Explainable AI in education. Computers & Education: Artificial Intelligence. In this issue.
Khosravi, H., Kitto, K., & Williams, J. J. (2019). Ripple: A crowdsourced adaptive platform for recommendation of learning activities. arXiv preprint arXiv:1910.05522.
Klebanov, B. B., Madnani, N., & Burstein, J. (2013). Using pivot-based paraphrasing and sentiment profiles to improve a subjectivity lexicon for essay data. Transactions of the Association for Computational Linguistics, 1, 99–110.
Knight, S., Shibani, A., Abel, S., Gibson, A., Ryan, P., Sutton, N., et al. (2020). AcaWriter: A learning analytics tool for formative feedback on academic writing. Journal of Writing Research, 12(1), 141–186.
Lai, G., Xie, Q., Liu, H., Yang, Y., & Hovy, E. (2017). RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
Lave, J., & Wenger, E. (1991). Situated learning: Legitimate peripheral participation. Cambridge University Press.
van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of adaptive testing. New York, NY: Springer.
Llamas-Nistal, M., Fernández-Iglesias, M. J., González-Tato, J., & Mikic-Fonte, F. A. (2013). Blended e-assessment: Migrating classical exams to the digital world. Computers & Education, 62, 72–87.
Lodge, J. M. (2018). A futures perspective on information technology and assessment. In J. Voogt, G. Knezek, R. Christensen, & K.-W. Lai (Eds.), International handbook of information technology in primary and secondary education (2nd ed., pp. 1–13). Berlin: Springer.
Luke, C. (2003). Pedagogy, connectivity, multimodality, and interdisciplinarity. Reading Research Quarterly, 38(3), 397–403.
Marche, S. (2021, April 3). The computers are getting better at writing. https://www.newyorker.com/culture/cultural-comment/the-computers-are-getting-better-at-writing
Mayfield, E., Madaio, M., Prabhumoye, S., Gerritsen, D., McLaughlin, B., Dixon-Román, E., et al. (2019, August). Equity beyond bias in language technologies for education. In Proceedings of the 14th workshop on innovative use of NLP for building educational applications (pp. 444–460).
McArthur, J. (2016). Assessment for social justice. Assessment & Evaluation in Higher Education, 41(7), 967–981.
McLaren, P., & Leonardo, Z. (1998). Deconstructing surveillance pedagogy. Studies in the Literary Imagination, 31(1), 127–147.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Microsoft. (2022, March 1). Microsoft Editor checks grammar and more in documents, mail, and the web. https://support.microsoft.com/en-us/office/microsoft-editor-checks-grammar-and-more-in-documents-mail-and-the-web-91ecbe1b-d021-4e9e-a82e-abc4cd7163d7
Milligan, S. K., & Griffin, P. (2016). Understanding learning and learning design in MOOCs: A measurement-based interpretation. Journal of Learning Analytics, 3(2), 88–115.
Mislevy, R. J., Behrens, J. T., Dicerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-centered design, psychometrics, and educational data mining. Journal of Educational Data Mining, 4(1), 11–48.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.
Molenaar, I. (2022). The concept of hybrid human-AI regulation: Exemplifying how to support young learners' self-regulated learning. Computers & Education: Artificial Intelligence. In this issue.
Molenaar, I., Horvers, A., & Baker, R. S. (2021). What can moment-by-moment learning curves tell about students' self-regulated learning? Learning and Instruction, 72, Article 101206.
Murphy, V., Fox, J., Freeman, S., & Hughes, N. (2017). "Keeping it real": A review of the benefits, challenges and steps towards implementing authentic assessment. All Ireland Journal of Higher Education, 9(3), 1–13.
Page, E. B. (1966). The imminence of... grading essays by computer. Phi Delta Kappan, 47(5), 238–243.
Pagni, S. E., Bak, A. G., Eisen, S. E., Murphy, J. L., Finkelman, M. D., & Kugel, G. (2017). The benefit of a switch: Answer-changing on multiple-choice exams by first-year dental students. Journal of Dental Education, 81(1), 110–115.
Palermo, C., & Thomson, M. M. (2018). Teacher implementation of self-regulated strategy development with an automated writing evaluation system: Effects on the argumentative writing performance of middle school students. Contemporary Educational Psychology, 54, 255–270.
Panadero, E. (2017). A review of self-regulated learning: Six models and four directions for research. Frontiers in Psychology, 8, 883–928. https://doi.org/10.3389/fpsyg.2017.00422
Papamitsiou, Z., & Economides, A. A. (2017). Student modeling in real-time during self-assessment using stream mining techniques. In Proceedings of the 17th IEEE international conference on advanced learning technologies (pp. 286–290). IEEE.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 311–318).
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Kaufmann.
Perret-Clermont, A.-N. (1980). Social interaction and cognitive development in children. Academic Press.
Piech, C., Spencer, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L., et al. (2015). Deep knowledge tracing. arXiv preprint arXiv:1506.05908.
Popham, W. J. (2001). Teaching to the test? Educational Leadership, 58(6), 16–21.
Puntambekar, S., Erkens, G., & Hmelo-Silver, C. (Eds.). (2011). Analyzing interactions in CSCL. Boston, MA: Springer US. https://doi.org/10.1007/978-1-4419-7710-6
Purchase, H., & Hamer, J. (2018). Peer-review in practice: Eight years of Aropä. Assessment & Evaluation in Higher Education, 43(7), 1146–1165.
Reeves, T. C., & Okey, J. R. (1996). Alternative assessment for constructivist learning environments. In B. G. Wilson (Ed.), Constructivist learning environments: Case studies in instructional design (pp. 191–202). Englewood Cliffs, NJ: Educational Technology Publications.
Rogers, T., Gašević, D., & Dawson, S. (2016). Learning analytics and the imperative for theory driven research. The SAGE Handbook of E-Learning Research.
Rosé, C. P., McLaughlin, E. A., Liu, R., & Koedinger, K. R. (2019). Explanatory learner models: Why machine learning (alone) is not the answer. British Journal of Educational Technology, 50(6), 2943–2958.
Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning, and Assessment, 1(2).
Saint, J., Gašević, D., Matcha, W., Uzir, N. A. A., & Pardo, A. (2020). Combining analytic methods to unlock sequential and temporal patterns of self-regulated learning. In Proceedings of the tenth international conference on learning analytics & knowledge (pp. 402–411). New York: ACM.
Saltman, K. (2020). Artificial intelligence and the technological turn of public education privatization. London Review of Education, 18(2), 196–208.
Scheuneman, J. (1979). A method of assessing bias in test items. Journal of Educational Measurement, 143–152.
Shaffer, D. W. (2006a). Epistemic frames for epistemic games. Computers in Education, 46(3), 223–234.
Shaffer, D. W. (2006b). How computer games help children learn. New York, NY: Palgrave.
Shaffer, D. W., Collier, W., & Ruis, A. R. (2016). A tutorial on epistemic network analysis: Analyzing the structure of connections in cognitive, social, and interaction data. Journal of Learning Analytics, 3(3), 9–45.
Shaffer, D. W., & Kaput, J. J. (1998). Mathematics and virtual culture: An evolutionary perspective on technology and mathematics education. Educational Studies in Mathematics, 37, 97–119.
Shin, D., Shim, Y., Yu, H., Lee, S., Kim, B., & Choi, Y. (2021). Saint+: Integrating temporal features for EdNet correctness prediction. In Proceedings of the 11th international learning analytics and knowledge conference (pp. 490–496).
Shnayder, V., & Parkes, D. C. (2016). Practical peer prediction for peer assessment. In Proceedings of the fourth AAAI conference on human computation and crowdsourcing (pp. 199–208).
Shute, V. J. (2011). Stealth assessment in computer-based games to support learning. Computer Games and Instruction, 55(2), 503–524.
Shute, V. J., & Rahimi, S. (2021). Stealth assessment of creativity in a physics video game. Computers in Human Behavior, 116, Article 106647.
Shute, V., Rahimi, S., Smith, G., Ke, F., Almond, R., Dai, C. P., … Sun, C. (2021). Maximizing learning without sacrificing the fun: Stealth assessment, adaptivity and learning supports in educational games. Journal of Computer Assisted Learning, 37(1), 127–141.
Shute, V., & Ventura, M. (2013). Stealth assessment: Measuring and supporting learning in video games. Cambridge, MA: MIT Press.
Sorrel, M. A., Barrada, J. R., de la Torre, J., & Abad, F. J. (2020). Adapting cognitive diagnosis computerized adaptive testing item selection rules to traditional item response theory. PLoS One, 15(1), Article e0227196.
Sullivan, S. A., Warner-Hillard, C., Eagan, B., Thompson, R., Ruis, A. R., Haines, K., et al. (2018). Using epistemic network analysis to identify targets for educational interventions in trauma team communication. Surgery, 163(4), 938–943.
Suto, I., Nádas, R., & Bell, J. (2011). Who should mark what? A study of factors affecting marking accuracy in a biology examination. Research Papers in Education, 26(1), 21–51.
Swauger, S. (2020). Our bodies encoded: Algorithmic test proctoring in higher education. In J. Stommel, C. Friend, & S. Morris (Eds.), Critical digital pedagogy. Press Books.
Taras, M. (2008). Assessment for learning. Journal of Further and Higher Education, 32(4), 389–397.
Thompson, N. A. (2007). A practitioner's guide for variable-length computerized classification testing. Practical Assessment, Research and Evaluation, 12(1), 1.
Topping, K. J. (2018). Using peer assessment to inspire reflection and learning. Routledge.
Toton, S. L., & Maynes, D. D. (2019). Detecting examinees with pre-knowledge in experimental data using conditional scaling of response times. Frontiers in Education, 4, 49.
Van Der Graaf, J., Lim, L., Fan, Y., Kilgour, J., Moore, J., Bannert, M., … Molenaar, I. (2021, April). Do instrumentation tools capture self-regulated learning? In LAK21: 11th international learning analytics and knowledge conference (pp. 438–448).
Verschoor, A., Berger, S., Moser, U., & Kleintjes, F. (2019). On-the-fly calibration in computerized adaptive testing. In Theoretical and practical advances in computer-based educational measurement (pp. 307–323). Cham: Springer.
Vygotsky, L. S., & Cole, M. (1978). Mind in society: Development of higher psychological processes. Harvard University Press.
Wang, W., An, B., & Jiang, Y. (2018). Optimal spot-checking for improving evaluation accuracy of peer grading systems. In Proceedings of the 32nd AAAI conference on artificial intelligence (pp. 833–840).
Whitehill, J., Aguerrebere, C., & Hylak, B. (2019). Do learners know what's good for them? Crowdsourcing subjective ratings of OERs to predict learning gains. In Proceedings of the 12th international conference on educational data mining (pp. 462–467). IEDMS.
Wiliam, D. (2011). What is assessment for learning? Studies in Educational Evaluation, 37(1), 3–14.
Wilson, M. (2005). Constructing measures: An item response modeling approach. New York: Taylor & Francis Group.
Wilson, J., Ahrendt, C., Fudge, E. A., Raiche, A., Beard, G., & MacArthur, C. (2021). Elementary teachers' perceptions of automated feedback and automated scoring: Transforming the teaching and learning of writing using automated writing evaluation. Computers & Education, 168, Article 104208.
Wilson, J., & Czik, A. (2016). Automated essay evaluation software in English Language Arts classrooms: Effects on teacher feedback, student motivation, and writing quality. Computers & Education, 100, 94–109.
Wilson, J., & Roscoe, R. D. (2020). Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research, 58(1), 87–125.
Wilson, M., & Scalise, K. (2012). Assessment of learning in digital networks. In P. Griffin, & E. Care (Eds.), Assessment and teaching of 21st century skills: Methods and approach (pp. 37–56). Dordrecht: Springer.
Wise, S. L., & Gao, L. (2017). A general approach to measuring test-taking effort on computer-based tests. Applied Measurement in Education, 30(4), 343–354.
Wright, J. R., Thornton, C., & Leyton-Brown, K. (2015). Mechanical TA: Partially automated high-stakes peer grading. In Proceedings of the 46th ACM technical symposium on computer science education (pp. 96–101).
Yannakoudakis, H., & Briscoe, T. (2012). Modeling coherence in ESOL learner texts. In Proceedings of the seventh workshop on building educational applications using NLP (pp. 33–43).
Zheng, Y., Li, G., Li, Y., Shan, C., & Cheng, R. (2017). Truth inference in crowdsourcing: Is the problem solved? Proceedings of the VLDB Endowment, 10(5), 541–552.
Zhou, M., & Winne, P. H. (2012). Modeling academic achievement by self-reported versus traced goal orientation. Learning and Instruction, 22(6), 413–419.