Received: 19 March 2019 Revised: 18 November 2019 Accepted: 16 January 2020
DOI: 10.1002/acp.3639

RESEARCH ARTICLE

Comparing the effects of generating questions, testing, and restudying on students' long-term recall in university learning

Mirjam Ebersbach | Maike Feierabend | Katharina Barzagar B. Nazari

Department of Psychology, University of Kassel, Kassel, Germany

Correspondence
Mirjam Ebersbach, Department of Psychology, University of Kassel, Hollaendische Str. 36-38, D-34127 Kassel, Germany.
Email: [email protected]

Summary
We compared the long-term effects of generating questions by learners with answering
questions (i.e., testing) and restudying in the context of a university lecture. In
contrast to previous studies, students were not prepared for the learning strategies,
learning content was experimentally controlled, and effects on factual and transfer
knowledge were examined. Students' overall recall performance after one week
profited from generating questions and testing but not from restudying. When ana-
lyzing the effects on both knowledge types separately, traditional analyses revealed
that only factual knowledge appeared to benefit from testing. However, additional
Bayesian analyses suggested that generating questions and testing similarly benefit
factual and transfer knowledge compared with restudying. The generation of ques-
tions thus seems to be another powerful learning strategy, yielding effects similar to those of
testing on long-term retention of coherent learning content in educational contexts,
and these effects emerge for factual and transfer knowledge.

KEYWORDS

desirable difficulties, factual and transfer knowledge, question generation, testing effect,
university learning

1 | INTRODUCTION

When students prepare for their exams, they typically restudy the learning material by rereading or rehearsing (Karpicke, Butler, & Roediger, 2009). However, the acquisition of knowledge referring to coherent, complex learning material benefits little from this type of superficial restudying (Callender & McDaniel, 2009), and long-term retention might even be impaired by this strategy (for an overview, see Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013). Long-term retention of curriculum-related material is a central aim in education because prior knowledge facilitates the further acquisition of knowledge and allows knowledge to be applied in a variety of contexts outside formal learning environments, such as when working as a professional. Therefore, identifying learning strategies that promote long-term retention is essential. We speak of "long-term retention" when retention intervals (i.e., the period between learning and testing) include at least one day (see also Adesope, Trevisan, & Sundararajan, 2017; Agarwal, Karpicke, Kang, Roediger, & McDermott, 2008; Roediger & Karpicke, 2006), in contrast to many laboratory studies, in which the learning outcome has often been tested immediately after the learning phase (e.g., Wouters, van Nimwegen, van Oostendorp, & van der Spek, 2013).

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any
medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2020 The Authors. Applied Cognitive Psychology published by John Wiley & Sons Ltd.




1.1 | Desirable difficulties in learning: The testing effect

One branch of learning strategies is predicated on desirable difficulties, denoting mechanisms that make the learning process subjectively harder but help learners to retain information in the long run (Bjork, 1994). One of these desirable difficulties is testing, by which learners try to answer questions about the learning material during the learning phase, before their knowledge is fully consolidated. Testing yields medium to large effects on retention performance in the laboratory and in natural learning contexts (for meta-analyses, see Adesope et al., 2017: Hedges' g = 0.61; Rowland, 2014: Hedges' g = 0.50).

One explanation for the testing effect is that it promotes retrieval practice when learners try to remember the studied contents during the learning phase (for an overview of retrieval-based learning, see Karpicke, 2017). Retrieval practice has direct effects, by strengthening the memory trace through the retrieval attempt, and mediated effects, by providing feedback to learners about the extent of their learning (Roediger & Karpicke, 2006). In addition, retrieval practice can even enhance the retrieval of other information that is learned after the initial testing phase (for a review, see Pastötter & Bäuml, 2014).

Many studies on the testing effect have focused on the retrieval of facts acquired in the learning phase rather than on transfer effects (for an overview, see Carpenter, 2012). However, Thomas, Weywadt, Anderson, Martinez-Papponi, and McDaniel (2018) reported beneficial and even crossover effects of testing for different knowledge formats in an online learning environment with adult students learning about neuropsychology. Factual questions in the initial testing phase enhanced the final test performance with regard to application knowledge, whereas initial testing with application questions improved the final test performance with regard to factual knowledge. McDaniel, Thomas, Agarwal, McDermott, and Roediger (2013) reported similar transfer effects for the learning of science in middle school but with one exception: Factual questions in the initial testing phase did not improve performance on application questions in the final exam, whereas application questions yielded transfer effects on factual questions in the final exam (d = 0.34). Pan and Rickard (2018) conducted a meta-analysis on transfer effects of testing. Transfer was defined relatively broadly, occurring when the cues or required responses (or both) in the initial testing phase and in the final performance tests differed. This definition includes close (e.g., rephrasing information) and far transfer (e.g., drawing new inferences). The meta-analysis revealed a small to medium effect of initial testing on transfer performance (d = 0.40). This effect was moderated by several conditions and was negligible when these conditions were not present. Transfer effects were stronger (a) for certain kinds of transfer tasks, for example, for application and inference questions (weaker or even negative transfer effects occurred for questions in which stimulus and response were rearranged compared with the initial test, or for initially presented but untested material), (b) when the initial testing involved the retrieval of broad knowledge, not of isolated concrete facts, and (c) when retrieval was successful in the initial testing phase. Tran, Rohrer, and Pashler (2015) explicitly examined the effect of testing on far transfer by asking participants to make deductive inferences based on premises. Although participants recalled premises to a greater extent when they were initially tested than when they only restudied them, participants' performance in the final test with regard to deductive inferences was not enhanced by initial testing. Given that the majority of studies focusing on transfer effects of testing have been conducted in laboratory settings, the authors called for more research on this topic in authentic educational settings. For example, Batsell, Perry, Hanley, and Hostetter (2017) revealed positive effects (ηp² = .35) of quizzing on the performance in the final exam in a university psychology class compared with restudying, and most importantly, this effect also emerged for questions that were not included in the quizzes (d > 0.59), which can be conceived as a transfer effect.

Apart from the scarcity of studies investigating the testing effect on transfer knowledge in authentic educational settings, often only the immediate effects of testing on transfer performance were examined in these studies. An exception is the study of Butler (2010), who reported positive effects of initially tested items referring to a short text passage on far transfer in a final test after one week (d = 0.99). The present study addresses, among other aspects, far transfer by investigating the long-term testing effects on factual and transfer knowledge.

1.2 | Generating questions

Instructing learners to generate questions based on the learning material also yields medium to large effects on comprehension, recall, and problem solving (for an overview, see Song, 2016). Generating questions may stimulate a deeper processing and reflection of the learning material as well as retrieval practice in comparison with restudying. However, in most of the reviewed studies, learners were trained on how to generate questions effectively and practiced this strategy in advance and under supervision. In addition, the learning material involved only short text passages, and only short-term effects were examined.

Bugg and McDaniel (2012), for example, instructed undergraduate students in the laboratory on how to generate either factual or conceptual questions and presented them with examples of each question type. Thereafter, the students were asked to read short text passages on scientific phenomena and to generate questions and answers related to these texts. The generation of questions was compared with rereading. Students had access to the text passages in all conditions (i.e., an open-book condition). The final test, including factual and conceptual test questions, took place immediately after the learning phase. The generation of conceptual questions yielded a benefit for conceptual test questions (ηp² = .19) compared with rereading, whereas the generation of factual questions yielded no effect.

Evidence for the effects of the generation of questions on comprehension was accumulated in a meta-analysis across 26 studies in which children and college students were trained in multiple sessions on how to generate questions related to written texts. The analysis yielded medium to large short- and long-term effects on the comprehension of the studied material (Rosenshine, Meister, & Chapman, 1996: g = 0.61).

Although many of the studies on question generation were conducted in the laboratory, some studies examined this effect in real learning settings (i.e., in school or at a university).

King (1992), for example, compared the effects of self-questioning, summarizing, and note taking (as the control condition) in the context of videotaped university lectures on sociopolitical themes. Students first received background information and comprehensive training on self-questioning or summarizing (i.e., a 50-min training phase and four practice phases, 50 min each). Thereafter, students saw the videotaped university lectures and were asked to apply their respective learning strategy. In an immediate comprehension test directly after the last lecture, students in the self-questioning condition and in the summarizing condition outperformed the control group, whereas no difference was found between the first two groups. In the final recall test after one week, self-questioners performed significantly better than summarizers and students in the control group, with no differences between the latter two groups (no effect sizes reported).

A similar field study was conducted by King (1994) with fourth and fifth graders within their regular science curriculum. They followed real lessons on the structure and functioning of the body. The children first received an introduction to the respective learning strategy (i.e., generating and answering questions in dyads that either targeted discovering relationships between different concepts within one lesson or targeted relating the lesson content to their prior knowledge) and practiced it during three lessons. Children in a control group were not guided on how to generate questions. Comprehension and knowledge construction, tested one week after the treatment phase, were better in both groups in which children were guided on how to pose questions than in the control group. Thus, with ample training, generating questions in real learning contexts can promote short- and long-term retention in children and adults.

Other studies have been conducted in real learning contexts without training, but they suffer from methodological shortcomings. In the Berry and Chew (2008) study, students were not randomly assigned to the respective learning strategies. Instead, the students decided whether or not they wanted to generate questions about the lecture content. In the Levin and Arnold (2008) study, two experimental question-generation groups were compared, but no control group was included in which questions were not generated.

In sum, the extent to which the generation of questions yields robust effects on retention performance when learners are not trained remains an open question, as training and practicing such strategies is effortful and time-consuming in real learning contexts. We address this question in the present study by examining the long-term effect of the generation of questions (and answers) in the context of a university lecture without prior training.

1.3 | Studies comparing the effects of testing and the generation of questions

Testing in terms of answering questions generated by others might be conceived as complementary to generating questions oneself. However, questions generated by the learners could also be seen as a form of testing because the previously processed information must be retrieved in the generation phase to formulate adequate questions and answers. An important question is whether both strategies yield similar effects or whether question generation is even superior to testing, given that it includes not only responses but also the formulation of the questions.

Studies that compared the effects of testing and question generation were often based on short texts, included only short test delays (i.e., a couple of days), and, most critically, used conditions that were often not comparable with regard to the extent of the learning material covered in the testing or question generation condition and the expenditure of time to perform the tasks.

These previous studies yielded contradicting results. A larger benefit of testing compared with generating questions and rereading was reported by Denner and Rickards (1987) for 5th to 11th graders. Weinstein, McDermott, and Roediger (2010) revealed similar benefits from both question generation (d = 0.75) and testing (d = 0.96) compared with rereading in a sample of adult students. Other studies suggested that the generation of questions might be even more effective than answering questions generated by others (e.g., by teachers: Hartman, 1994; Palinscar & Brown, 2009). Foos, Mora, and Tkacz (1994) found a general advantage for students who generated parts of the learning material themselves (including self-generated questions) compared with students who were provided with the material by others (including other-generated questions), g = 0.15. Bae, Therriault, and Redifer (2019) held the learning time constant across conditions, including testing and the generation of questions, and found—in contrast to Foos et al. (1994)—an advantage of testing over the generation of questions in a final test after one week in a sample of students. However, the demands in the learning conditions differed with regard to the tasks included, for example, retrieving all information from the text that could be remembered (i.e., free recall), answering 20 multiple-choice questions (i.e., testing), generating an undefined number of exam questions (i.e., question generation), or generating five keywords related to the text (i.e., keywords). Thus, the reported effects could be attributed to these differences between the conditions.

Given the inconsistent findings, the question of whether question generation and testing boost retention to a similar degree compared with restudying when both conditions are comparably manipulated needs to be further investigated. Generating questions could arguably be more favorable than testing because it requires active reflection on the learning content in search of material that can be reflected in a question, followed by the generation of the corresponding answer. Moreover, the cognitive processing involved in the generation of questions is greater than with testing because, with testing, the content is already implied by the question, and only the answer is required to be generated.

These shortcomings in how question generation and testing were compared will be addressed in the present study, which also fits with the ongoing discussion on teacher- versus student-centered learning (e.g., Kirschner, Sweller, & Clark, 2006). Testing can be conceived as a teacher-centered approach, which addresses content that the teacher believes to be relevant. The generation of questions by students, in contrast, can be conceived as a student-centered approach because the content reflected in the questions is selected by the learners. Thus, they must discern the relevance and importance of the information when generating questions. Moreover, the strategy also requires the generation of corresponding answers, which might evoke deeper processing than just answering questions on a test.

1.4 | Open- versus closed-book tests

As outlined in Section 1.1, testing yields robust effects on learning and retention. However, the retrievability of the content in the initial testing phase appears to be a crucial factor for the testing effect (Rowland, 2014). When information is not retrievable in the initial testing phase, it cannot be consolidated by the mere attempt at retrieval. To solve this problem, testing can be conducted with feedback. When learners are given the correct response after having tried to retrieve it in the initial testing phase, long-term memory is additionally enhanced (Butler & Roediger, 2008). Feedback can be provided either as a formal response to the learners' answers or by offering learners the opportunity to search for the information in their notes or learning material. The latter option is called an open-book test, in contrast to closed-book tests, in which learners are not allowed to use the material in the initial testing phase and do not get explicit feedback, at least until the phase is finished. Open-book tests also reflect more validly what students often do in the frame of their self-regulated learning. Usually, after having memorized new information, students try to recall this information and then look in the learning material when their recall attempt fails.

Agarwal et al. (2008) compared the testing effect in an open-book condition, in which learners were allowed to look up the material during initial testing, and two closed-book conditions, in which learners completed the initial tests during the learning phase either with feedback (i.e., they were provided with the learning material after they completed the initial test and were told to check their answers) or without feedback. Scores in a final test immediately after the learning phase were higher in the open-book condition compared with the two closed-book conditions (d = 1.12). In a second final test after one week, open-book testing outperformed closed-book testing without feedback (d = 0.45), whereas the performance in the open-book condition and the closed-book condition with feedback was similar, and both conditions yielded better retention than simple restudying (ds > 0.87; cf. Nevid, Pyun, & Cheney, 2016, for similar results). Furthermore, the performance in the closed-book condition with feedback was better than in the closed-book condition without feedback in the second final test (d = 0.57). The initial advantage of the open-book test can be attributed to the fact that learners have the chance to correct their memory stores (cf. Gharib, Phillips, & Mathew, 2012). Closed-book tests without feedback, in contrast, do not offer such an opportunity. Thus, learners might potentially recall incorrect information, which is then strengthened by the initial testing. This shortcoming can be prevented by means of feedback in closed-book tests. A recent laboratory study found support for the crucial role of retrievability in the testing effect: Roelle and Berthold (2017) reported an advantage of open-book tests compared with closed-book tests in fostering long-term recall of complex learning material. In contrast, Rummer, Schweppe, and Schwede (2019) reported the opposite finding in a field experiment stretching over multiple seminar lessons, which might be attributed to the fact that students had restudied at home. In sum, the testing effect seems to be more pronounced if retrieval is accompanied by feedback, either by using open-book tests or by using closed-book tests with feedback, especially when complex knowledge is tested.

1.5 | The present study

The present study aims at extending empirical findings on the effects of testing and generating questions with regard to the following aspects: We examine the long-term effects on the recall of factual and transfer knowledge when using these strategies in the context of a university lecture. More specifically, we compared a testing condition and a generating questions condition with a restudy condition. The generating questions condition involved no prior training. The material indicated the content that students should address when generating questions. This method ensured that this condition was comparable with the testing condition, in which students received questions that addressed the same content. In addition, all students were provided with the same material in the learning phase to enhance the comparability of the conditions. This procedure resulted in an open-book condition for both the initial testing and the question generation groups (see Agarwal et al., 2008). Students were allowed to look up information after they had tried to retrieve the learned content when solving the tasks in the learning phase. Thus, students' performance was less dependent on their retrieval success compared with traditional closed-book conditions. Moreover, the open-book condition when generating questions also increased the comparability of this condition with the restudy condition, which was by the nature of its activity an open-book condition, allowing students to correct their long-term memory stores. A final surprise test was administered after one week.¹ The test included both factual and transfer questions to compare whether one type of knowledge benefits more from the different learning conditions. Students were additionally asked how they usually prepare for exams to contrast potential effects of the learning conditions with their habitual learning strategies. Self-testing is not a frequently used learning strategy (Karpicke et al., 2009). Thus, we assumed that generating questions would also not be reported as a frequently used strategy.

We expected that (a) students in the generating questions condition would perform better in the final test after one week than students in the testing condition because of the greater generation activity, (b) students in both experimental conditions would outperform students in the restudy condition, and (c) the effects of the generation of questions and testing would emerge for both factual and transfer knowledge. We additionally explored whether the depth of the questions in the question generation condition was related to students' recall performance.

2 | METHOD

2.1 | Design

The study followed an experimental pre-/post-design. Learning condition (i.e., generating questions, testing, and restudy) served as the between-subjects variable, to which students were randomly assigned. All students were tested one week after the learning session to assess their long-term retention. The final test included factual questions, which assessed information found in the learning content, and transfer questions, which assessed students' deeper understanding of the learning content. Final test performance (i.e., proportion correct) was the dependent variable.

2.2 | Participants

Participants were recruited from, and attended, a lecture in developmental psychology. In the experimental learning session at the end of one lecture, 105 students consented to take part (77% female; age: M = 21.8 years, SD = 4.5; 49% psychology students, 43% teacher trainees, and 8% other students). An a priori calculation of the sample size required for linear regressions, assuming a medium effect size of learning condition (i.e., f² = .15; see Section 1), a power of .90, and two predictors (i.e., generation of questions and testing, with restudy as reference group), yielded N = 88 (G*Power: Faul, Erdfelder, Buchner, & Lang, 2009). The psychology students were in their first semester and the teacher trainees in their third semester. It was highly unlikely that they had previously attended a lecture on developmental psychology addressing the topic covered in the present study.
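
The reported calculation can be reproduced in R. The following is a minimal sketch, assuming the pwr package as a stand-in for G*Power; the call mirrors the stated inputs (f² = .15, power = .90, two predictors) but is not the authors' original procedure.

    library(pwr)

    # Fixed-effects multiple regression: medium effect f^2 = .15,
    # alpha = .05, power = .90, u = 2 numerator df (the two dummy-coded
    # learning-condition predictors, restudy as reference).
    res <- pwr.f2.test(u = 2, f2 = 0.15, sig.level = 0.05, power = 0.90)

    # Total sample size N = u + v + 1; rounding the denominator df up
    # reproduces the reported N = 88.
    ceiling(res$v) + 2 + 1
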
The students were randomly assigned to one of the three learning conditions. They participated voluntarily. However, the learning session and the final test session took place within the course, and the lecture material was relevant for their exam at the end of the semester. As an additional incentive, students who finished both sessions could take part in a lottery.

The final sample, including only students who took part in both the learning session and the final test session, consisted of 82 students in total: 30 students in the restudy condition (83% female; age: M = 22.2 years, SD = 5.5), 22 students in the question generation condition (86% female; age: M = 20.2 years, SD = 3.0), and 30 students in the testing condition (70% female; age: M = 21.0 years, SD = 2.5). The decrease in sample size between the lecture that included the experimental learning phase and the final test, which took place in another lecture, can be attributed to the fact that students were not obliged to be present in the lectures. No systematic attrition effect occurred in any of the conditions because the final test was not announced.

2.3 | Material

The lecture was about a topic in the field of developmental psychology (i.e., the development of domain-specific knowledge in infancy and childhood). Usually, students who attend this lecture have not encountered this subject before in their studies. Thus, prior knowledge can be ruled out as a likely confound (see also Section 2.1). A paper booklet with demographic questions and an open question about how the students usually prepare for exams (multiple answers were possible) was distributed to all students at the end of the lecture. The booklet also included instructions for the particular learning task and 10 slides of the lecture, which were identical in the three learning conditions (see Supporting information, including the original data, in OSF: https://osf.io/a3w9y/). Relevant words were printed in bold on the slides. In the generating questions condition, students were instructed to formulate one exam question in an open response format for the content of each slide and to also provide an answer to the question based on the relevant keywords that were printed in bold. In the testing condition, one question per slide was formulated referring to the bold keywords. The students' task was to try to answer the questions first without help and to look up the answer in the slides only if they were not able to provide an answer. In the restudy condition, the instruction was to go through all 10 slides and memorize the content. The proportion of questions generated by the learners in the generating questions condition was similar to the proportion of questions answered in the testing condition (i.e., 99% of the requested questions were generated in the generating questions condition and 96% of all questions were answered in the testing condition). However, students in the generating questions condition generated a larger proportion of correct answers in the learning phase (99%) compared with students in the testing condition (83%).

The final surprise test was conducted again within a lecture but this time by means of an internet-based test, accessible via a link by means of smartphones or other electronic devices. The few students who had no electronic device received the tests in a paper–pencil version. The final test was not announced in advance to prevent students from preparing for this test and to rule out self-selection processes. It included 10 factual questions asking for isolated facts that could easily be derived from the bolded words on the slides (e.g., "Which factors contribute to the development of a Theory of Mind?"). The factual questions were identical to the questions presented to the students in the testing condition in the learning phase. In addition, each final test included 10 transfer questions that assessed students' deeper understanding of the learning content in terms of being able to use it in new contexts (Pan & Rickard, 2018). Transfer questions referred to the same slides as factual questions but required the application of knowledge beyond the bolded words in the slides, such as transferring it to new contexts or making generalizations or inferences (see Appendix A for an example of a slide and the corresponding factual and transfer question; for all questions and answers, see the supplementary information in OSF: https://osf.io/wc29h/). Due to a mistake, one transfer question referred to information that was presented on a slide.

This question, which was part of Item set 2, was therefore excluded from the analyses. To keep the final test sessions short, about half of the students received five factual questions and five transfer questions (i.e., Item set 1), and the other half received the remaining five factual and four transfer questions (i.e., Item set 2). The answers in the final test were scored with 1 to 4 points per question. These scores were summed across questions and transformed into proportion correct, separately for factual and transfer knowledge as well as for the total score. The final test performance of about 50% of the students was rated by a second rater, yielding satisfying interrater reliabilities ranging between .94 and .98.

2.4 | Procedure

Students attended a regular university lecture on the development of domain-specific knowledge as part of their courses. About 20 min before the end of this lecture, students were informed that an extra learning phase would follow that would help them memorize the content of the lecture. In addition, students were informed that there would be different conditions but that all conditions were expected to have positive effects. Thereafter, students were randomly placed in three separate groups to avoid interferences between the conditions, and the materials (i.e., the booklets with the instructions, tasks, and slides) were distributed. After the students finished all the tasks, the booklets were collected. The slides were accessible at all times in all conditions of the learning phase. One week later, at the beginning of the next lecture, the final online test was administered based on information given in the lecture. The final surprise test, which took about 20 min, was not scheduled in advance to prevent students from preparing for the test. In the final test, students were instructed to respond to the questions without additional help and without communicating. To ensure that cheating did not occur, three to four experimenter assistants supervised the students during the test. All participants were informed about the results of the three learning conditions after the study was finished but before the exam took place. This procedure was implemented to counteract any disadvantages due to the imposed strategy, especially for students in the restudy condition, which is known to be less effective compared with testing (e.g., Roediger & Karpicke, 2006). Thus, all students had the chance to use the most effective strategy to boost their performance for the exam.

3 | RESULTS

3.1 | Preliminary analyses

First, we analyzed the learning strategies that the students had reported before the learning session, which they typically use when preparing for exams. More than one strategy could be mentioned. Individual learning strategies were categorized into (a) restudying the material, (b) active summarizing (e.g., writing notes, summaries, and note cards), (c) elaboration (e.g., visualization, working examples, and consulting further resources), (d) self-testing and explaining to others, (e) generating questions, and (f) miscellaneous (interrater reliability: Cohen's kappa = .91). Active summarizing strategies were reported most frequently (130),³ followed by restudying strategies (116). Strategies that included testing (50), elaboration (19), and other strategies (15) were clearly reported less frequently. Generating questions was reported only once.

We then checked whether the final test performance varied as a function of the study course (i.e., psychology, education, and other courses). No differences were found between the students from different study courses, p = .27. Therefore, the data were collapsed across these groups.

Finally, we checked whether the assignment of students to the parallel item sets was balanced in each condition. A simple cross tabulation revealed that somewhat more than half of the students (63%) received Item set 2 in the restudy condition, and somewhat more than half of the students received Item set 1 in the generating questions condition (59%) and the testing condition (57%). To control for this imperfect distribution of item sets across the conditions, the item set variable was included in all of the following models.
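
As an illustration, this balance check amounts to a one-line cross tabulation in R; a minimal sketch, assuming a data frame d with one row per student and the hypothetical columns condition and item_set:

    # Within-condition share of students receiving each item set.
    prop.table(table(d$condition, d$item_set), margin = 1)
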
3.2 | Testing the hypotheses

A linear regression model was computed in R (R Core Team, 2017; RStudio Team, 2016) to test our first two hypotheses: that (a) students in both experimental conditions would outperform students in the restudy condition and (b) students in the generating questions condition would perform better in the final test after one week than students in the testing condition. Packages used for data preparation and analyses were dplyr (Wickham, Francois, Henry, & Müller, 2017) and emmeans (Lenth, 2018).²

In the linear regression model, learning condition (three levels: restudy, generating questions, and testing; restudy as reference group) and the control variable item set (1 or 2) were included as predictors. The criterion variable was the overall final test performance, measured as proportion correct across factual and transfer knowledge items. The results are shown in Table 1, including the unstandardized regression coefficients, which can be interpreted as the number of percentage points by which one group differed from the reference group (restudy). Given that the dependent variable (i.e., proportion correct) could range between 0 and 1, a value of .20, for example, would indicate that the students in that condition scored 20 percentage points higher in the final test than the reference group. The analyses revealed significant positive effects of generating questions and testing compared with the restudy condition (see Table 1: Model 1, and Figure 1 for descriptive statistics). Students in both experimental conditions (i.e., generating questions and testing) scored on average 11 percentage points higher on the final test compared with students in the restudy condition. No significant difference was found between generating questions and testing, p = .93.
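
Model 1 can be sketched as follows, under the same assumptions about d as above; the column name prop_correct and the factor levels are illustrative, not the authors' original script.

    library(dplyr)
    library(emmeans)

    # Restudy as reference group; item set as control variable.
    d <- d %>%
      mutate(condition = relevel(factor(condition), ref = "restudy"),
             item_set  = factor(item_set))

    # Overall final test performance (proportion correct) as criterion.
    m1 <- lm(prop_correct ~ condition + item_set, data = d)
    summary(m1)  # unstandardized coefficients, as reported in Table 1

    # Pairwise contrasts, e.g., generating questions versus testing
    # (reported above as p = .93).
    emmeans(m1, pairwise ~ condition)
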

TABLE 1 Linear regression models predicting test performance one week after learning

                                          Dependent variable
                                          Overall test performance   Factual knowledge   Transfer knowledge
                                          (Model 1)                  (Model 2)           (Model 3)
  Intercept                               0.45*** (0.04)             0.58*** (0.05)      0.37*** (0.05)
  Question generation (ref.: restudying)  0.11* (0.05)               0.11 (0.06)         0.11 (0.06)
  Testing (ref.: restudying)              0.11* (0.05)               0.13* (0.06)        0.07 (0.05)
  Item set 2 (ref.: item set 1)           −0.15*** (0.04)            −0.30*** (0.05)     0.12* (0.05)
  R² (adj.)                               .21***                     .38***              .06

Note: Models include unstandardized regression coefficients; standard errors in parentheses. Ref. indicates the reference category against which the target category was tested; item set serves as control variable. N = 82.
***p < .001; *p < .05.

FIGURE 1 Final test performance (proportion correct) one week after the learning session, separately for each learning condition

FIGURE 2 Final test performance (proportion correct) one week after the learning session, separately for factual and transfer knowledge and for each learning condition. BF10 indicates how much more likely it is, based on the presented data, that the respective experimental learning condition has a positive rather than a negative effect compared with the restudy condition

To test our third hypothesis, that the effects of question generation and testing would emerge for factual and transfer knowledge, two additional models were computed, one for each knowledge type (i.e., factual and transfer), with the same independent variables as in the first model (see Table 1, and Figure 2 for descriptive statistics). In the second model, with factual knowledge as the dependent variable, a significant positive effect of testing was found compared with restudying (13 percentage points difference), but no significant effect of generating questions was found compared with restudying, p = .08. The difference between generating questions and testing was also not significant, p = .76. In the third model, with transfer knowledge as the dependent variable, the effects were not significant for generating questions (p = .09) and testing (p = .24) compared with restudying. Furthermore, generating questions and testing did not differ, p = .53.

To check the nonsignificant results with an alternative approach, we reanalyzed the respective regression models by means of Bayesian analyses, using the R package brms (Bürkner, 2017). The Bayesian approach is more advantageous for small sample sizes and allows null effects to be tested. In contrast to classical inferential statistics, it provides relative evidence for the null or the alternative hypothesis in the form of a Bayes factor instead of a binary decision. We report the 95% credible interval for each reported effect, which indicates the range of values that are most likely for the respective effect. Additionally, based on the distribution of the possible parameter values, the Bayes factor BF10 can be used to express the ratio of the likelihood that the alternative hypothesis is correct to the likelihood that the null hypothesis is correct, given the data. For example, a Bayes factor of BF10 = 10 would indicate that the alternative hypothesis is 10 times more likely than the null hypothesis (the complement Bayes factor BF01 expresses the ratio of the likelihood that the null hypothesis is correct to the likelihood that the alternative hypothesis is correct).
Improper flat priors over the reals were used for the analyses, which means that the prior distributions had little influence on the results; the estimates were instead mainly driven by the data (Bürkner, 2017; Kruschke, 2013).
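
A minimal sketch of this reanalysis with brms, reusing the illustrative variable names from the lm() sketch above; the exact calls are assumptions rather than the authors' code.

    library(brms)

    # Without user-specified priors, brms places improper flat priors over
    # the reals on the regression coefficients, as described in the text.
    bm1 <- brm(prop_correct ~ condition + item_set, data = d, seed = 1)

    # 95% credible intervals for the condition effects.
    posterior_summary(bm1)

    # Directional evidence ratio for a positive (vs. negative) effect of
    # generating questions compared with restudying; the coefficient name
    # depends on the factor level labels and is hypothetical here. The
    # evidence ratio returned by hypothesis() corresponds to the BF10
    # reported in the text.
    hypothesis(bm1, "conditiongenerating_questions > 0")
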
The Bayesian regression analysis with the same variable structure as the models described above for the overall final test performance confirmed that no evidence exists for a difference between generating questions and testing (95% credible interval for the effect of testing compared with generating questions from −0.11 to 0.10, BF10 = 1). In contrast to the nonsignificant effect of generating questions on factual knowledge compared with restudying, the Bayesian model provided strong evidence for a positive effect of generating questions compared with restudying (95% credible interval from −0.01 to 0.23, BF10 = 25). That is, although the effect was not significant in the more traditional frequentist approach, the Bayesian analysis suggests that generating questions compared with restudying is more likely to have had a positive effect on performance than a negative or no effect. In addition, the nonsignificant difference between testing and generating questions on factual knowledge was confirmed by the Bayesian model (95% credible interval for the effect of testing compared with generating questions from −0.10 to 0.14, BF10 = 0.6). Finally, the more traditional analysis revealed that both effects of generating questions and testing on transfer knowledge compared with restudying were not significant. However, the Bayesian model indicated strong evidence for a positive effect of generating questions of about 11 percentage points compared with restudying (95% credible interval from −0.02 to 0.24, BF10 = 21) and moderate evidence for a positive effect of testing of about 7 percentage points compared with restudying (95% credible interval from −0.05 to 0.19, BF10 = 8). The model also indicated no difference between the effects of generating questions and testing (95% credible interval for the effect of testing compared with generating questions from −0.17 to 0.08, BF10 = 3) (see Figure 2).

In sum, the more traditional analyses showed that generating questions and testing—compared with restudying—improve the overall final test performance in the long run and that a similar effect also emerges at least for testing when factual knowledge is analyzed separately, whereas no such effects emerged for transfer knowledge. However, Bayesian analyses indicated positive effects of generating questions and testing compared with restudying on both factual and transfer knowledge, although these effects tend to be smaller than the effects on the overall final test performance (which might explain why they did not reach statistical significance in the more traditional approach). Furthermore, question generation did not outperform testing.

We additionally analyzed whether the depth of the generated questions affected the effect of generating questions. The depth of the questions was evaluated by two raters⁴ according to the scoring scheme adapted from Berry and Chew (2008). When factual knowledge or the definition of a concept was addressed in a question, question depth was scored as 1; when the question addressed application-related transfer knowledge, it was scored as 2; and when the question required a deeper conceptual analysis or the integration with other knowledge domains, it was scored as 3. Interrater reliability was r(290) = .89. Mean question depth did not vary much between participants (min: 1, max: 1.5, M = 1.22, SD = 0.142). Moreover, no significant correlation was found between question depth and the final test performance in the generating questions condition, neither when the overall final test performance was considered nor when factual and transfer knowledge were considered separately, ps > .49.

4 | DISCUSSION

4.1 | Summary of the findings

The aim of this field study was to examine and compare the effects of generating questions, testing, and restudying on final test performance that addressed the content of a university lecture. In contrast to previous studies, we specifically investigated long-term effects on factual and transfer knowledge, and students were not trained in advance on how to generate questions effectively. In addition, we made the conditions maximally comparable to rule out confounding effects (e.g., effects of differences concerning the contents covered in the different learning conditions).

Students who generated questions and answers performed as well as students who answered experimenter-generated questions (i.e., testing) in terms of overall performance in the final test after one week. An important finding is that students in both conditions performed significantly better than students who had only restudied the material. The positive effects of question generation and testing compared with restudying were also confirmed when factual and transfer knowledge were analyzed separately. Although a traditional frequentist approach revealed no significant effects of question generation on factual knowledge or of testing and question generation on transfer knowledge, additional Bayesian analyses suggested strong evidence that question generation yielded positive effects on factual and transfer knowledge compared with restudying and moderate evidence that testing yielded a positive effect on transfer knowledge. The depth of the generated questions was not related to students' final test performance.

Our results show that generating questions in an open-book format is—like testing—a powerful learning strategy in real learning contexts that may help students enhance and consolidate their knowledge over longer periods of time compared with restudying. This finding is important because one central aim of education is to promote the long-term retention of knowledge so that it can be applied in different contexts. In addition, long-term retention supports the acquisition of new knowledge by facilitating its assimilation with prior knowledge.

How can the effect of question generation be explained? In general, it is assumed that generating questions stimulates a deeper elaboration of the learning material and a deeper processing (King, 1992; Song, 2016). Furthermore, rephrasing might be a plausible mechanism explaining the effect (Doctorow, Wittrock, & Marks, 1978; Wittrock, 1974).

The extant literature provides ample evidence showing that rephrasing or paraphrasing is an effective tool for enhancing the processing, comprehension, and recall of the paraphrased information (e.g., Bui, Myerson, & Hale, 2013; Hagaman, Casey, & Reid, 2012; Rosenshine & Meister, 1994; Wammes, Meade, & Fernandes, 2017). Rephrasing establishes representational variability of the learning content and thereby generates multiple memory traces for retrieving this content. This assumption is related to the encoding variability hypothesis, which states that retrieval of information is facilitated when it is encoded in multiple ways or by different encoding strategies (Estes, 1950; Glenberg, 1979). Moreover, rephrasing can also be conceived as a generative activity because new wording is created. Previous research on the generation effect as another desirable difficulty in learning (Bjork, 1994) has shown that generating information enhances memory not only when larger parts or whole words from the learning material are generated by the learner (for a meta-analysis, see Bertsch, Pesta, Wiscott, & McDaniel, 2007) but also when only single letters of the words are generated or switched by the learner (Donaldson & Bass, 1980; Nairne & Widner, 1987). Thus, even a slight generation activity can be effective. Arguably, the generation of new words is not necessary to stimulate a generation effect; the effect can be invoked, for example, by setting words in a different order, in which case only details have to be changed. However, further research is needed to clarify how rephrasing and generation contribute to the positive effect of generating questions. Our results also suggest that retrieval practice might not be the essential factor constituting the effect of question generation because retrieval practice played only a weak role in the present study, given the open-book format (cf. Agarwal et al., 2008).

We also demonstrated that generating questions has a significant impact on knowledge acquisition even when learners were not prompted or trained in advance on how to generate questions effectively, as in previous studies (e.g., King, 1992), and when questions and answers were not evaluated by the instructor or others afterwards (Song, 2016). Thus, the application of this strategy in educational practice requires little effort and boosts learning. For example, teachers could instruct students to generate exam questions during the lecture from the perspective of the teacher. To provide an incentive for students, the lecturer could tell the students that selected questions would be included in the exam. This practice has been informally reported by several lecturers who taught university courses.

Generating questions yielded effects similar to those of testing in our study, and both conditions outperformed simple restudying. These effects also became evident for the two different knowledge types (i.e., factual and transfer knowledge) when Bayesian analyses were applied. Given that the open-book format clearly limited retrieval practice in the question generation and testing conditions, other mechanisms could have contributed to the positive effects, as discussed earlier. One advantage of open-book formats is that learners can consolidate correct knowledge by looking up the material (e.g., Agarwal et al., 2008), in contrast to a closed-book condition, in which they might not recall the information if it is too complex or might recall wrong information (cf. Roelle & Berthold, 2017; van Gog & Sweller, 2015). We showed that an open-book condition is effective not only in combination with testing but also in combination with the generation of questions. Moreover, an open-book condition corresponds to the typical approach of learners when they test themselves in a self-regulated learning environment (e.g., Kornell & Son, 2009; Wissman, Rawson, & Pyc, 2012).

The finding that testing did not outperform the generation of questions contradicts the findings of Bae et al. (2019) and Foos et al. (1994). However, in these studies, the generation of questions condition was not fully comparable with the testing condition with regard to the number of questions and the content. As a result, students in previous studies could have generated fewer questions, or multiple questions based on the same aspect of the learning material, or questions that only addressed easily comprehensible aspects, thereby failing to exhibit effects similar to those of a testing condition that covered a broader range of learning content. We overcame this problem by prescribing the number of questions to be generated and the content that should be addressed in the generated questions, to be able to compare this condition with the testing condition. The instructions, for example, to form questions based on the bolded words, were followed by the students. One possible critique of our method could be that students in the testing condition had an advantage by receiving the same test items on the final test as they had in the learning phase. Thus, they could have been more familiar with the final test questions than students in the generating questions condition, which in turn could have leveled out a potential advantage of question generation. We tried to rule out this effect by prescribing the terms that should be included when generating questions (i.e., bolded words on the slides). These terms were also included in the questions of the testing condition. Thus, both conditions were comparable in terms of the core content of the questions in the learning phase.

Nonetheless, our finding that testing and question generation without prior training yielded similar effects in a real learning context is promising. Our results suggest that the two learning strategies, which are both clearly more effective than simple restudying, can be recommended to learners by (university) teachers and can also be recommended for (self-regulated) learning. Furthermore, the effects of question generation and testing emerged for factual and transfer knowledge, as confirmed by the Bayesian analyses. The finding for testing is in line with the results of the meta-analysis of Pan and Rickard (2018), who reported transfer effects of testing that were particularly strong for application and inference questions, which fall in the same category as our transfer questions. However, we also showed that question generation may positively affect factual and transfer knowledge, despite the small effects when the two knowledge types were analyzed separately.

Apart from its positive effects, generating questions—like testing—is also an effortful strategy for learners. The learning process activated by generating questions and testing is more difficult than restudying, and the infrequent use of these strategies can be inferred from our findings. We also observed unsystematically during the experiment that students in the generating questions condition and in the testing condition took slightly longer than students in the restudy condition (cf. Weinstein et al., 2010).
The extra time makes sense because the more intensive examination and deeper elaboration of the learning content induced by generating questions and testing take more time than merely trying to memorize the content (Endres, Carpenter, Martin, & Renkl, 2017). However, our study did not explicitly control for the time spent on the material, although the learning period was fixed in all conditions. In a self-paced environment, one could test whether time on task contributed to the beneficial effects of testing and generating questions (see also Hoogerheide, Staal, Schaap, & van Gog, 2019). This issue should be addressed in future research to separate qualitative effects of the learning condition from simple quantitative effects of study time. In addition, it might be instructive to assess cognitive load to further explain the revealed effects.

Finally, the depth of the generated questions was not related to students' final test performance in this condition, which is in line with the findings of Berry and Chew (2008). Nevertheless, other studies found that students performed better in recall tests when they were trained to generate cognitively challenging questions (e.g., Bugg & McDaniel, 2012; Levin & Arnold, 2008). The null finding for question depth in our study might be due to the fact that (a) the generated questions mostly addressed superficial facts rather than requiring conceptual analyses or inferences (i.e., question depth was low in general) and that (b) question depth varied little in the student sample. The finding of a strong effect of generating questions is astonishing, given that students were not instructed in advance on how to generate questions and given that the questions were not very elaborate. Thus, the generation of questions might be a rather effective strategy independent of prior training.

4.2 | Limitations, future directions, and implications

Several aspects of our study might limit the generalizability of the results and suggest further research. The first limitation is that testing and generating questions were based on an open-book format; that is, students were allowed to search the material for information to generate questions and answers. Comparing the effects of generating questions and testing between an open-book and a closed-book condition would be informative. Studies have shown that the testing effect is stronger in an open-book condition, especially for complex learning material (Agarwal et al., 2008; Roelle & Berthold, 2017). Open-book testing is nevertheless not a common technique in formal instruction, even though it reduces test anxiety when applied in exams (e.g., Gharib et al., 2012). However, students' self-paced study time might decrease when they expect an open-book exam (Agarwal & Roediger, 2011).

Interestingly, students in the testing condition of the present study generated fewer correct answers during the learning phase (83%) than students in the question generation condition (99%). Thus, the activity of generating appropriate questions and answers in an open-book condition might stimulate students more strongly than the activity of testing to look up the material. This difference in practice performance might also have accounted, at least to some degree, for the fact that testing did not outperform question generation in the present study. To tease out potentially smaller differences between testing and question generation, we recommend replicating our method with a larger sample size.

A second limitation refers to the design of the material, in which the relevant aspects were printed in bold. In self-regulated learning, students often must identify the relevant aspects of the learning material themselves before memorizing them. Thus, examining the effects of testing and question generation in future studies with material that does not indicate the relevant aspects in advance would be informative.

Third, the topic used in our study was rather specific (i.e., the development of domain-specific knowledge). Future studies could investigate whether the results can be replicated with other topics, for example, with statistics, which is more abstract and requires procedural and descriptive skills. In addition, the retention phase could be extended. The testing effect is known to become stronger over longer intervals between learning and testing (Roediger & Karpicke, 2006). Investigating how the effect of question generation evolves across longer intervals would also be informative.

A fourth limitation is common to many studies that sample university students. The psychology students and teacher trainees in the present study had undergone a strict selection process to obtain a place at the university, which was primarily based on their final high school certificate grade (i.e., the German Abitur as the main criterion used for the numerus clausus policy). In contrast to university students, poorer learners could conceivably have difficulties when generating questions. Thus, the generalization of the results to other groups of students or learners (e.g., pupils or high school students) in real learning contexts should be explored in further studies. In addition, the fluctuation and dropout rate was fairly high in our study. This fluctuation is normal for the university where the study took place, as students are not obliged to attend the courses. Hence, our method mirrored characteristics of a real learning situation. A constant sample across measurements, however, could provide more valid results. Nevertheless, we believe that our study makes an important contribution to the literature on the efficacy of generating questions and testing in applied settings, such as university lectures, and some of the shortcomings in comparison with lab studies can also be conceived as strengths in terms of the applicability to real-life learning contexts.

Related to the discussion about the sample is the question of prior knowledge. Although it is unlikely that students had previously been introduced to the development of domain-specific knowledge (see also Sections 2.1 and 2.3), prior knowledge of the topic cannot be fully ruled out. However, if some students had prior knowledge of the topic, these students would have been distributed equally across the learning conditions because of random assignment and thus should not have biased the main effect of condition. In future studies, working with a larger sample might be fruitful, even if an a priori sample size calculation yields a sample size similar to ours. The effects of the learning conditions in the present study were fairly small when the knowledge types were considered separately. A larger sample size might yield stronger effects with more traditional statistical analyses.
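To illustrate what such an a priori calculation involves, the following minimal sketch computes the required sample size for a one-way comparison of three learning conditions in R with the pwr package. The package choice and the assumed effect size (a conventional medium effect, Cohen's f = .25) are our assumptions for illustration only and do not reproduce the calculation underlying the present study (cf. Faul, Erdfelder, Buchner, & Lang, 2009).

# Illustrative a priori sample size calculation for a between-subjects design
# with three learning conditions. The effect size is assumed, not estimated.
library(pwr)

calc <- pwr.anova.test(
  k = 3,             # number of learning conditions
  f = 0.25,          # assumed population effect size (Cohen's f)
  sig.level = 0.05,  # alpha level
  power = 0.80       # desired statistical power
)
print(calc)          # yields n of approximately 52 participants per condition

With a smaller assumed effect, as the separate analyses of the two knowledge types would suggest, the required sample grows quickly; for example, f = .15 already requires more than 140 participants per condition.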
Finally, the moderate performance of students in the final surprise test warrants a closer look (see Figure 1). This performance might be due to the fact that the test was unannounced and that the final exam took place about one month after the end of the study; thus, most students might not yet have started to study the lecture content. In addition, the performance level shows that the learning content was rather complex and not easy to learn from a single lecture, even when additional learning opportunities are given.

In sum, although further research is necessary to replicate and extend our findings, generating questions is a learning strategy as powerful as testing in real learning contexts: It requires no extensive training or prompts and might strengthen students' long-term recall of factual and transfer knowledge. Given its task characteristics and requirements, and the fact that students hardly use this strategy spontaneously when learning, the generation of questions might be conceived as a further learning strategy related to desirable difficulties in learning (Bjork, 1994).

ACKNOWLEDGMENTS
We thank Karina Senftner for her support in the data collection and scoring and Mike Cofrin for proofreading the manuscript.

CONFLICT OF INTEREST
We declare that there is no conflict of interest.

ENDNOTES
1. The identical final test was repeated four weeks after the learning phase. However, as testing time was manipulated within subjects, a general testing effect in all conditions cannot be ruled out. Therefore, the results of this second test are not reported here but can be inspected as Supporting Information: https://osf.io/a3w9y/.
2. The data that support the findings of this study are available from the corresponding author upon reasonable request.
3. These frequencies could exceed the total sample size when individuals mentioned two or more strategies that were assigned to the same superordinate strategy.
4. One rater was one of the authors, and the second rater was a student assistant who coded the questions according to a predefined scoring scheme.

DATA AVAILABILITY STATEMENT
Supplementary material, including the original data, is available on the OSF: https://osf.io/a3w9y/

ORCID
Mirjam Ebersbach https://orcid.org/0000-0003-3853-4924
Katharina Barzagar B. Nazari https://orcid.org/0000-0003-4909-272X

REFERENCES
Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87, 659–701. https://doi.org/10.3102/0034654316689306
Agarwal, P. K., Karpicke, J. D., Kang, S. H. K., Roediger, H. L., & McDermott, K. B. (2008). Examining the testing effect with open- and closed-book tests. Applied Cognitive Psychology, 22, 861–876. https://doi.org/10.1002/acp.1391
Agarwal, P. K., & Roediger, H. L. (2011). Expectancy of an open-book test decreases performance on a delayed closed-book test. Memory, 19, 836–852. https://doi.org/10.1080/09658211.2011.613840
Bae, C. L., Therriault, D. J., & Redifer, J. L. (2019). Investigating the testing effect: Retrieval as a characteristic of effective study strategies. Learning and Instruction, 60, 206–214. https://doi.org/10.1016/j.learninstruc.2017.12.008
Batsell, W. R., Perry, J. L., Hanley, E., & Hostetter, A. B. (2017). Ecological validity of the testing effect: The use of daily quizzes in introductory psychology. Teaching of Psychology, 44, 18–23. https://doi.org/10.1177/0098628316677492
Berry, J. W., & Chew, S. L. (2008). Improving learning through interventions of student-generated questions and concept maps. Teaching of Psychology, 35, 305–312. https://doi.org/10.1080/00986280802373841
Bertsch, S., Pesta, B. J., Wiscott, R., & McDaniel, M. A. (2007). The generation effect: A meta-analytic review. Memory & Cognition, 35, 201–210. https://doi.org/10.3758/BF03193441
Bjork, R. A. (1994). Memory and meta-memory considerations in the training of human beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press.
Bugg, J. M., & McDaniel, M. A. (2012). Selective benefits of question self-generation and answering for remembering expository text. Journal of Educational Psychology, 104, 922–931. https://doi.org/10.1037/a0028661
Bui, D. C., Myerson, J., & Hale, S. (2013). Note-taking with computers: Exploring alternative strategies for improved recall. Journal of Educational Psychology, 105, 299–309.
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80, 1–28. https://doi.org/10.18637/jss.v080.i01
Butler, A. C. (2010). Repeated testing produces superior transfer of learning relative to repeated studying. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36, 1118–1133. https://doi.org/10.1037/a0019902
Butler, A. C., & Roediger, H. L. (2008). Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing. Memory & Cognition, 36, 604–616. https://doi.org/10.3758/MC.36.3.604
Callender, A. A., & McDaniel, M. A. (2009). The limited benefits of rereading educational texts. Contemporary Educational Psychology, 34, 30–41. https://doi.org/10.1016/j.cedpsych.2008.07.001
Carpenter, S. K. (2012). Testing enhances the transfer of learning. Current Directions in Psychological Science, 21, 279–283. https://doi.org/10.1177/0963721412452728
Denner, P. R., & Rickards, J. P. (1987). A developmental comparison of the effects of provided and generated questions on text recall. Contemporary Educational Psychology, 12, 135–146. https://doi.org/10.1016/S0361-476X(87)80047-4
Doctorow, M., Wittrock, M. C., & Marks, C. (1978). Generative processes in reading comprehension. Journal of Educational Psychology, 70, 109–118. https://doi.org/10.1037/0022-0663.70.2.109
Donaldson, W., & Bass, M. (1980). Relational information and memory for problem solutions. Journal of Verbal Learning & Verbal Behavior, 19, 26–35.
Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students' learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14, 4–58. https://doi.org/10.1177/1529100612453266
Endres, T., Carpenter, S., Martin, A., & Renkl, A. (2017). Enhancing learning by retrieval: Enriching free recall with elaborative prompting. Learning and Instruction, 49, 13–20. https://doi.org/10.1016/j.learninstruc.2016.11.010
Estes, W. K. (1950). Toward a statistical theory of learning. Psychological Review, 57, 94–107. https://doi.org/10.1037/h0058559
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160.
Foos, P. W., Mora, J. J., & Tkacz, S. (1994). Student study techniques and the generation effect. Journal of Educational Psychology, 86, 567–576. https://doi.org/10.1037/0022-0663.86.4.567
Gharib, A., Phillips, W., & Mathew, N. (2012). Cheat sheet or open-book? A comparison of the effects of exam types on performance, retention, and anxiety. Psychology Research, 2, 469–478. https://doi.org/10.17265/2159-5542/2012.08.004
Glenberg, A. M. (1979). Component-levels theory of the effects of spacing of repetitions on recall and recognition. Memory & Cognition, 7, 95–112.
Hagaman, J. L., Casey, K. J., & Reid, R. (2012). The effects of the paraphrasing strategy on the reading comprehension of young students. Remedial and Special Education, 33, 110–123. https://doi.org/10.1177/0741932510364548
Hartman, H. J. (1994). From reciprocal teaching to reciprocal education. Journal of Developmental Education, 18, 2–8.
Hoogerheide, V., Staal, J., Schaap, L., & van Gog, T. (2019). Effects of study intention and generating multiple choice questions on expository text retention. Learning and Instruction, 60, 191–198. https://doi.org/10.1016/j.learninstruc.2017.12.006
Karpicke, J. D. (2017). Retrieval-based learning: A decade of progress. In J. H. Byrne (Ed.), Learning and memory: A comprehensive reference (2nd ed., pp. 487–514). Oxford, UK: Academic Press.
Karpicke, J. D., Butler, A. C., & Roediger, H. L. (2009). Metacognitive strategies in student learning: Do students practise retrieval when they study on their own? Memory, 17, 471–479. https://doi.org/10.1080/09658210802647009
King, A. (1992). Comparison of self-questioning, summarizing, and notetaking-review as strategies for learning from lectures. American Educational Research Journal, 29, 303–323. https://doi.org/10.3102/00028312029002303
King, A. (1994). Guiding knowledge construction in the classroom: Effects of teaching children how to question and how to explain. American Educational Research Journal, 31, 338–368. https://doi.org/10.3102/00028312031002338
Kirschner, P. A., Sweller, J., & Clark, R. E. (2006). Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching. Educational Psychologist, 41, 75–86.
Kornell, N., & Son, L. K. (2009). Learners' choices and beliefs about self-testing. Memory, 17, 493–501. https://doi.org/10.1080/09658210902832915
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142, 573–603. https://doi.org/10.1037/a0029146
Lenth, R. V. (2018). emmeans: Estimated marginal means, aka least-squares means. R package version 1.3.0. https://CRAN.R-project.org/package=emmeans
Levin, A., & Arnold, K.-H. (2008). Fragen stellen, um Antworten zu erhalten – oder Fragen generieren, um zu lernen? [Asking questions to receive answers, or generating questions in order to learn?]. Zeitschrift für Pädagogische Psychologie, 22, 135–142. https://doi.org/10.1024/1010-0652.22.2.135
McDaniel, M. A., Thomas, R. C., Agarwal, P. K., McDermott, K. B., & Roediger, H. L. (2013). Quizzing in middle-school science: Successful transfer performance on classroom exams. Applied Cognitive Psychology, 27, 360–372. https://doi.org/10.1002/acp.2914
Nairne, J. S., & Widner, R. L., Jr. (1987). Generation effects with nonwords: The role of test appropriateness. Journal of Experimental Psychology: Learning, Memory, & Cognition, 13, 164–171.
Nevid, J. S., Pyun, Y. S., & Cheney, B. (2016). Retention of text material under cued and uncued recall and open and closed book conditions. International Journal for the Scholarship of Teaching and Learning, 10, Article 10. https://doi.org/10.20429/ijsotl.2016.100210
Palincsar, A. S., & Brown, A. L. (1984). Reciprocal teaching of comprehension-fostering and comprehension-monitoring activities. Cognition and Instruction, 1, 117–175. https://doi.org/10.1207/s1532690xci0102_1
Pan, S. C., & Rickard, T. C. (2018). Transfer of test-enhanced learning: Meta-analytic review and synthesis. Psychological Bulletin, 144, 710–756. https://doi.org/10.1037/bul0000151
Pastötter, B., & Bäuml, K.-H. (2014). Retrieval practice enhances new learning: The forward effect of testing. Frontiers in Psychology, 5, 286. https://doi.org/10.3389/fpsyg.2014.00286
R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org
Roediger, H. L., III, & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249–255. https://doi.org/10.1111/j.1467-9280.2006.01693.x
Roelle, J., & Berthold, K. (2017). Effects of incorporating retrieval into learning tasks: The complexity of the tasks matters. Learning and Instruction, 49, 142–156. https://doi.org/10.1016/j.learninstruc.2017.01.008
Rosenshine, B., & Meister, C. (1994). Cognitive strategy instruction in reading. In D. A. Hayes & S. A. Stahl (Eds.), Instructional models in reading (pp. 85–107). Hillsdale, NJ: Erlbaum.
Rosenshine, B., Meister, C., & Chapman, S. (1996). Teaching students to generate questions: A review of the intervention studies. Review of Educational Research, 66, 181–221. https://doi.org/10.3102/00346543066002181
Rowland, C. A. (2014). The effect of testing versus restudy on retention: A meta-analytic review of the testing effect. Psychological Bulletin, 140, 1432–1463. https://doi.org/10.1037/a0037559
RStudio Team. (2016). RStudio: Integrated development for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/
Rummer, R., Schweppe, J., & Schwede, A. (2019). Open-book versus closed-book tests in university classes: A field experiment. Frontiers in Psychology, 10, 463. https://doi.org/10.3389/fpsyg.2019.00463
Song, D. (2016). Student-generated questioning and quality questions: A literature review. Research Journal of Educational Studies and Review, 2, 58–70.
Thomas, R. C., Weywadt, C. R., Anderson, J. L., Martinez-Papponi, B., & McDaniel, M. A. (2018). Testing encourages transfer between factual and application questions in an online learning environment. Journal of Applied Research in Memory and Cognition, 7, 252–260. https://doi.org/10.1016/j.jarmac.2018.03.007
Tran, R., Rohrer, D., & Pashler, H. (2015). Retrieval practice: The lack of transfer to deductive inferences. Psychonomic Bulletin & Review, 22, 135–140. https://doi.org/10.3758/s13423-014-0646-x
van Gog, T., & Sweller, J. (2015). Not new, but nearly forgotten: The testing effect decreases or even disappears as the complexity of learning materials increases. Educational Psychology Review, 27, 247–264. https://doi.org/10.1007/s10648-015-9310-x
Wammes, J. D., Meade, M. E., & Fernandes, M. A. (2017). Learning terms and definitions: Drawing and the role of elaborative encoding. Acta Psychologica, 179, 104–113. https://doi.org/10.1016/j.actpsy.2017.07.008
Weinstein, Y., McDermott, K. B., & Roediger, H. L. (2010). A comparison of study strategies for passages: Rereading, answering questions, and generating questions. Journal of Experimental Psychology: Applied, 16, 308–316. https://doi.org/10.1037/a0020992
Wickham, H., Francois, R., Henry, L., & Müller, K. (2017). dplyr: A grammar of data manipulation. R package version 0.7.4. https://CRAN.R-project.org/package=dplyr
Wissman, K. T., Rawson, K. A., & Pyc, M. A. (2012). How and when do students use flashcards? Memory, 20, 568–579. https://doi.org/10.1080/09658211.2012.687052
Wittrock, M. C. (1974). Learning as a generative process. Educational Psychologist, 11, 87–95. https://doi.org/10.1080/00461527409529129
Wouters, P., van Nimwegen, C., van Oostendorp, H., & van der Spek, E. (2013). A meta-analysis of the cognitive and motivational effects of serious games. Journal of Educational Psychology, 105, 249–265. https://doi.org/10.1037/a0031311

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.

How to cite this article: Ebersbach M, Feierabend M, Nazari KBB. Comparing the effects of generating questions, testing, and restudying on students' long-term recall in university learning. Appl Cognit Psychol. 2020;34:724–736. https://doi.org/10.1002/acp.3639

APPENDIX A: Example slide and corresponding factual and transfer question (for the complete material, see OSF)

Factual question: "Wilkening demonstrated that kindergartners already consider the speed and duration of movement when estimating the distance covered by animals. What is the constraint, assumed by Piaget to be prevalent among kindergartners, that was shown not to be exhibited by kindergartners?"

Transfer question: "If you generalize the findings of Wilkening on intuitive physics to the estimation of the volume of cylinders, which dimension(s) would preschoolers consider in their estimations?"