Benefits From Retrieval Practice Are Greater For
Pooja K. Agarwal, Jason R. Finley, Nathan S. Rose & Henry L. Roediger III
To cite this article: Pooja K. Agarwal, Jason R. Finley, Nathan S. Rose & Henry L. Roediger III
(2016): Benefits from retrieval practice are greater for students with lower working memory
capacity, Memory, DOI: 10.1080/09658211.2016.1220579
Benefits from retrieval practice are greater for students with lower working
memory capacity
Pooja K. Agarwalᵃ, Jason R. Finleyᵇ, Nathan S. Roseᶜ and Henry L. Roediger IIIᵃ
ᵃDepartment of Psychology, Washington University in St. Louis, St. Louis, MO, USA; ᵇDepartment of Psychology, Fontbonne University, Clayton, MO, USA; ᶜDepartment of Psychology, University of Notre Dame, Notre Dame, IN, USA
CONTACT Pooja K. Agarwal [email protected], www.poojaagarwal.com Washington University in St. Louis, St. Louis, MO 63130, USA
© 2016 Informa UK Limited, trading as Taylor & Francis Group

Testing is a powerful technique to enhance learning, because the act of retrieving information from memory promotes the ability to recall material again in the future (Carpenter & DeLosh, 2005; Carrier & Pashler, 1992; see Roediger & Karpicke, 2006a, for a review). The use of retrieval practice as a learning strategy, by teachers and students, has been shown to increase students’ long-term retention and transfer of knowledge to new situations (Agarwal, Bain, & Chamberlain, 2012; Butler, 2010).

In laboratory and classroom settings, several factors modulate benefits from retrieval practice, also referred to as the “testing effect” (for a review, see Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013). These factors include the time elapsed or the number of items between initial study and retrieval attempts (i.e., lag), the delay between initial retrieval practice and the final test (i.e., retention interval), and the presence or absence of feedback during initial retrieval. First, regarding lag, in general longer intervals between study of material and a test lead to better long-term retention, though the precise benefit from various schedules is complex and under debate (e.g., Balota, Duchek, & Logan, 2007; Karpicke & Roediger, 2007; Pyc & Rawson, 2007; Roediger & Karpicke, 2011). Second, regarding retention interval, a tradeoff is often found such that restudying improves retention in the short-term, but retrieval practice benefits learning in the long-term (e.g., Roediger & Karpicke, 2006b). In addition, shorter lags between study and retrieval trials often produce superior performance at short retention intervals, but longer lags produce superior performance at long retention intervals (Karpicke & Roediger, 2007; Whitten & Bjork, 1977). Third, benefits from retrieval practice substantially increase when feedback is provided, compared to retrieval without feedback; however, the timing of feedback following retrieval (immediate vs. delayed) and the length of the retention interval (e.g., one day vs. one week) influence its potency (Butler, Karpicke, & Roediger, 2007). In summary, lag, retention interval, and feedback all modulate benefits from retrieval practice, and various combinations of these factors produce varying degrees of enhanced learning.

Individual differences may also influence retrieval-enhanced learning (Unsworth & Engle, 2007). For instance, recent examinations reveal relationships between individual differences and retrieval difficulty (Bui, Maddox, & Balota, 2013), accessibility of retrieval cues (Unsworth, Spillers, & Brewer, 2012), and presentation duration (Unsworth, 2016). Regarding the testing effect, Wiklund-Hörnqvist, Jonsson, and Nyberg (2014) concluded that retrieval practice benefits did not differ as a function of working memory; however, their design manipulated trial type (study–study vs. study–test) between-subjects, so the extent to which individual subjects exhibited the retrieval practice effect cannot be determined, making the null result difficult to interpret.

In a paired associate paradigm, Brewer and Unsworth (2012) found a small benefit of retrieval practice, which was significantly correlated with some individual difference
measures (e.g., episodic memory) but not others (e.g., working memory). As Brewer and Unsworth noted, the relatively small testing effect they found is inconsistent with those of larger magnitude typically seen in the literature, leaving open the question whether there are “aptitude × treatment interactions” (pp. 414–415). In other words, individual benefits from retrieval may vary depending on factors known to modulate the testing effect, including lag, retention interval, and feedback.

In a follow-up study, Pan, Pashler, Potter, and Rickard (2015) conducted a replication attempt using Brewer and Unsworth’s materials and general procedures. Across two experiments, one online and one in the laboratory, Pan et al. found substantial testing effects (larger than in Brewer and Unsworth), but no significant correlations between an individual difference measure (episodic memory) and benefits from testing. Pan et al. speculated that subtle procedural distinctions might have contributed to the discrepancy between the two studies. Namely, the differences in counterbalancing and the blocking or mixing of presentations may account for the increased testing effect and/or the lack of a correlation with the individual difference measure in the Pan et al. study.

Lastly, in a foreign language vocabulary paradigm, Tse and Pu (2012) found a small benefit of retrieval practice, albeit significantly correlated with a combined working memory and test anxiety measure. Echoing the concluding remarks by others, Tse and Pu acknowledged that the unexpected small testing effect might be a result of using a short lag between items, even when employing a 7-day retention interval. In other words, ascertaining a strong relationship between the testing effect and individual differences can be challenging when using shorter lags, which are known to be less potent for learning (e.g., Dunlosky et al., 2013).

To summarise, across recent studies examining individual differences, factors known to improve test-enhanced learning (lag, retention interval, and feedback) were held constant. As a result, prior studies with small testing effects and/or small correlations with individual difference measures provide an initial glimpse into the precise relationship between retrieval practice and individual differences. Our aim was to explore both the relationship between the testing effect and individual differences, as well as the relationship between individuals and optimal retrieval conditions. We examined individual differences across various levels of lag, retention interval, and feedback, variables that are known to modulate the benefits of retrieval practice. Based on the current literature, we expected to find large benefits from retrieval when testing at longer lags, with feedback, at a delayed retention interval. We also measured working memory capacity to determine whether individuals might differ in the factors needed to provide the greatest benefit from retrieval. An ideal combination of factors may not exist for all students; rather, different combinations may prove effective for different students. This research contributes to our practical understanding about the conditions that lead to the greatest benefits of test-enhanced learning, and how these conditions might be tailored to enhance learning.

Methods

Subjects

One hundred sixty-six subjects (M age = 20.0 years, 103 female) were recruited from the Washington University in St. Louis Department of Psychology human subject pool. Subjects received either credit towards completion of a research participation requirement or cash payment ($10/hour). Data from 10 subjects were excluded from analyses because they did not follow instructions or they did not return for the second session. Thus, data are reported from 156 subjects.

We note that the 156 subjects were tested at two different time periods. The initial experiment was conducted in 2008 with 104 subjects. In 2011, we added 52 more subjects from the same pool for greater power. The design and procedures used at the two time periods were identical, and analyses reported in the results section confirmed a replication of findings between the two cohorts of subjects. Accordingly, we have collapsed the remainder of the methods and results sections across the two cohorts for maximal power and variability across individuals, unless otherwise noted.

Design

We used a 2 (Trial type: study–study, study–test) × 6 (Lag: 0, 1, 3, 5, 7, 9) × 2 (Feedback for study–test trials: present, absent) × 2 (Retention interval: 10 minutes, 2 days) mixed design. Trial type and lag were manipulated within subjects, whereas feedback and retention interval were manipulated between-subjects (39 subjects per cell). A non-studied baseline condition was included such that all subjects were tested on some items only during the final test (without initially studying these items) to assess how much learning had taken place during the experimental session.

Materials

One hundred ten general knowledge questions drawn from the Nelson and Narens (1980) norms were used for this experiment. An example general knowledge question used was, “What is the city in which the Baseball Hall of Fame is located?” Based on the norms, items had a 10% average recall in college students, ranging from 0.4% to 22% recall. As noted in our results section, the average baseline (non-studied) recall for the general knowledge questions found in our study was 12%, in accordance with the Nelson and Narens norms.
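As a rough illustration of the mixed design described in the Design section, the following Python sketch enumerates the within-subject conditions and the between-subjects groups and checks the sample-size arithmetic reported in the Subjects section. The variable and function names are illustrative assumptions, not taken from the study's materials.

```python
from itertools import product

# Factor levels as listed in the Design section (names are illustrative).
WITHIN = {
    "trial_type": ["study-study", "study-test"],
    "lag": [0, 1, 3, 5, 7, 9],
}
BETWEEN = {
    "feedback": ["present", "absent"],  # applies to study-test trials
    "retention_interval": ["10 minutes", "2 days"],
}


def cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


within_cells = cells(WITHIN)    # conditions every subject experiences
between_cells = cells(BETWEEN)  # groups a subject is assigned to

# Sample-size arithmetic from the Subjects section.
n_recruited, n_excluded = 166, 10
n_analyzed = n_recruited - n_excluded
subjects_per_group = n_analyzed // len(between_cells)

print(len(within_cells), len(between_cells), n_analyzed, subjects_per_group)
```

Under these assumptions the sketch yields 12 within-subject conditions and 4 between-subjects groups, which matches the reported 39 subjects per cell for 156 analyzed subjects.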
the final test phase, subjects were presented with the 78 critical items in random order and were provided 14 seconds to type in their answer for each question. The total time required for this procedure was approximately 90 minutes (60 min for the learning phase and working memory task, 30 min for the final test phase). Upon completion of the experiment, subjects were debriefed and thanked for their time.

Results

An alpha level of .05 was used for all tests of statistical significance except where otherwise noted. Where Mauchly’s test indicated that the assumption of sphericity was violated for a within-subjects factor in an analysis of variance (ANOVA), the Greenhouse–Geisser correction was applied to the degrees of freedom. Effect sizes for comparisons of means are reported as Cohen’s d calculated using the pooled standard deviation of the groups being compared. Effect sizes for ANOVAs are reported as ω̂² (one way) or ω̂p² calculated using the formulae provided by Maxwell and Delaney (2004, p. 598). Standard deviations reported are uncorrected for bias (i.e., calculated using N, not N – 1).

For initial learning performance, a three-way ANOVA (cohort, lag, and feedback) showed that cohort had no significant effect and was not involved in any significant interactions (ps ≥ .245). For final test performance, a five-way ANOVA (cohort, trial type, lag, feedback, and retention interval) showed that cohort had no significant effect and was not involved in any significant interactions (ps ≥ .136). Furthermore, working memory capacity did not significantly differ between the two cohorts, t(154) = 0.12, p = .903. Thus, we combined the data from the two cohorts for all analyses, except where otherwise noted.

Initial learning performance

Initial learning performance is shown in Figure 1a. Reliability (Cronbach’s α) was .855 for initial learning performance. Initial recall of answers to general knowledge questions declined as the lag between study and test increased. This was confirmed by a one-way ANOVA across lags, F(5, 775) = 12.22, MSE = 0.261, p < .001, ω̂² = .030. Follow-up t-tests of all 15 pairwise comparisons confirmed that lag 0 led to greater initial recall than the other lags, ts > 5.11, ps < .001, ds > 0.41, though differences between lags greater than 0 were not significant at the Bonferroni adjusted alpha level of .0033. We also performed an alternative analysis using regression to test the apparent decreasing pattern. For each subject, we obtained a slope using simple linear regression predicting mean initial learning performance as a function of lag. The mean slope was −.01 (SD = .02), which was significantly different from zero, t(155) = 5.71, p < .001, d = 0.46.

Because subjects did not receive feedback until after initial test trials and there was only one test per item, no effect of feedback was expected on initial learning. Accordingly, a 2 × 6 mixed ANOVA confirmed that there was neither a main effect of feedback group, F(1, 154) = .153, MSE = .142, p = .697, ω̂p² < .001, nor an interaction between feedback and lag, F(1, 154) = 1.49, MSE = .021, p = .191, ω̂p² = .001. Thus, the data in Figure 1a are collapsed over feedback conditions.

Final test performance

Final test performance is shown in Figure 1b (10-min retention interval) and 1c (2-day interval) as a function of whether repetitions across lags were in the study–study or study–test condition. Reliability (Cronbach’s α) was .953 for final test performance. We first conducted an overall 2 × 6 × 2 × 2 mixed ANOVA (trial type × lag × feedback × retention interval) and determined that feedback (present or absent) showed no significant main effects and was not involved in any significant interactions. Thus, the data in Figure 1b and 1c and further analyses in this section were collapsed across feedback groups. Feedback may not have had an effect because performance in the tested conditions was reasonably high at the lags we used (see Figure 1a).

Second, we examined final test performance as a function of retention interval to determine if there were significant retrieval practice effects after 10 minutes and after 2 days. A 2 × 2 mixed ANOVA (trial type × retention interval) confirmed a main effect of trial type: overall final test performance was better for study–test items (M = 62%, SD = 23%) than for study–study items (M = 54%, SD = 23%), F(1, 154) = 73.18, MSE = .007, p < .001, ω̂p² = .036. Forgetting occurred between 10 minutes (M = 69%, SD = 19%) and 2 days (M = 46%, SD = 20%), F(1, 154) = 55.86, MSE = .074, p < .001, ω̂p² = .260. The interaction between trial type and retention interval did not reach statistical significance, F(1, 154) = 2.67, MSE = .007, p = .104, ω̂p² < .001, indicating that regardless of retention interval, final performance was always greater for study–test items (10 minutes: M = 73%, SD = 19%; 2-day: M = 51%, SD = 21%) than for study–study items (10 minutes: M = 66%, SD = 20%; 2-day: M = 42%, SD = 19%). In addition, final performance for non-studied baseline items (M = 12%, SD = 14%) was significantly worse compared to study–study items, t(155) = 22.89, p < .001, d = 1.48, and study–test items, t(155) = 27.46, p < .001, d = 1.78, confirming that subjects were indeed learning the obscure facts and did not know most of them ahead of time.

Next, we examined final performance as a function of lag in order to determine whether there was an optimal lag for learning and whether this lag differed for the study–study and study–test conditions. Parallel analyses were conducted for the 10-min and 2-day retention interval. In both cases, the pattern in Figure 1a for initial learning was reversed at final test – whereas greater lags between initial study and restudy/test impaired performance during initial learning, they enhanced performance on the final test at both retention intervals, illustrating
the pattern Bjork (1994) described as a “desirable difficulty.” The conditions leading to best initial performance led to poorest long-term retention (and vice versa).

Two separate 2 × 6 repeated measures ANOVAs (trial type × lag), one for each retention interval, confirmed main effects of lag for the 10-min retention interval, F(5, 385) = 8.26, MSE = .029, p < .001, ω̂p² = .024, and for the 2-day retention interval, F(5, 385) = 9.94, MSE = .036, p < .001, ω̂p² = .031. Benefits from retrieval practice appeared to increase as a function of lag at both retention intervals (see Figure 1b and 1c), although the interaction between trial type and lag did not reach statistical significance at the 10-min retention interval, F(4.5, 346.3) = 1.15, MSE = .032, p = .335, ω̂p² = .001, nor after 2 days, F(4.4, 335.3) = 2.11, MSE = .035, p = .074, ω̂p² = .003.

Next, we performed an alternative analysis using regression to test the apparent increasing pattern. For each subject, we obtained a slope using simple linear regression predicting the mean difference score between study–study and study–test trials as a function of lag. At the 10-min retention interval, the slopes did not significantly differ from zero, M = .006, SD = .03, t(77) = 1.65, p = .102, d = 0.21. At the 2-day retention interval, however, the slopes were significantly positive, M = .010, SD = .03, t(77) = 3.09, p = .003, d = 0.35, indicating that retrieval practice benefits indeed increased as lag increased after a 2-day delay. This outcome is consistent with prior findings that testing effects often emerge on delayed tests more than on immediate tests (Roediger & Karpicke, 2006a, 2006b), and that more difficult retrieval yields greater benefits (Bjork, 1994; Finley, Benjamin, Hays, Bjork, & Kornell, 2011; Pyc & Rawson, 2009). In summary, retrieval practice improved final performance compared to restudying, both immediately (after 10 minutes) and after a delay (at 2 days); further, the benefit after a 2-day delay increased as the lag, or number of intervening items between study and retrieval trials, increased.

Associations with working memory capacity

Is there a relationship between working memory capacity and the potency of retrieval practice? To address this issue, we first examined correlations between initial and final test performance and individual differences in working memory capacity, as measured by the automated operation span task (Unsworth et al., 2005). In keeping with Unsworth et al., we used subjects’ total number of letters recalled in the correct serial position (for trials in which all letters in the sequence were correctly recalled) in the span task for all analyses. Subjects’ performance on the working memory task ranged from 10 to 75 (M = 60.3, Mdn = 65.0, SD = 14.3). The maximum score for the working memory task is 75; thus, subjects in our sample demonstrated working memory capacities toward the higher end of the scale. As such, “lower” working memory in our study refers to lower task performance compared to other subjects (not low on the range of possible scores on the working memory task).

Working memory was significantly correlated with initial recall success for study–test items, r = .31, t = 4.09, p < .001. Next, we computed correlations between working memory scores and the difference between final performance on study–test items vs. study–study items, and did so separately for all the between-subjects conditions. These data are shown as scatterplots in Figure 2. At the 10-min retention interval (Figure 2, top panels), there was no significant correlation between working memory capacity and retrieval practice effects in the no feedback condition, r = .18, t(37) = 1.09, p = .282, and none in the feedback condition, r = .11, t(37) = 0.69, p = .494. Note that although the trend in both 10-min conditions was positive, it was not statistically significant; thus, students with differing working memory capacity benefitted equivalently from retrieval practice, either with or without feedback.

At the 2-day retention interval (Figure 2, bottom panels), there was no significant correlation in the no feedback condition, r = −.02, t(37) = 0.09, p = .926; however, there was a significant negative correlation in the feedback condition, r = −.42, t(37) = 2.79, p = .008. Note that this result replicated across our first sample (n = 26, r = −.45) and our second sample (n = 13, r = −.40), increasing our confidence in the result. Thus, for a 2-day retention interval, the lower a student’s working memory capacity, the more s/he benefited from retrieval practice with feedback. We note that these specific conditions (retrieval with feedback after a 2-day delay) may be of particular relevance in applied settings, where the provision of feedback and a delay before the final test are practical and ideal for enhancing learning.

Finally, we conducted an analysis to determine whether the relationship between trial type and lag varied as a function of working memory capacity. We restrict this analysis to the 2-day retention interval group in which feedback was given during learning (Figure 2, bottom-right panel), as this is the group in which a significant correlation was observed between working memory capacity and the effect of retrieval practice. A 2 × 6 ANCOVA (trial type × lag), using working memory span as a covariate and difference scores (study–test vs. study–study) as the dependent variable, revealed no significant interactions between lag and working memory capacity, F(5, 185) = 0.66, MSE = .036, p = .655, ω̂p² < .001, or between trial type, lag, and working memory, F(5, 185) = 1.48, MSE = .031, p = .197, ω̂p² < .001. Follow-up t-tests at each lag showed that difference scores were greater for the lower capacity group than the higher capacity group at lags 0 and 9, t(37) = 3.28, p = .002, d = 1.05 and t(37) = 3.84, p < .001, d = 1.23, but did not significantly differ at any of the other lags (Bonferroni adjusted alpha level of .0083). Thus, although all subjects benefitted from retrieval practice, there was no obvious pattern of optimal lag between study and retrieval trials as a function of working memory capacity.
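For readers who want to see how a correlation maps onto the t statistics quoted above, the following Python sketch applies the standard conversion t = r√(n − 2) / √(1 − r²) and the usual Bonferroni division of alpha by the number of tests. The function names are illustrative assumptions. Because the reported r values are rounded, the recomputed t values will be close to, but not necessarily identical with, those in the text.

```python
import math


def t_from_r(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def bonferroni_alpha(alpha, n_tests):
    """Per-comparison alpha level after a Bonferroni adjustment."""
    return alpha / n_tests


# 2-day retention interval, feedback condition: r = -.42 with 39 subjects.
t_feedback_2day = t_from_r(-0.42, 39)

# Six lags were compared, giving the adjusted alpha reported in the text.
alpha_per_lag = bonferroni_alpha(0.05, 6)

print(round(t_feedback_2day, 2), round(alpha_per_lag, 4))
```

The same adjustment with 15 pairwise comparisons gives the .0033 level used for the initial learning analysis.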
Figure 2. Difference in final test performance (study–test items minus study–study items) as a function of working memory span score, retention interval (10
minutes vs. 2 days), and feedback condition. Black lines represent the least squares linear regression.
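The least squares lines mentioned in the caption, like the per-subject slopes across lag used in the Results, come from simple linear regression. A minimal Python sketch of that computation follows; the function name is an assumption and the difference scores are hypothetical values made up for illustration, not data from the study.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Lags used in the design.
LAGS = [0, 1, 3, 5, 7, 9]

# Hypothetical difference scores (study-test minus study-study) for one
# subject, constructed here as 0.02 + 0.01 * lag purely for illustration.
example_diffs = [0.02, 0.03, 0.05, 0.07, 0.09, 0.11]

slope, intercept = least_squares_line(LAGS, example_diffs)
print(slope, intercept)
```

For the Figure 2 panels, x would instead be the working memory span scores and y the difference scores of the subjects in one between-subjects group.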
knowing ratings. Journal of Verbal Learning and Verbal Behavior, 19, 338–368.
Pan, S. C., Pashler, H., Potter, Z. E., & Rickard, T. C. (2015). Testing enhances learning across a range of episodic memory abilities. Journal of Memory and Language, 83, 53–61.
Pyc, M. A., & Rawson, K. A. (2007). Examining the efficiency of schedules of distributed retrieval practice. Memory & Cognition, 35, 1917–1927.
Pyc, M. A., & Rawson, K. A. (2009). Testing the retrieval effort hypothesis: Does greater difficulty correctly recalling information lead to higher levels of memory? Journal of Memory and Language, 60, 437–447.
Pyc, M. A., & Rawson, K. A. (2010). Why testing improves memory: Mediator effectiveness hypothesis. Science, 330, 335.
Roediger, H. L., & Karpicke, J. D. (2006a). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1, 181–210.
Roediger, H. L., & Karpicke, J. D. (2006b). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249–255.
Roediger, H. L., & Karpicke, J. D. (2011). Intricacies of spaced retrieval: A resolution. In A. S. Benjamin (Ed.), Successful remembering and successful forgetting: Essays in honor of Robert A. Bjork (pp. 23–48). New York, NY: Psychology Press.
Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-prime user’s guide. Pittsburgh, PA: Psychology Software Tools.
Tse, C.-S., & Pu, X. (2012). The effectiveness of test-enhanced learning depends on trait test anxiety and working-memory capacity. Journal of Experimental Psychology: Applied, 18, 253–264.
Unsworth, J. (2016). Working memory capacity and recall from long-term memory: Examining the influence of encoding strategies, study time allocation, search efficiency, and monitoring abilities. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42, 50–61.
Unsworth, N., & Engle, R. W. (2007). The nature of individual differences in working memory capacity: Active maintenance in primary memory and controlled search from secondary memory. Psychological Review, 114, 104–132.
Unsworth, N., Heitz, R. P., Schrock, J. C., & Engle, R. W. (2005). An automated version of the operation span task. Behavior Research Methods, 37, 498–505.
Unsworth, N., Spillers, G. J., & Brewer, G. A. (2012). Working memory capacity and retrieval limitations from long-term memory: An examination of differences in accessibility. Quarterly Journal of Experimental Psychology, 65, 2397–2410.
Whitten, W. B., & Bjork, R. A. (1977). Learning from tests: The effects of spacing. Journal of Verbal Learning and Verbal Behavior, 16, 465–478.
Wiklund-Hörnqvist, C., Jonsson, B., & Nyberg, L. (2014). Strengthening concept learning by repeated testing. Scandinavian Journal of Psychology, 55, 10–16.