System
journal homepage: www.elsevier.com/locate/system
A R T I C L E I N F O

Keywords:
Multimodal input enhancement
Authentic L2 input
L2 captioned video
Audio-visual synchrony
Phonolexical representations
L2 pronunciation
Auditory form recognition
Eye-tracking

A B S T R A C T

The benefits of multimodal input on foreign language listening comprehension and vocabulary learning are well-established, but only recently has its impact on pronunciation been explored. In this study, we audio-synchronized the highlighting of target words in captions to promote the activation of their phonolexical representations at the time they are auditorily processed and improve phonological updating in the mental lexicon. We recorded the eye movements of 58 L1-Spanish/Catalan learners of English as they watched two videos with target words (TWs) highlighted 500 ms or 300 ms before auditory onset, highlighted from caption onset or, alternatively, under one of two control conditions (unenhanced and uncaptioned). We assessed updating of phonolexical forms in terms of more accurate and faster rejection of mispronunciations of the TWs from pre- to post-test. Results showed that 300 ms synchronized enhancement and unsynchronized enhancement led to longer fixation duration, unsynchronized enhancement reduced TW skipping probability, and both synchronized conditions promoted higher audio-visual synchrony in learners’ caption reading. While only the unsynchronized condition resulted in more accurate responses at post-test, all enhancement conditions led to significantly faster rejection of mispronunciations. These initial findings call for further research on audio-synchronized enhancement and its potential benefits for L2 pronunciation learning.
1. Introduction
The scarcity of exposure to authentic spoken input in the foreign language (FL) classroom has a detrimental effect on the development of listening and speaking skills among second language (L2)1 learners (Muñoz, 2008, 2014). This can be partially compensated
for by carrying out leisure activities in the target language, such as watching television and movies, the preferred source of FL input
among young learners (Lindgren & Muñoz, 2013). L2 captioned video, unlike real spoken interaction, offers the possibility of having
orthographic input available as spoken input is being processed, a feature that promotes auditory word recognition, speech segmentation, and the mapping of L2 orthography to phonological form (Bird & Williams, 2002; Charles & Trenkic, 2015; Mitterer &
McQueen, 2009). However, second language acquisition research has only recently started to explore the potential of captioned video
* Corresponding author. E-mail addresses: [email protected] (V. Galimberti), [email protected] (J.C. Mora), [email protected] (R. Gilabert).
1 The term foreign language (FL) learning refers to formal instruction in a language not commonly spoken in the country of the speaker. In this paper, second language (L2) learning is used to refer to all learners of a language, including foreign language learners, when the focus is on the learning process rather than the context.
https://doi.org/10.1016/j.system.2023.103078
Received 14 June 2022; Received in revised form 15 May 2023; Accepted 30 May 2023; Available online 16 June 2023
0346-251X/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
for pronunciation teaching and learning (Wisniewska & Mora, 2020), a still severely underresearched area. Important issues that
remain unanswered are whether captioned video without any kind of manipulation can draw the viewers’ attention to phonological
form and what sorts of input transformations can enhance attention to phonological form.
In this study, we investigated whether textual enhancement of words in L2 captions, synchronized with their auditory onset, directs
viewers’ attention to auditory target forms, potentially leading to the updating of phonological word forms. To this end, we manipulated the captions in L2 videos by highlighting a selection of English words that present pronunciation difficulties for L1 Spanish/
Catalan learners of English and are typically mispronounced. The textual highlighting was temporally synchronized with the auditory
onset of the target words so that words would highlight in yellow immediately before they were heard. This enhancement procedure
was expected to promote a comparison of the phonolexical representations of the target words, activated in the learners’ mental
lexicons through textual enhancement, with their auditory forms as produced by the speakers in the video. We analyzed learners’ eye
gaze behavior as an online measure of attention allocation during the viewing, which is considered the first step towards deeper
processing of target forms (Conklin et al., 2018; Montero Pérez et al., 2015). We used an auditory-only speeded lexical decision task to
test whether audio-synchronized textual enhancement promoted the updating of L2 phonolexical representations.
2. Background
From a psycholinguistic point of view, the interpretation of spoken language involves several levels of processing and representation (Ramus et al., 2010). As the input speech signal unfolds in time, speech sounds and their acoustic properties are encoded into
abstract language-specific sub-lexical phonological representations that provide access to the phonological representation of word
forms (phonolexical representations) stored in the mental lexicon together with their semantic and orthographic representations. For
proficient speakers, these phonetic, phonological, and semantic decoding processes are automatic, making speech perception and
auditory word recognition seemingly effortless. In a similar way, the speech production process involves the efficient operation of
semantic, phonological, and phonetic encoding mechanisms that make the selection, retrieval, and articulation of word forms automatic for proficient speakers. For L2 learners, however, L1-based perception and the imprecision of their phonolexical representations
make L2 speech perception and L2 word recognition effortful and inefficient, posing processing difficulties that severely hinder
automaticity in L2 speech perception and production (Darcy et al., 2013).
Cook et al. (2016) distinguish between difficulties in L2 phonological encoding, when a learner cannot distinguish two words which
differ by a single phoneme, and problems in L2 phonolexical encoding, when a word is confused with others despite not involving a
difficult phonological contrast, because its phonolexical representation is “fuzzy” or imprecise. Issues with L2 phonolexical encoding
lead to low speed and accuracy of lexical access and unstable form-to-meaning mappings, which may impair language development.
Therefore, the ability to perceive and produce L2 speech efficiently depends not only on the correct identification and categorization of
L2 sounds, but also on the extent to which learners’ phonolexical representations faithfully reflect the lexical and sub-lexical properties
of the speech input to be processed.
A growing body of research has used lexical decision tasks to test the lexical encoding of L2 phonological contrasts (e.g., Darcy
et al., 2013; Darcy & Holliday, 2019; Llompart & Reinisch, 2019). Lexical decision tasks which require learners to classify stimuli as
words or nonwords as fast as possible test learners’ speed and accuracy in auditorily recognizing L2 word forms (Harrington, 2006).
Performance in this type of speeded tasks has been shown to reflect the degree of automaticity in L2 lexical access resulting from lexical
acquisition (Williams & Paciorek, 2016). In a lexical decision task where nonwords reflect L1-based mispronunciations of real L2
words, lower speed and accuracy rates could reflect the instability of imprecise L2 phonolexical representations (Cook et al., 2016) or
the higher processing cost of the larger lexical competition associated with imprecise phonolexical representations (Broersma & Cutler,
2008). On the other hand, higher speed and accuracy rates in rejecting mispronounced forms as words would indicate higher stability
of lexical representations and higher automaticity in lexical access (Pellicer-Sánchez, 2015). In the lexical decision task used in the
current study, higher response accuracy and latency gains in rejecting the mispronounced auditory forms of the target words in the L2
captioned videos were deemed to be indicative of updates in the mental lexicon.
In the foreign language learning context, learners are exposed to L1-accented L2 speech from peers and teachers, which can lead to
the establishment of L1-accented phonolexical representations for L2 words (Llompart & Reinisch, 2018). Pronunciation-focused
instruction and exposure to target-like models, including L2 captioned TV shows and films containing authentic speech at a naturally fast pace, might be effective at making these representations more target-like (Darcy & Holliday, 2019). The presence of verbatim
captions offers processing support to language learners, allowing for the simultaneous processing and integration of visual (the text in
the captions) and auditory (the corresponding soundtrack) information (Mayer, 2014). This hypothesis is consistent with the findings
of the few studies to date which have investigated the effects of captioned video on L2 speech learning. L2 captioned video has been
shown to help listeners segment the continuous stream of speech into auditory word forms, improving auditory word recognition and
real-time L2 listening comprehension (Charles & Trenkic, 2015). In addition, L2 captioned video (unlike L1-captioned or uncaptioned
video) has been hypothesized to support lexically-guided retuning of L2 perceptual categories, as shown by the processing advantage
of advanced L2 learners when exposed to an unfamiliar regional accent in this viewing mode (Mitterer & McQueen, 2009). Even less
advanced learners have been found to incidentally develop their L2 speech perception skills when watching videos with L2 captions,
but not with L1 captions, confirming the language learning potential of this viewing modality (Birulés-Muntané & Soto-Faraco, 2016).
However, large individual differences exist in how learners allocate attention to visual (written) and auditory input when viewing
L2 captioned video, as reflected by learners’ differences in visual behavior when processing captioned video (Kam et al., 2020).
Reading behavior in the context of audiovisual dynamic texts (captions) is qualitatively different from reading static text in that the
processing of fleeting text occurs while auditory information from the soundtrack and on-screen movement compete for attention,
potentially interfering with reading and leading to cognitive overload as the allocation of attention switches between the captions, the
moving image, and the auditory input (Kruger & Steyn, 2014). As a result, text and auditory processing are often misaligned, probably
hindering the potential benefits of the simultaneous processing of auditory and textual input. The temporal synchronization of textual
enhancement with the auditory onset of words has the potential of guiding learners’ attention to the auditory form of words at the
moment of visual word recognition.
The use of enhancement techniques such as bold-facing or highlighting to augment the salience of linguistic features is underpinned
by Schmidt’s (1990) Noticing Hypothesis, which claimed that only the L2 input that has been noticed is converted into intake for
further processing, and Sharwood Smith’s (1991, 1993) Input Enhancement Hypothesis, which stated that learners can be guided
towards noticing a language feature by manipulating the feature’s visual salience and frequency in the input. Eye-tracking studies have
shown that textual enhancement is effective at drawing learners’ attention to form while processing meaning in the L2 input (Winke,
2013), and that increased attention on the enhanced structures may lead to better performance in language tests. For example, Lee and
Révész (2020) found gains in how accurately learners used the past simple and present perfect to be related to increased attention
allocation to form with enhanced, but not with unenhanced captions. Similarly, eye-tracking measures of fixation duration during
exposure to textually enhanced input have been found to correlate with vowel encoding accuracy gains in an orthographic form
recognition task (Alsadoon & Heift, 2015).
The benefits of input enhancement for language learning have been investigated for various grammatical and lexical aspects, but
surprisingly little research has focused on pronunciation. In a study on the influence of orthography in Russian phonolexical acquisition, Showalter (2019) found that textual input enhancement helped learners associate auditory forms with picture stimuli more
effectively than explicit instruction. The author suggests that textually enhancing difficult grapheme-phoneme correspondences in
multimodal input may have forced learners to attend more carefully to auditory input and select important information in an attempt
to figure out phonological rules, whereas receiving the rules in advance and keeping them in mind during the listening task may have
been too taxing. Our prediction is that textual enhancement would similarly promote updating of phonolexical forms during multimodal exposure to authentic input through L2 captioned video. Since readers tend to read ahead of the audio in the L1 and L2, both in
reading-while-listening (Conklin et al., 2020) and captioned video viewing (Wisniewska & Mora, 2018), temporally synchronizing the
highlighting of target words in captions with their auditory onset should enable learners to direct their attention to the auditory form of
words once they have been activated through their highlighted orthographic form.
Previous research on reading-while-listening (Bailly & Barbour, 2011; Gerbier et al., 2018) has shown that visual enhancement
occurring before the word auditory onset effectively promotes the simultaneous processing of orthographic and auditory word forms.
For example, Gerbier et al. (2018) found that highlighting each word in a text 300 ms before its auditory onset triggered fewer but
longer fixations per word and fewer regressive saccades, generating a more fluent reading trajectory that helped readers keep pace
with the voice reading the text. The 300 ms time-lag interval between enhancement and auditory onset, which increased reading
fluency in Gerbier et al.’s (2018) reading-while-listening study, may be equally as effective in the context of video. However, viewers
may be mostly focused on the moving image at the center of the screen, or reading captions ahead of the audio, until the appearance of
text attracts their gaze to the enhanced word. In this case, a longer time lag of 500 ms would allow them to notice the enhanced word in
the peripheral visual field, plan and execute the saccade towards the word, which takes about 200 ms (Godfroid, 2019), and activate
the stored phonological representation before hearing the word’s auditory onset. In the current study we explore the differential effects
of two time-lag intervals (300 ms vs. 500 ms) between highlighting and auditory word onsets and predict that exposure to synchronized L2 captioned video will promote audio-visual synchrony, enhancing the updating of the target words’ phonolexical
representations.
3. Research questions
The present study explores the effects of audio-synchronized textual enhancement in L2 captioned videos on the processing of
auditory word forms and their phonological update, as measured by a lexical decision task requiring accurate and timely rejection of
non-targetlike auditory forms.
The study addresses the following research questions.
1. Which type of textual enhancement (audio-synchronized at 300 ms, 500 ms, unsynchronized or unenhanced) is more effective at
enhancing the simultaneous processing of target visual and auditory word forms in L2 captioned videos?
2. Does the simultaneous processing of visual and auditory word forms in L2 captioned videos promote the updating of phonolexical
representations?
In relation to these research questions, we formulated the following hypotheses:
1. Textual enhancement right before auditory onset will more effectively enhance the simultaneous processing of visual and auditory
word forms, compared to unsynchronized enhancement and no enhancement. In particular, we expect the 500 ms time-lag interval
between enhancement and auditory onset to promote closer audio-visual synchrony than the other conditions.
2. The simultaneous processing of visual and auditory word forms generated by the synchronized enhancement of words in L2
captions will promote phonolexical update to a larger extent than exposure to captions containing unsynchronized enhancement
and no enhancement. In particular, we expect the 500 ms time-lag interval between enhancement and auditory onset to promote
larger gains in lexical decision accuracy and response times than the other conditions.
4. Methodology
4.1. Participants
Fifty-eight first-year university students (female = 51) pursuing an English degree at a public university in Spain were recruited.
They were bilingual L1 speakers of Spanish and Catalan and reported a B1–B2 English level based on language certificates and/or self-
assessment. As a measure of global L2 proficiency (Kostromitina & Plonsky, 2021), we administered an elicited imitation task (Ortega
et al., 2002). In this task, participants listened to and repeated 30 sentence items of increasing word length and structural complexity.
Following Ortega et al.’s (2002) rubric available in the IRIS digital repository (Marsden et al., 2016), sentence productions were given
0 to 4 points based on repetition accuracy to a maximum score of 120. Participants obtained a mean score of 97 out of 120 (range
64–118, SD = 13.2), indicating an upper intermediate level of proficiency. Participants were randomly assigned to four groups of
comparable proficiency (F (3, 54) = 0.981, p = .41) according to the viewing conditions described in section 4.3. Most participants
were familiar with watching English language TV and videos (an average of 5 h a week) with and without captions (see Table 1 for
participants’ demographics).
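As a concrete illustration of the group-comparability check reported above, the following minimal R sketch (simulated placeholder data, not the study's dataset or the authors' script) runs the kind of one-way ANOVA on elicited imitation scores across the four viewing groups.

```r
# Minimal sketch (simulated data): one-way ANOVA checking that
# elicited-imitation proficiency is comparable across the four groups.
set.seed(123)
ei <- data.frame(
  score = c(rnorm(21, 98, 13), rnorm(21, 98, 14), rnorm(8, 90, 10), rnorm(8, 97, 16)),
  group = factor(rep(c("sync500", "sync300", "unenhanced", "uncaptioned"),
                     times = c(21, 21, 8, 8)))
)
summary(aov(score ~ group, data = ei))  # the paper reports F(3, 54) = 0.981, p = .41
```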
4.2. Materials
4.2.1. Clips
Four clips were selected from the first episode of the TV series ‘The Good Place’, which had been previously used in a language
acquisition study with a similar sample of learners (Pattemore & Muñoz, 2020). The TV series genre was selected because it contains highly contextualized and predictable dialogues on familiar topics (Ghia, 2012) that allow learners to allocate attentional resources to
linguistic form. One clip (1′ 40′′ ) had unenhanced captions and contained 19 target words (TWs), followed by a sample clip (00′ 35′′ ),
not included in the analyses, used to familiarize participants in the experimental groups with caption enhancement. The last two clips
(1′ 50′′ each) contained 9 TWs each that were highlighted in yellow at different time-lags with auditory onset, depending on the
experimental condition (see section 4.3). The captions were edited with Aegisub, a software that allows for manual synchronization
thanks to an in-built spectrum analyzer, and hardcoded as one- or two-line captions in Arial font size 20. An analysis with the program
Vocabprofile Compleat (Cobb 2015) showed that the most frequent 1000-, 2000- and 3000-word families provided 90%, 95% and 96%
coverage of the script, respectively. Following Van Zeeland and Schmitt (2013) and Rodgers (2013), we expected a coverage level of
Table 1
Participants’ demographics by group. Cells show M (SD) [95% CI] for the 500 ms synchronized (N = 21), 300 ms synchronized (N = 21), unenhanced captions (N = 8) and uncaptioned (N = 8) groups, in that order.

Age at testing: 20.65 (3.03) [19.23, 22.07]; 20.26 (2.50) [19.09, 21.43]; 19.56 (0.85) [18.85, 20.27]; 23.70 (10.25) [15.13, 32.27]
L2 proficiency (0–120 points): 97.80 (13.47) [91.50, 104.10]; 98.50 (13.78) [92.05, 104.95]; 89.50 (9.55) [81.52, 97.48]; 97.38 (15.85) [84.13, 110.62]
Age of onset of L2 learning in school: 4.90 (1.37) [4.26, 5.54]; 5.50 (1.32) [4.88, 6.12]; 6.25 (3.45) [3.36, 9.14]; 4.88 (1.96) [3.24, 6.51]
Extracurricular classes (years): 2.95 (3.91) [1.12, 4.78]; 2.28 (3.70) [0.54, 4.01]; 5.63 (4.14) [2.17, 9.08]; 4.13 (4.12) [0.68, 7.57]
Estimated spoken L2 input (a): 28.55 (16.75) [20.71, 36.39]; 22.35 (11.23) [17.09, 27.61]; 18.50 (7.95) [11.86, 25.14]; 25.00 (10.00) [16.64, 33.36]
Estimated L2 output (b): 10.90 (12.48) [5.06, 16.74]; 10.48 (11.57) [5.06, 15.89]; 5.75 (1.98) [4.09, 7.41]; 12.63 (12.76) [1.96, 23.29]
Exposure to L2 videos and TV (hours per week): 8.27 (5.99) [5.47, 11.08]; 5.26 (3.19) [3.76, 6.75]; 4.21 (1.56) [2.90, 5.51]; 5.71 (3.81) [2.52, 8.89]
Self-estimated L2 proficiency (1 = very poor – 9 = proficient) (c): 6.41 (1.66) [5.63, 7.19]; 6.49 (1.42) [5.82, 7.16]; 6.47 (0.81) [5.80, 7.14]; 5.40 (2.54) [3.27, 7.53]

a English input from L1 and L2 speakers in hours per week.
b Oral L2 use with L1 and L2 speakers in hours per week.
c Averaged self-estimated reading, writing, listening, speaking and pronunciation proficiency.
Table 2
Linguistic properties of the target words in the enhanced subset.
Word class Orthographic length Phonological lengtha Occurrences in clips Lexical frequencyb Error category
Table 3
Presentation properties of the target words in the enhanced subset.
Caption lines Line with TW TW position Presentation time (ms) AOI positiona (px) AOI size (px) Auditory duration (ms)
English word. Two L1 speakers of English (different from the one who recorded the stimuli) achieved ceiling performance on the task
(92% and 94%). Nonword rejection rates based on the accuracy and response latencies of correctly identifying the mispronounced TWs
as nonwords were used as dependent measures.
4.3. Procedure
All testing and exposure took place individually and in one session of approximately 1 h. Upon entering the research laboratory,
each participant signed a consent form and was randomly assigned to one of the viewing conditions. Participants did the lexical
decision task, then watched the clips as their eye movements were recorded with a Tobii T120 eye-tracker integrated into a 17″ monitor, with a sampling rate of 120 Hz, an accuracy of 0.5°, and a resolution of 0.2°. With participants seated at a distance of between 60 and 64 cm from the screen, the Tobii T120 was expected to track fixations within the study’s areas of interest appropriately, as AOI height (50–55 pixels) subtended a visual angle of ~2° and AOI width was larger. Before the viewing, a 9-point
calibration and validation procedure was performed. After the first clip with unenhanced captions and the sample clip with target
words highlighted at different time intervals, participants in the experimental groups watched one clip under one of the synchronized
conditions (500 ms or 300 ms), and the other clip with TWs highlighted at caption onset (henceforth “unsynchronized enhancement”)
as a within-subject control condition (Fig. 1). To keep the focus on meaning, a multiple choice comprehension question was included
after each clip, and participants were not told that they would repeat the LDT as a post-test. After the LDT post-test, they did the elicited
imitation task. The participants were not informed of the study aim until the end of the session.
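To make the timing manipulation described in this section and in 4.2.1 concrete, the following R sketch (our own illustration with hypothetical target words and timestamps, not the authors' Aegisub workflow) derives the highlight onset of each TW from its caption and auditory onsets under the three enhancement conditions.

```r
# Minimal sketch with hypothetical timestamps (ms from clip start): when the
# yellow highlight of a target word should begin under each condition.
tws <- data.frame(
  word           = c("comfortable", "vegetable"),  # hypothetical TWs
  caption_onset  = c(12000, 15400),                # caption line appears on screen
  auditory_onset = c(12850, 16120)                 # word is heard in the soundtrack
)
tws$highlight_sync500 <- tws$auditory_onset - 500  # 500 ms before auditory onset
tws$highlight_sync300 <- tws$auditory_onset - 300  # 300 ms before auditory onset
tws$highlight_unsync  <- tws$caption_onset         # from caption onset (unsynchronized)
tws
```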
4.4. Analyses
The eye-tracking data was extracted from Tobii Studio using the I-VT fixation filter in Tobii Pro. Less than 75% of the data was
available for three participants, so their eye-tracking data were excluded from the analyses. A fourth participant who did not have
fixations in the caption area was excluded. On average, 92.6% of eye-tracking data was available from the participants included in the
analyses (n = 46). Following Godfroid (2019), fixations shorter than 50 ms and longer than 800 ms were removed from the analysis of
fixation duration and fixation distance (but not skipping probability), causing further exclusion of 3.5% of the fixations recorded and
leaving 1043 fixations on a total of 1702 fixated and skipped items. The statistical models were built in RStudio using the glmer and
lmer functions of the lme4 package, and Bonferroni-adjusted significance tests for pairwise contrasts were obtained using the lsmeans
function of the emmeans package. The package performance was used to assess model performance and obtain effect sizes (marginal and
conditional R-squared values) for linear mixed models. The function r.squaredGLMM in the package MuMIn was used to obtain
pseudo-R-squared values based on the delta method for generalized linear mixed models. Non-parametric bootstrapping with
replacement (n = 1e4 simulations) was used to calculate basic confidence intervals from the empirical distribution of the parameter
estimate and independently from model assumptions. Since the assumptions underlying the computation of the asymptotic 95% CIs
(and therefore of the p values) did not hold for some of the models, we decided to use bootstrapping on the regression coefficients of all
the models to provide theoretically valid estimates of the true population parameter, even when there was no evidence against the
validity of the model assumptions.
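A minimal sketch of the bootstrap procedure described above is given below; the resampling unit and the data frame 'et' are assumptions on our part, and this is not the authors' code.

```r
# Non-parametric bootstrap (10,000 resamples) of mixed-model coefficients,
# with basic CIs computed from the empirical distribution of the estimates.
library(lme4)
library(boot)

boot_fixef <- function(data, idx) {
  d <- data[idx, ]                                  # resample observations with replacement
  m <- lmer(fixation_distance ~ condition + presentation_time +
              (1 | participant) + (1 | item), data = d)
  fixef(m)
}
# 'et' would be the fixation-level data frame used for the eye-tracking models:
# b <- boot(et, boot_fixef, R = 10000)
# boot.ci(b, type = "basic", index = 2)             # basic CI for the first condition term
```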
We first analyzed the subset of enhanced target words. A mixed effects gamma regression with a log-link function was used to
compare the effects of viewing condition on total fixation duration, the sum of the duration for all fixations within an AOI. The effects
of viewing condition on skipping probability (the proportion of unfixated words relative to the total number of words in the subset),
were analyzed with a fixed effect logistic regression based on a binomial distribution and logit link function. As an indicator of the
degree of synchronization between visual and auditory input processing, we built a linear mixed model to assess the effects of viewing
condition on fixation distance, which was defined as the difference between the timestamp of the first fixation on a target word in the
caption and the onset of its auditory form in the soundtrack (Wisniewska & Mora, 2018). First fixations that did not happen within
3000 ms of word auditory onset (n = 12) were replaced by system missing values in the analysis of fixation distance. To control for
possible confounds, all eye-tracking models were run including Presentation Time (see section 4.2.2.) and Frequency of Occurrence (in
the four clips) as fixed effects besides Viewing Condition, and random intercepts for participants and items. If the effects of a covariate
did not reach significance, the model was re-run excluding the covariate.
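The three eye-gaze models described in this paragraph could be specified in lme4 roughly as follows; this is a sketch under our own assumptions, with simulated placeholder data and variable names, not the authors' code.

```r
# Minimal sketch with simulated data: gamma GLMM for total fixation duration,
# fixed-effect logistic regression for skipping, and LMM for fixation distance.
library(lme4)

set.seed(1)
et <- expand.grid(participant = factor(1:46), item = factor(1:18))
et$condition         <- factor(sample(c("sync500", "sync300", "unsync", "unenhanced"),
                                      nrow(et), replace = TRUE))
et$presentation_time <- rnorm(nrow(et))
et$frequency         <- rnorm(nrow(et))
et$total_fix_dur     <- rgamma(nrow(et), shape = 2, rate = 0.005)  # ms
et$skipped           <- rbinom(nrow(et), 1, 0.3)
et$fixation_distance <- rnorm(nrow(et), 200, 400)                  # ms

m_tfd  <- glmer(total_fix_dur ~ condition + presentation_time + frequency +
                  (1 | participant) + (1 | item),
                family = Gamma(link = "log"), data = et)
m_skip <- glm(skipped ~ condition + presentation_time + frequency,
              family = binomial(link = "logit"), data = et)
m_dist <- lmer(fixation_distance ~ condition + presentation_time + frequency +
                 (1 | participant) + (1 | item), data = et)
```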
The lexical decision task analysis was carried out on the responses of all 58 participants. For the subset of enhanced nonwords, a
logistic mixed model based on a binomial distribution and logit link function was run to compare the effects of viewing condition and
testing time (T1-T2) on participants’ accuracy in the LDT. The accuracy gains reported in the descriptive tables were computed by
assigning a gain score for each item eliciting an inaccurate response at T1 and an accurate response at T2. A mixed effects gamma
regression was run to measure the effects of viewing condition and testing time on reaction times (RTs). Both the model targeting
accuracy as a dependent variable and the model targeting reaction times were run including Viewing Condition, Time and the interaction
of Viewing Condition * Time as fixed effects, and random intercepts for participants and items. RT gains were obtained for items eliciting
accurate responses at T2 only, by subtracting each absolute RT at T1 from the corresponding one at T2 and discarding negative gains.
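For the lexical decision data, the accuracy and reaction-time models with the Viewing Condition × Time interaction, followed by Bonferroni-adjusted contrasts, could look roughly like the sketch below (again with simulated placeholder data, not the authors' code).

```r
# Minimal sketch: logistic and gamma mixed models on simulated LDT data,
# followed by Bonferroni-adjusted pre/post contrasts within each condition.
library(lme4)
library(emmeans)

set.seed(2)
ldt <- expand.grid(participant = factor(1:58), item = factor(1:18), time = factor(1:2))
ldt$condition <- factor(sample(c("sync500", "sync300", "unsync", "unenhanced", "uncaptioned"),
                               nrow(ldt), replace = TRUE))
ldt$accuracy  <- rbinom(nrow(ldt), 1, 0.4)
ldt$rt        <- rgamma(nrow(ldt), shape = 4, rate = 0.002)          # ms

m_acc <- glmer(accuracy ~ condition * time + (1 | participant) + (1 | item),
               family = binomial(link = "logit"), data = ldt)
m_rt  <- glmer(rt ~ condition * time + (1 | participant) + (1 | item),
               family = Gamma(link = "log"), data = ldt)

emmeans(m_rt, pairwise ~ time | condition, adjust = "bonferroni")    # T1 vs. T2 per condition
```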
The same models described for the subset of enhanced words were run on the control subset of unenhanced words, with Group (4-
level variable) as a fixed effect instead of Viewing Condition (5-level variable), because words in this subset were watched under the
same unenhanced condition by all participants. This was done to check whether the participants belonging to each group exhibited, in
the absence of enhancement, a different behavior from the other groups (e.g., skipped more words or fixated on words for longer).
Additionally, we ran Pearson or Kendall tau correlations (for continuous and categorical variables, respectively) between participants’
proficiency and their eye gaze behavior, and between proficiency and LDT gains.
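The correlational checks mentioned here could be run along the following lines; the per-participant vectors are hypothetical and purely illustrative.

```r
# Minimal sketch: Pearson correlation for two continuous measures and
# Kendall's tau when one measure is ordinal/count-like.
set.seed(3)
proficiency  <- rnorm(46, 97, 13)                    # hypothetical EI scores
fix_duration <- rgamma(46, shape = 3, rate = 0.01)   # hypothetical mean fixation durations
ldt_gain     <- sample(0:6, 46, replace = TRUE)      # hypothetical accuracy gain counts

cor.test(proficiency, fix_duration, method = "pearson")
cor.test(proficiency, ldt_gain, method = "kendall")
```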
5. Results
RQ1 investigated the effectiveness of different types of textual enhancement at enhancing the simultaneous processing of target visual
and auditory word forms in L2 captioned videos.
enhancement conditions, indicating that these variables had a significant and positive effect on total fixation duration (Table 4).
Pairwise contrasts revealed no difference between any of the conditions (Table A.2).
The gamma regression on the control subset of unenhanced words, for which we used Group as a predictor instead of Viewing Condition because participants in all groups watched these words under the unenhanced condition, showed that Presentation Time had a significant effect on total fixation duration, but Group and Frequency of Occurrence did not (Table A.3). Pairwise contrasts on the
unenhanced subset revealed no difference between any of the groups.
Table 4
Fixed coefficients for the model examining total fixation duration on the enhanced subset of TWs.
Columns: b, SE, z, p; R²m and R²c (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept 5.72 0.12 46.16 <.001a 0.04 0.22 5.47 5.96 5.63 5.84
500 ms synchronization 0.03 0.13 0.26 .79 − 0.21 0.28 − 0.10 0.16
300 ms synchronization 0.14 0.13 1.07 .29 − 0.11 0.38 0.00 0.26
Unsynchronized enhancement 0.13 0.12 1.10 .27 − 0.10 0.37 0.01 0.24
Presentation time 0.10 0.06 1.81 .07 − 0.01 0.21 0.07 0.14
a p < .001.
Table 5
Fixed coefficients for the logistic regression examining skipping probability (enhanced subset).
Columns: b, SE, z, p; model R² values (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept − 0.84 0.46 − 1.81 .07 0.03 0.32 − 1.74 0.07 − 1.15 − 0.29
500 ms synchronization − 0.58 0.54 − 1.07 .29 − 1.64 0.48 − 1.14 0.16
300 ms synchronization − 0.54 0.54 − 0.99 .32 − 1.60 0.52 − 1.08 0.18
Unsynchronized enhancement − 1.04 0.52 − 2.02 .04a − 2.06 − 0.03 − 1.46 − 0.37
a p < .05.
Fig. 2. Distribution of pre-fixations (positive values) and post-fixations in relation to word auditory onset.
Table 6
Fixed coefficients for the linear mixed model examining fixation distance.
Columns: b, SE, df, t, p; R²m and R²c (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept 93.74 131.71 39.69 0.71 0.48 0.10 0.55 − 164.42 351.90 − 33.33 223.94
300 ms synchronization 220.27 93.38 588.20 2.36 .02* 37.25 403.30 31.73 398.69
Unsynchronized enhancement 410.47 67.89 599.81 6.05 <.001*** 277.40 543.50 266.09 554.42
Unenhanced captions 159.29 174.16 51.02 0.91 .37 − 182.06 500.60 − 30.45 332.04
Presentation time 221.74 102.94 15.96 2.15 .05* 19.98 423.50 165.69 277.96
visual synchronization compared to the unsynchronized condition, although not compared to the unenhanced condition. In addition,
the participants’ proficiency had a significant effect on fixation distance only under the unenhanced condition. Presentation time
affected total fixation duration and fixation duration, but not skipping probability (for the enhanced subset of TWs), whereas frequency
of occurrence did not have a significant effect on eye gaze behavior. The analysis of the subset of words that were watched under the unenhanced condition by all participants showed no between-group differences for total fixation duration and fixation distance, confirming that the results obtained for the subset of enhanced TWs depended on the enhancement condition. However, a higher skipping
probability was found for the 500 ms synchronized group on the unenhanced subset of words, indicating that participants in this group
may have naturally skipped more TWs than the others, regardless of enhancement.
RQ2 asked whether the simultaneous processing of visual and auditory word forms in L2 captioned videos promoted the updating of
phonolexical representations.
Table 7
Fixed coefficients for the fixed effects gamma regression examining reaction time.
Columns: b, SE, z, p; R²m and R²c (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept 7.45 0.09 85.04 <.001*** 0.10 0.32 7.28 7.62 7.35 7.55
500 ms synchronization 0.12 0.10 1.27 .20 − 0.07 0.31 − 0.02 0.27
300 ms synchronization 0.04 0.10 0.40 .69 − 0.15 0.23 − 0.09 0.17
Unsynchronized enhancement − 0.01 0.09 − 0.10 .92 − 0.19 0.17 − 0.13 0.12
Unenhanced captions 0.00 0.12 − 0.02 .98 − 0.23 0.22 − 0.15 0.13
Time − 0.06 0.03 − 2.08 .04* − 0.12 0.00 − 0.12 − 0.01
500 ms synchronization*time − 0.12 0.04 − 2.99 .003** − 0.20 − 0.04 − 0.20 − 0.04
300 ms synchronization*time − 0.05 0.04 − 1.27 .20 − 0.13 0.03 − 0.12 0.03
Unsynchronized enhancement*time − 0.03 0.03 − 0.98 .33 − 0.10 0.03 − 0.10 0.03
Unenhanced captions*time 0.04 0.04 0.87 .39 − 0.05 0.12 − 0.04 0.13
6. Discussion
This study explored the effects of viewing L2 videos with synchronized and unsynchronized caption enhancement, unenhanced
captions, or no captions on eye-gaze behavior and phonolexical update. Skipping probability, a measure of whether any visual
attention was paid to the area of interest, and total fixation duration, a measure of initial word retrieval and subsequent integration
with the context (Godfroid, 2019) were used, as in other studies on input enhancement (e.g., Montero Pérez et al., 2015; Winke, 2013),
to assess attention allocation. Fixation distance was used as an indicator of the degree of learners’ synchronization between the visual
and the auditory processing of the target words. To assess the effect of synchronized textual input enhancement on the update of
phonolexical representations, we tested learners’ ability to recognize mispronunciations of the TWs in an auditory lexical decision task
before and after viewing the captioned clips. Since reducing the activation of non-targetlike lexical representations facilitates lexical
access (Darcy et al., 2013), learners’ accuracy and reaction times in response to mispronunciations of the TWs were analyzed as an
indicator of phonolexical update. We discuss our two research questions in light of the results obtained.
Our first research question asked which type of textual enhancement (audio-synchronized at 300 ms, 500 ms, unsynchronized or
unenhanced) is more effective at enhancing the simultaneous processing of target visual and auditory word forms in L2 captioned
videos. The 300 ms synchronized and the unsynchronized enhancement conditions were associated with longer total fixation duration
and the unsynchronized enhancement condition with a lower skipping probability, a finding in line with previous research on the use
of input enhancement to draw learners’ attention to specific L2 words and constructions during exposure to multimodal input
(Alsadoon & Heift, 2015; Lee & Révész, 2020). However, the robustness of the results for total fixation duration is weakened by the
discrepancy between the bootstrap and asymptotic confidence intervals and the small amount of variance explained by viewing
condition in the model. In addition, since the 500 ms group skipped words more frequently at baseline (unenhanced subset of words), a
different behavior from the other groups could not be excluded in the analysis of skipping probability data for the enhanced subset of
TWs. Overall, assuming participants paid more attention to the areas they actually fixated and those they fixated for longer (Leow,
2015), we would expect deeper processing of the TWs under unsynchronized enhancement.
However, a greater distance between first fixation on a word and its auditory onset was found under unsynchronized enhancement
than under the synchronized enhancement conditions, regardless of participants’ proficiency. This finding suggests that, under the
unsynchronized enhancement condition, participants looked at TWs as soon as the caption appeared and then either went back to
viewing (the image) while listening, which would not result in audio-visual synchrony, or went back to reading the caption after
having looked at the TWs well before their auditory onset. We cannot exclude the possibility that they would have managed to read the rest of the caption (including the TW) in synchrony with its auditory onset. However, due to the dynamic nature of captions, which remained onscreen for only about 3000 ms on average, and the short duration of the auditory form of the TWs (M = 506.66 ms, 95% CI
[445.34, 559.36]), the effort required to saccade towards the enhanced word, process it, and go back to reading the caption is unlikely
to have promoted audio-visual synchrony.
While unsynchronized textual enhancement provided the advantage of attracting the viewers’ gaze to a larger number of target
words and increasing attention in terms of total duration of fixations on these words, the synchronization of textual enhancement (both
300 ms and 500 ms before auditory onset) seemed to be more effective in terms of audio-visual synchrony in the allocation of attention
to the TWs. The timely visual processing of the orthographic form of the TWs may have generated the activation of corresponding
representations immediately before hearing their spoken form, allowing for a comparison with stored phonolexical representations.
Increased visual attention and audio-visual synchrony were expected to provide an advantage in the lexical decision post-test, as
learners moved from input processing to intake processing. This stage of learning involves hypothesis testing about language properties
and generates an initial product, held in working memory and mainly accessible via receptive testing, that can be later incorporated
into the learner’s internal system (Leow, 2015).
Our second research question asked whether the simultaneous processing of visual and auditory word forms in L2 captioned videos
promotes the updating of phonolexical representations. The significant accuracy gains of the unsynchronized condition, combined with
the lower skipping rate and longer fixation duration on the TWs, provide support to the hypothesis that highlighting target words from
the caption onset may direct the viewer’s attention to their auditory form, promoting phonological development. However, the validity
of this finding is limited by the strong effect of time, consistent with a practice effect, and by the significant accuracy gains of the
participants in the 500 ms synchronized group in response to the unenhanced subset of words. The analysis of reaction times showed
that only the viewing conditions involving caption enhancement promoted faster rejection of mispronunciations from pre- to post-test.
Together with the smaller fixation distance observed in the analysis of the eye-tracking data and, for the 300 ms condition, the longer total fixation duration on the TWs, the LDT findings provide partial support for our hypothesis that synchronized enhancement
would promote a comparison of the target realization of L2 words with the learners’ stored representations, leading to their update.
The unsynchronized enhancement condition also led to significant reaction time gains, in line with the assumption that less skipping
and longer fixations on the orthographic form of words would increase attention to the corresponding auditory forms.
The lack of reaction time gains on the enhanced subset of words by the unenhanced and uncaptioned control conditions points to a ‘speed–accuracy trade-off’ affecting decision-making under time pressure (Heitz, 2014). In other words, participants who had watched
the L2 videos with unenhanced captions and no captions may have resorted to more explicit and less easily accessible knowledge, at the
cost of automaticity (Williams & Paciorek, 2016). The significant reaction time gains of the 300 ms group and the unenhanced group on
the unenhanced subset of words represent a confound, as the former group may have employed a different response strategy that
prioritized speed over accuracy compared to the other three groups. For the unenhanced group, it appears that in the absence of
enhancement directing their attention to the target words, participants may have naturally focused on the unenhanced subset of words
during exposure to the video, resulting in quicker reaction times to those words compared to the other groups.
Overall, this study has found evidence that exposure to multimodal input impacts pronunciation positively, and that pronunciation
learning from L2 video can be further stimulated by a planned form-focused intervention. Our results provide initial support for the use
of synchronized and unsynchronized caption enhancement to direct learners’ attention to auditory forms and promote the update of L2
phonolexical representations. In line with our predictions, synchronized textual enhancement led to greater audio-visual synchrony
during the viewing of L2 video than unsynchronized enhancement, and promoted significant gains in word recognition response times,
an indication of phonolexical update. The enhancement of TWs from the onset of the caption line attracted long fixations, reduced
skipping probability and was also associated with significant gains in word recognition accuracy and response times, although this
condition seemed to negatively impact audio-visual synchrony during the viewing.
The lack of a significant language learning advantage of the audio-visually synchronized conditions compared to unsynchronized
enhancement contradicts our hypotheses and represents a limitation of this study. Another limitation is the significant effect of time in
the analysis of accuracy and reaction time data for the unenhanced subset of words, which indicates that practice effect might
represent a confound. In addition, the small number of participants in the unenhanced and uncaptioned groups may have affected the
results, despite the large number of test items and the inclusion of random effects in the statistical models. The short average length of
watching materials (less than 2 min per clip) represents another limitation, as the eye-tracking data collected during this short span
might not faithfully represent learners’ natural viewing behavior. Although a sampling rate of 120 Hz (one sample every ~8.33 ms) was
deemed appropriate to analyze fixations ranging between 50 ms and 800 ms, the relatively low sampling rate of the eye-tracker may
not have provided enough resolution in the context of viewing, where the text appears onscreen for a short time (but see Lee & Révész,
2020 for a discussion of this limitation). Finally, it would have been useful to collect stimulated recall data, in order to investigate
participants’ level of processing of enhanced words and of the corresponding auditory forms. To address these limitations, a follow-up study should involve a larger number of participants, distributed into balanced groups of equivalent size, and a longer treatment, with
longer time spans between pre- and post-test. Future studies may also combine synchronized enhancement with explicit instruction,
which has been found to promote larger gains than incidental exposure to enhanced input (Han et al., 2008).
Author contributions
The first author designed the study, collected the data and drafted the manuscript. The second author provided support on the
conceptualization and design of the study, data collection and analysis, and revised the manuscript. The third author provided support
on research design and data analysis and revised the manuscript.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Acknowledgments
This work was supported by the Spanish Ministry of Science, Innovation and Universities [grant PID 2019-107814 GB-I00] and by
the Secretary of Universities and Research of the Government of Catalonia and the European Social Fund [grant FI_2019]. We would
like to thank Giacomo Bizzarrini for their assistance with statistical analyses during the revision of this manuscript.
Table A.1
Total fixation duration by viewing condition (enhanced subset).
Table A.2
Results of pairwise contrasts for total fixation duration (enhanced target word subset). Columns: estimate, SE, z, p (Bonferroni-adjusted), 95% CI [lower, upper].
500 ms synchronization - 300 ms synchronization − 0.10 0.08 − 1.36 1.00 − 0.30 0.10
500 ms synchronization - unsynchronized enhancement − 0.10 0.06 − 1.79 0.44 − 0.24 0.05
300 ms synchronization - unsynchronized enhancement 0.00 0.06 0.06 1.00 − 0.14 0.15
Unenhanced captions - 500 ms synchronization − 0.03 0.13 − 0.26 1.00 − 0.37 0.30
Unenhanced captions - 300 ms synchronization − 0.14 0.13 − 1.07 1.00 − 0.47 0.20
Unenhanced captions - Unsynchronized enhancement − 0.13 0.12 − 1.10 1.00 − 0.45 0.19
Table A.3
Fixed coefficients for the gamma regression examining total fixation duration (unenhanced subset).
Columns: b, SE, z, p; R²m and R²c (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept 5.63 0.12 46.08 <.001*** 0.10 0.29 5.39 5.87 5.51 5.79
500 ms synchronization − 0.08 0.11 − 0.73 .47 − 0.31 0.14 − 0.24 0.08
300 ms synchronization − 0.14 0.11 − 1.21 .23 − 0.36 0.09 − 0.28 0.03
Presentation time 0.19 0.08 2.46 .01* 0.04 0.35 0.12 0.27
***p < .001, *p < .05.
Table A.4
Skipping probability by viewing condition (enhanced subset).
N M SD Lower Upper
Table A.5
Results of pairwise contrasts for skipping probability (enhanced subset). Columns: estimate, SE, z, p (Bonferroni-adjusted), 95% CI [lower, upper].
500 ms synchronization - 300 ms synchronization − 0.04 0.36 − 0.11 1.00 − 0.98 0.91
500 ms synchronization - unsynchronized enhancement 0.47 0.27 1.74 .49 − 0.24 1.17
300 ms synchronization - unsynchronized enhancement 0.50 0.27 1.84 .39 − 0.22 1.23
Unenhanced captions - 500 ms synchronization 0.58 0.54 1.07 1.00 − 0.85 2.01
Unenhanced captions - 300 ms synchronization 0.54 0.54 0.99 1.00 − 0.89 1.97
Unenhanced captions - Unsynchronized enhancement 1.04 0.52 2.02 .26 − 0.32 2.41
Table A.6
Fixed coefficients for the logistic regression examining skipping probability (unenhanced subset).
Columns: b, SE, z, p; model R² values (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept − 0.11 0.54 − 0.20 .84 0.06 0.42 − 1.17 0.95 − 0.52 0.29
500 ms synchronization 0.53 0.65 0.82 .41 − 0.74 1.80 0.03 0.97
300 ms synchronization 0.31 0.65 0.47 .64 − 0.96 1.57 − 0.18 0.80
Presentation time − 0.57 0.09 − 6.69 <.001*** − 0.74 − 0.40 − 0.71 − 0.33
*** p < .001.
Table A.7
Fixation distance by viewing condition (enhanced subset).

Table A.8
Results of pairwise contrasts for fixation distance (enhanced subset). Columns: estimate, SE, df, t, p (Bonferroni-adjusted), 95% CI [lower, upper].
500 ms synchronization - 300 ms synchronization − 220.00 94 590 − 2.35 .12 − 469.00 28.20
500 ms synchronization - unsynchronized enhancement − 410.00 68 600 − 6.03 <.001*** − 591.00 − 230.30
500 ms synchronization - unenhanced captions − 159.00 174 55 − 0.91 1.00 − 636.00 317.80
300 ms synchronization - unsynchronized enhancement − 190.00 69 600 − 2.77 .03* − 372.00 − 8.60
300 ms synchronization - unenhanced captions 61.00 175 55 0.35 1.00 − 417.00 538.50
unsynchronized enhancement - unenhanced captions 251.00 168 48 1.50 .84 − 210.00 712.40
***p < .001, *p < .05.
Table A.9
Fixed coefficients for the linear model examining fixation distance (unenhanced subset).
Columns: b, SE, df, t, p; R²m and R²c (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept 17.03 140.83 30.9 0.12 .91 0.11 0.56 − 258.98 293.00 − 87.32 144.60
500 ms synchronization 9.24 98.01 30.28 0.09 .93 − 182.85 201.30 − 154.03 131.70
300 ms synchronization 25.81 97.29 29.88 0.27 .79 − 164.86 216.50 − 120.70 175.50
Presentation time 272.01 119.78 17.73 2.27 .04* 37.25 506.80 217.86 326.60
* p < .05.
Table B.1
Average scores at time 1 for accurately pronounced target words.
Table B.2
Accuracy averaged scores (max 1) and gains for enhanced target nonwords.
Table B.3
Fixed coefficients for the logistic regression examining accuracy (enhanced subset).
Columns: b, SE, z, p; R²m and R²c (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept − 1.33 0.52 − 2.54 .01* 0.03 0.46 − 2.36 − 0.30 − 2.16 − 0.37
300 ms synchronization − 0.18 0.60 − 0.31 .76 − 1.36 0.99 − 1.29 1.05
Unsynchronized enhancement − 0.32 0.50 − 0.63 .53 − 1.31 0.67 − 1.38 0.76
Unenhanced captions − 0.67 0.78 − 0.86 .39 − 2.20 0.86 − 1.92 0.67
Uncaptioned 0.61 0.76 0.80 .42 − 0.88 2.09 − 0.80 1.87
Time 0.78 0.25 3.05 .002** 0.28 1.28 0.19 1.24
300 ms synchronization*time − 0.15 0.36 − 0.41 .68 − 0.86 0.56 − 0.86 0.56
Unsynchronized enhancement*time 0.00 0.31 0.01 .99 − 0.61 0.61 − 0.62 0.66
Unenhanced captions*time 0.08 0.39 0.20 .84 − 0.68 0.84 − 0.71 0.86
Uncaptioned*time − 0.47 0.38 − 1.25 .21 − 1.20 0.27 − 1.27 0.39
**p < .01, *p < .05.
Table B.4
Results of pairwise contrasts for accuracy (enhanced subset). Columns: viewing condition, contrast (Time 1–2), estimate, SE, df, z, p (Bonferroni-adjusted), 95% CI [lower, upper].
500 ms synchronization 1–2 − 0.78 0.25 Inf − 3.05 .10 − 1.61 0.05
300 ms synchronization 1–2 − 0.63 0.26 Inf − 2.45 .65 − 1.47 0.21
Unsynchronized enhancement 1–2 − 0.78 0.18 Inf − 4.35 <.001*** − 1.36 − 0.20
Unenhanced captions 1–2 − 0.86 0.29 Inf − 2.91 .16 − 1.81 0.10
Uncaptioned 1–2 − 0.31 0.28 Inf − 1.12 1.00 − 1.21 0.59
*** p < .001.
Table B.5
Fixed coefficients for the logistic regression examining accuracy (unenhanced subset).
Asymptotic Bootstrapped
Intercept 0.91 − 0.42 2.15 .03* 0.02 0.41 0.08 1.74 − 1.42 − 0.25
300 ms synchronization − 0.62 − 0.48 − 1.30 .19 − 1.56 0.31 − 0.21 1.39
Unenhanced captions 0.33 − 0.65 0.51 .61 − 0.94 1.59 − 1.41 0.73
Uncaptioned − 0.16 − 0.63 − 0.26 .80 − 1.41 1.08 − 0.97 1.15
Time − 0.55 − 0.17 − 3.24 <.001*** − 0.88 − 0.22 0.15 0.89
300 ms synchronization*time − 0.24 0.53 .59 − 0.34 0.60 − 0.62 0.37
Unenhanced captions*time − 0.09 − 0.32 − 0.28 .78 − 0.72 0.54 − 0.57 0.77
Uncaptioned*time − 0.02 − 0.31 − 0.05 .96 − 0.63 0.60 − 0.61 0.67
***p < .001, *p < .05.
Table B.6
Results of pairwise contrasts for accuracy (unenhanced subset). Columns: viewing condition, contrast (Time 1–2), estimate, SE, df, z, p (Bonferroni-adjusted), 95% CI [lower, upper].
500 ms synchronization 1–2 − 0.55 0.17 Inf − 3.24 0.03* − 1.08 − 0.02
300 ms synchronization 1–2 − 0.42 0.17 Inf − 2.48 0.37 − 0.95 0.11
Unenhanced captions 1–2 − 0.64 0.28 Inf − 2.34 0.55 − 1.50 0.22
Uncaptioned 1–2 − 0.57 0.27 Inf − 2.13 0.92 − 1.40 0.26
* p < .05.
Table B.7
Reaction time averages and gains for enhanced target nonwords.
Table B.8
Results of pairwise contrasts for reaction time (enhanced subset). Columns: viewing condition, contrast (Time 1–2), estimate, SE, df, z, p (Bonferroni-adjusted), 95% CI [lower, upper].
500 ms synchronization 1–2 0.18 0.03 Inf 6.80 <.001*** 0.09 0.27
300 ms synchronization 1–2 0.11 0.03 Inf 4.23 <.001*** 0.03 0.20
Unsynchronized enhancement 1–2 0.10 0.02 Inf 5.07 <.001*** 0.03 0.16
Unenhanced captions 1–2 0.02 0.03 Inf 0.72 1.00 − 0.08 0.13
Uncaptioned 1–2 0.06 0.03 Inf 2.08 1.00 − 0.03 0.16
*** p < .001.
Table B.9
Fixed coefficients for the RT gamma regression (unenhanced subset).
Columns: b, SE, z, p; R²m and R²c (intercept row); asymptotic 95% CI [lower, upper]; bootstrapped 95% CI [lower, upper].
Intercept 7.49 0.09 86.85 <.001*** 0.08 0.28 7.32 7.66 7.40 7.58
500 ms synchronization − 0.03 0.09 − 0.37 .71 − 0.21 0.14 − 0.14 0.08
300 ms synchronization − 0.01 0.09 − 0.15 .88 − 0.19 0.16 − 0.11 0.10
Unenhanced captions 0.03 0.11 0.29 .77 − 0.18 0.24 − 0.10 0.16
Time − 0.08 0.03 − 2.90 <.001*** − 0.14 − 0.03 − 0.14 − 0.03
500 ms synchronization*time − 0.03 0.03 − 0.93 .35 − 0.10 0.03 − 0.10 0.03
300 ms synchronization*time − 0.01 0.03 − 0.16 .87 − 0.07 0.06 − 0.07 0.06
Unenhanced captions*time 0.00 0.04 0.09 .93 − 0.08 0.08 − 0.07 0.09
*** p < .001.
Table B.10
Results of pairwise contrasts for reaction time (unenhanced subset). Columns: viewing condition, contrast (Time 1–2), estimate, SE, df, z, p (Bonferroni-adjusted), 95% CI [lower, upper].
500 ms synchronization 1–2 0.08 0.03 Inf 2.90 .10 − 0.01 0.17
300 ms synchronization 1–2 0.11 0.02 Inf 6.26 <.001*** 0.06 0.17
Unenhanced captions 1–2 0.09 0.02 Inf 5.17 <.001*** 0.03 0.14
Uncaptioned 1–2 0.08 0.03 Inf 2.55 .30 − 0.02 0.17
*** p < .001.
References
Alsadoon, R., & Heift, T. (2015). Textual input enhancement for vowel blindness: A study with Arabic ESL learners. The Modern Language Journal, 99(1), 57–79.
https://doi.org/10.1111/modl.12188
Bailly, G., & Barbour, W. (2011). Synchronous reading: Learning French orthography by audiovisual training. Proceedings of the Annual Conference of the International
Speech Communication Association, Interspeech, (May), 1153–1156.
Bird, S., & Williams, J. N. (2002). The effect of bimodal input on implicit and explicit memory: An investigation into the benefits of within-language subtitling. Applied
PsychoLinguistics, 23, 509–533.
Birulés-Muntané, J., & Soto-Faraco, S. (2016). Watching subtitled films can help learning foreign languages. PLoS One, 11(6), Article e0158409.
Broersma, M., & Cutler, A. (2008). Phantom word activation in L2. System, 36, 22–34. https://doi.org/10.1016/j.system.2007.11.003
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and
improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977
Charles, T., & Trenkic, D. (2015). The effect of bi-modal input presentation on second language listening: The focus on speech segmentation. In Y. Gambier, A. Caimi,
& C. Mariotti (Eds.), Subtitles and language learning (pp. 173–197) (Bern: Peter Lang).
Cobb, T. Compleat Web VP v.2.1 [computer program]. Accessed March 2019 at https://www.lextutor.ca/vp/comp/.
Conklin, K., Alotaibi, S., Pellicer-Sánchez, A., & Vilkaitė-Lozdienė, L. (2020). What eye-tracking tells us about reading-only and reading-while-listening in a first and
second language. Second Language Research, 36(3), 257–276. https://doi.org/10.1177/0267658320921496
Conklin, K., Pellicer-Sánchez, A., & Carrol, G. (2018). Eye-tracking: A guide for applied linguistics research. Cambridge: Cambridge University Press. https://doi.org/
10.1017/9781108233279
Cook, S. V., Pandža, N. B., Lancaster, A. K., & Gor, K. (2016). Fuzzy nonnative phonolexical representations lead to fuzzy form-to-meaning mappings. Frontiers in
Psychology, 7(Sep), 1–17. https://doi.org/10.3389/fpsyg.2016.01345
Darcy, I., Daidone, D., & Kojima, C. (2013). Asymmetric lexical access and fuzzy lexical representations in second language learners. The Mental Lexicon, 8(3),
372–420. https://doi.org/10.1075/ml.8.3.06dar
Darcy, I., & Holliday, J. J. (2019). Teaching an old word new tricks: Phonological updates in the L2 mental lexicon. In J. Levis, C. Nagle, & E. Todey (Eds.), Proceedings
of the 10th pronunciation in second language learning and teaching conference (pp. 10–26). Ames: Iowa State University.
Gerbier, E., Bailly, G., & Bosse, M. L. (2018). Audio–visual synchronization in reading while listening to texts: Effects on visual behavior and verbal learning. Computer
Speech & Language, 47, 74–92.
Ghia, E. (2012). Subtitling matters. New perspectives on subtitling and foreign language learning. Oxford: Peter Lang.
Godfroid, A. (2019). Eye-tracking in second language acquisition and bilingualism: A research synthesis and methodological guide. https://doi.org/10.4324/9781315775616
Han, Z., Park, E. S., & Combs, C. (2008). Textual enhancement of input: Issues and possibilities. Applied Linguistics, 29(4), 597–618. https://doi.org/10.1093/applin/
amn010
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical proficiency. EUROSLA Yearbook, 6, 147–168. https://doi.org/10.1075/eurosla.6.10har
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodology, and behavior. Frontiers in Neuroscience, 8, 150. https://doi.org/10.3389/
fnins.2014.00150
Jenkins, J. (2002). A sociolinguistically based, empirically researched pronunciation syllabus for English as an international language. Applied Linguistics, 23(1),
83–103. https://doi.org/10.1093/applin/23.1.83
Kam, E. F., Liu, Y. T., & Tseng, W. T. (2020). Effects of modality preference and working memory capacity on captioned videos in enhancing L2 listening outcomes.
ReCALL, 32(2), 213–230. https://doi.org/10.1017/S0958344020000014
Kostromitina, M., & Plonsky, L. (2021). Elicited imitation tasks as a measure of L2 proficiency: A meta-analysis. Studies in Second Language Acquisition, 1–26. https://
doi.org/10.1017/S0272263121000395
Kruger, J. L., & Steyn, F. (2014). Subtitles and eye tracking: Reading and performance. Reading Research Quarterly, 49(1), 105–120.
Lee, M., & Révész, A. (2020). Promoting grammatical development through captions and textual enhancement in multimodal input-based tasks. Studies in Second
Language Acquisition, 1–27. https://doi.org/10.1017/S0272263120000108
Leow, R. P. (2015). Explicit learning in the L2 classroom. New York: Routledge.
Levis, J. (2018). Intelligibility, oral communication, and the teaching of pronunciation. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108241564
Lindgren, E., & Muñoz, C. (2013). The influence of exposure, parents, and linguistic distance on young European learners’ foreign language comprehension.
International Journal of Multilingualism, 10(1), 105–129. https://doi.org/10.1080/14790718.2012.679275
Llompart, M., & Reinisch, E. (2018). Robustness of phonolexical representations relates to phonetic flexibility for difficult second language sound contrasts.
Bilingualism: Language and Cognition, 1, 16. https://doi.org/10.1017/S1366728918000925
Marian, V., Bartolotti, J., Chabal, S., & Shook, A. (2012). Clearpond: Cross-linguistic easy-access resource for phonological and orthographic neighborhood densities.
PLoS One, 7(8). https://doi.org/10.1371/journal.pone.0043230
Marsden, E., Mackey, A., & Plonsky, L. (2016). The IRIS Repository: Advancing research practice and methodology. In A. Mackey, & E. Marsden (Eds.), Advancing
methodology and practice: The IRIS repository of instruments for research into second languages (pp. 1–21). New York: Routledge.
Mayer, R. (2014). Cognitive theory of multimedia learning. In R. Mayer (Ed.), The Cambridge handbook of multimedia learning (Cambridge Handbooks in Psychology) (pp. 43–71). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139547369.005.
Mitterer, H., & McQueen, J. M. (2009). Foreign subtitles help but native-language subtitles harm foreign speech perception. PLoS One, 4(11), Article e7785.
Montero Pérez, M., Peters, E., & Desmet, P. (2015). Enhancing vocabulary learning through captioned video: An eye-tracking study. The Modern Language Journal, 99
(2), 308–328. https://doi.org/10.1111/modl.12215
Muñoz, C. (2008). Symmetries and asymmetries of age effects in naturalistic and instructed L2 learning. Applied Linguistics, 29(4), 578–596. https://doi.org/10.1093/
applin/amm056
Muñoz, C. (2014). Contrasting effects of starting age and input on the oral performance of foreign language learners. Applied Linguistics, 35(4), 463–482.
Ortega, L., Iwashita, N., Norris, J. M., & Rabie, S. (2002). An investigation of elicited imitation tasks in crosslinguistic SLA research (Presentation given at Second Language
Research Forum, Toronto).
Pattemore, A., & Muñoz, C. (2020). Learning L2 constructions from captioned audio-visual exposure: The effect of learner-related factors. System, 93, Article 102303.
https://doi.org/10.1016/j.system.2020.102303
Pellicer-Sánchez, A. (2015). Developing automaticity and speed of lexical access: The effects of incidental and explicit teaching approaches. Journal of Spanish
Language Teaching, 2(2), 126–139. https://doi.org/10.1080/23247797.2015.1104029
Ramus, F., Peperkamp, S., Christophe, A., Jacquemot, C., Kouider, S., & Dupoux, E. (2010). A psycholinguistic perspective on the acquisition of phonology. Laboratory Phonology 10: Variation, Phonetic Detail and Phonological Representation, 311–340.
Rodgers, M. (2013). English language learning through viewing television: An investigation of comprehension, incidental vocabulary acquisition, lexical coverage, attitudes, and
captions [Doctoral dissertation, Victoria University of Wellington] http://hdl.handle.net/10063/2870.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129–158. https://doi.org/10.1093/applin/11.2.129
Sharwood Smith, M. (1991). Speaking to many minds: On the relevance of different types of language information for the L2 learner. Second Language Research, 7(2),
118–132. https://doi.org/10.1177/026765839100700204
Sharwood Smith, M. (1993). Input enhancement in instructed SLA. Studies in Second Language Acquisition, 15(2), 165–179. https://doi.org/10.1017/
S0272263100011943
Showalter, C. (2019). Russian phonolexical acquisition and orthographic input: Naïve learners, experienced learners, and interventions. Studies in Second Language
Acquisition, 1–23. https://doi.org/10.1017/S0272263119000585
Van Zeeland, H., & Schmitt, N. (2013). Incidental vocabulary acquisition through L2 listening: A dimensions approach. System, 41(3), 609–624.
Williams, J. N., & Paciorek, A. (2016). Indirect tests of implicit linguistic knowledge. In Advancing methodology and practice: The IRIS repository of instruments for
research into second languages (pp. 23–42). Taylor and Francis. https://doi-org.sire.ub.edu/10.4324/9780203489666.
Winke, P. M. (2013). The effects of input enhancement on grammar learning and comprehension. Studies in Second Language Acquisition, 35(2), 323–352. https://doi.
org/10.1017/S0272263112000903
Wisniewska, N., & Mora, J. C. (2018). Pronunciation learning through captioned videos. In J. Levis (Ed.), Proceedings of the 9th annual pronunciation in second language
learning and teaching conference (pp. 204–215). Ames: Iowa State University.
Wisniewska, N., & Mora, J. C. (2020). Can captioned video benefit second language pronunciation? Studies in Second Language Acquisition, 42(3), 599–624.
Further reading
Bisson, M. J., Van Heuven, W. J. B., Conklin, K., & Tunney, R. J. (2014). Processing of native and foreign language subtitles in films: An eye tracking study. Applied
PsychoLinguistics, 35(2), 399–418. https://doi.org/10.1017/S0142716412000434