Article Research Hanisa Insyarani

System 116 (2023) 103078
Contents lists available at ScienceDirect
System
journal homepage: www.elsevier.com/locate/system
Audio-synchronized textual enhancement in foreign language

pronunciation learning from videos
Valeria Galimberti *, Joan C. Mora, Roger Gilabert
Department of Modern Languages and Literatures and of English Studies, Universitat de Barcelona, 585 Gran Via de Les Corts Catalanes, 08007,
Barcelona, Spain
A R T I C L E I N F O A B S T R A C T
Keywords: The benefits of multimodal input on foreign language listening comprehension and vocabulary
Multimodal input enhancement learning are well-established, but only recently has its impact on pronunciation been explored. In
Authentic L2 input this study, we audio-synchronized the highlighting of target words in captions to promote the
L2 captioned video
activation of their phonolexical representations at the time they are auditorily processed and
Audio-visual synchrony
Phonolexical representations
improve phonological updating in the mental lexicon. We recorded the eye movements of 58 L1-
L2 pronunciation Spanish/Catalan learners of English as they watched two videos with target words (TWs) high
Auditory form recognition lighted 500 ms or 300 ms before auditory onset, highlighted from caption onset or, alternatively,
Eye-tracking under one of two control conditions (unenhanced and uncaptioned). We assessed updating of
phonolexical forms in terms of more accurate and faster rejection of mispronunciations of the
TWs from pre-to post-test. Results showed that 300 ms synchronized enhancement and unsyn
chronized enhancement led to longer fixation duration, unsynchronized enhancement reduced
TW skipping probability, and both synchronized conditions promoted higher audio-visual syn
chrony in learners’ caption reading. While only the unsynchronized condition resulted in more
accurate responses at post-test, all enhancement conditions led to significantly faster rejection of
mispronunciations. These initial findings call for further research on audio-synchronized
enhancement and its potential benefits for L2 pronunciation learning.
1. Introduction
The scarcity of exposure to authentic spoken input in the foreign language (FL) classroom has a detrimental effect on the devel
opment of listening and speaking skills among second language (L2)1 learners (Muñoz, 2008, 2014). This can be partially compensated
for by carrying out leisure activities in the target language, such as watching television and movies, the preferred source of FL input
among young learners (Lindgren & Muñoz, 2013). L2 captioned video, unlike real spoken interaction, offers the possibility of having
orthographic input available as spoken input is being processed, a feature that promotes auditory word recognition, speech seg
mentation, and the mapping of L2 orthography to phonological form (Bird & Williams, 2002; Charles & Trenkic, 2015; Mitterer &
McQueen, 2009). However, second language acquisition research has only recently started to explore the potential of captioned video
* Corresponding author.
E-mail addresses: [email protected] (V. Galimberti), [email protected] (J.C. Mora), [email protected] (R. Gilabert).
1
The term foreign language (FL) learning refers to formal instruction in a language not commonly spoken in the country of the speaker. In this
paper, second language (L2) learning is used to refer to all learners of a language, including foreign language learners, when the focus is on the
learning process rather than the context.
https://doi.org/10.1016/j.system.2023.103078
Received 14 June 2022; Received in revised form 15 May 2023; Accepted 30 May 2023
Available online 16 June 2023
0346-251X/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
V. Galimberti et al. System 116 (2023) 103078
for pronunciation teaching and learning (Wisniewska & Mora, 2020), a still severely underresearched area. Important issues that
remained unanswered are whether captioned video without any kind of manipulation can draw the viewers’ attention to phonological
form and what sorts of input transformations can enhance attention to phonological form.
In this study, we investigated whether textual enhancement of words in L2 captions, synchronized with their auditory onset, directs
viewers’ attention to auditory target forms, potentially leading to the updating of phonological word forms. To this end, we manip
ulated the captions in L2 videos by highlighting a selection of English words that present pronunciation difficulties for L1 Spanish/
Catalan learners of English and are typically mispronounced. The textual highlighting was temporally synchronized with the auditory
onset of the target words so that words would highlight in yellow immediately before they were heard. This enhancement procedure
was expected to promote a comparison of the phonolexical representations of the target words, activated in the learners’ mental
lexicons through textual enhancement, with their auditory forms as produced by the speakers in the video. We analyzed learners’ eye
gaze behavior as an online measure of attention allocation during the viewing, which is considered the first step towards deeper
processing of target forms (Conklin et al., 2018; Montero Pérez et al., 2015). We used an auditory-only speeded lexical decision task to
test whether audio-synchronized textual enhancement promoted the updating of L2 phonolexical representations.
2. Background
2.1. Phonolexical representations in language processing
From a psycholinguistic point of view, the interpretation of spoken language involves several levels of processing and represen
tation (Ramus et al., 2010). As the input speech signal unfolds in time, speech sounds and their acoustic properties are encoded into
abstract language-specific sub-lexical phonological representations that provide access to the phonological representation of word
forms (phonolexical representations) stored in the mental lexicon together with their semantic and orthographic representations. For
proficient speakers, these phonetic, phonological, and semantic decoding processes are automatic, making speech perception and
auditory word recognition seemingly effortless. In a similar way, the speech production process involves the efficient operation of
semantic, phonological, and phonetic encoding mechanisms that make the selection, retrieval, and articulation of word forms auto
matic for proficient speakers. For L2 learners, however, L1-based perception and the imprecision of their phonolexical representations
make L2 speech perception and L2 word recognition effortful and inefficient, posing processing difficulties that severely hinder
automaticity in L2 speech perception and production (Darcy et al., 2013).
Cook et al. (2016) distinguish between difficulties in L2 phonological encoding, when a learner cannot distinguish two words which
differ by a single phoneme, and problems in L2 phonolexical encoding, when a word is confused with others despite not involving a
difficult phonological contrast, because its phonolexical representation is “fuzzy” or imprecise. Issues with L2 phonolexical encoding
lead to low speed and accuracy of lexical access and unstable form-to-meaning mappings, which may impair language development.
Therefore, the ability to perceive and produce L2 speech efficiently depends not only on the correct identification and categorization of
L2 sounds, but also on the extent to which learners’ phonolexical representations faithfully reflect the lexical and sub-lexical properties
of the speech input to be processed.
A growing body of research has used lexical decision tasks to test the lexical encoding of L2 phonological contrasts (e.g., Darcy
et al., 2013; Darcy & Holliday, 2019; Llompart & Reinisch, 2019). Lexical decision tasks which require learners to classify stimuli as
words or nonwords as fast as possible test learners’ speed and accuracy in auditorily recognizing L2 word forms (Harrington, 2006).
Performance in this type of speeded tasks has been shown to reflect the degree of automaticity in L2 lexical access resulting from lexical
acquisition (Williams & Paciorek, 2016). In a lexical decision task where nonwords reflect L1-based mispronunciations of real L2
words, lower speed and accuracy rates could reflect the instability of imprecise L2 phonolexical representations (Cook et al., 2016) or
the higher processing cost of the larger lexical competition associated with imprecise phonolexical representations (Broersma & Cutler,
2008). On the other hand, higher speed and accuracy rates in rejecting mispronounced forms as words would indicate higher stability
of lexical representations and higher automaticity in lexical access (Pellicer-Sánchez, 2015). In the lexical decision task used in the
current study, higher response accuracy and latency gains in rejecting the mispronounced auditory forms of the target words in the L2
captioned videos were deemed to be indicative of updatings in the mental lexicon.
2.2. Multimodal input processing and pronunciation learning
In the foreign language learning context, learners are exposed to L1-accented L2 speech from peers and teachers, which can lead to
the establishment of L1-accented phonolexical representations for L2 words (Llompart & Reinisch, 2018). Pronunciation-focused
instruction and exposure to target-like models, including L2 captioned TV shows and films containing authentic speech at a natu
rally fast pace, might be effective at making these representations more target-like (Darcy & Holliday, 2019). The presence of verbatim
captions offers processing support to language learners, allowing for the simultaneous processing and integration of visual (the text in
the captions) and auditory (the corresponding soundtrack) information (Mayer, 2014). This hypothesis is consistent with the findings
of the few studies to date which have investigated the effects of captioned video on L2 speech learning. L2 captioned video has been
shown to help listeners segment the continuous stream of speech into auditory word forms, improving auditory word recognition and
real-time L2 listening comprehension (Charles & Trenkic, 2015). In addition, L2 captioned video (unlike L1-captioned or uncaptioned
video) has been hypothesized to support lexically-guided retuning of L2 perceptual categories, as shown by the processing advantage
of advanced L2 learners when exposed to an unfamiliar regional accent in this viewing mode (Mitterer & McQueen, 2009). Even less
advanced learners have been found to incidentally develop their L2 speech perception skills when watching videos with L2 captions,
2
but not with L1 captions, confirming the language learning potential of this viewing modality (Birulés-Muntané & Soto-Faraco, 2016).
However, large individual differences exist in how learners allocate attention to visual (written) and auditory input when viewing
L2 captioned video, as reflected by learners’ differences in visual behavior when processing captioned video (Kam et al., 2020).
Reading behavior in the context of audiovisual dynamic texts (captions) is qualitatively different from reading static text in that the
processing of fleeting text occurs while auditory information from the soundtrack and on-screen movement compete for attention,
potentially interfering with reading and leading to cognitive overload as the allocation of attention switches between the captions, the
moving image, and the auditory input (Kruger & Steyn, 2014). As a result, text and auditory processing are often misaligned, probably
hindering the potential benefits of the simultaneous processing of auditory and textual input. The temporal synchronization of textual
enhancement with the auditory onset of words has the potential of guiding learners’ attention to the auditory form of words at the
moment of visual word recognition.
2.3. Multimodal input enhancement and synchronization
The use of enhancement techniques such as bold-facing or highlighting to augment the salience of linguistic features is underpinned
by Schmidt’s (1990) Noticing Hypothesis, which claimed that only the L2 input that has been noticed is converted into intake for
further processing, and Sharwood Smith’s (1991, 1993) Input Enhancement Hypothesis, which stated that learners can be guided
towards noticing a language feature by manipulating the feature’s visual salience and frequency in the input. Eye-tracking studies have
shown that textual enhancement is effective at drawing learners’ attention to form while processing meaning in the L2 input (Winke,
2013), and that increased attention on the enhanced structures may lead to better performance in language tests. For example, Lee and
Révész (2020) found gains in how accurately learners used the past simple and present perfect to be related to increased attention
allocation to form with enhanced, but not with unenhanced captions. Similarly, eye-tracking measures of fixation duration during
exposure to textually enhanced input have been found to correlate with vowel encoding accuracy gains in an orthographic form
recognition task (Alsadoon & Heift, 2015).
The benefits of input enhancement for language learning have been investigated for various grammatical and lexical aspects, but
surprisingly little research has focused on pronunciation. In a study on the influence of orthography in Russian phonolexical acqui
sition, Showalter (2019) found that textual input enhancement helped learners associate auditory forms with picture stimuli more
effectively than explicit instruction. The author suggests that textually enhancing difficult graphene-phoneme correspondences in
multimodal input may have forced learners to attend more carefully to auditory input and select important information in an attempt
to figure out phonological rules, whereas receiving the rules in advance and keeping them in mind during the listening task may have
been too taxing. Our prediction is that textual enhancement would similarly promote updating of phonolexical forms during multi
modal exposure to authentic input through L2 captioned video. Since readers tend to read ahead of the audio in the L1 and L2, both in
reading-while-listening (Conklin et al., 2020) and captioned video viewing (Wisniewska & Mora, 2018), temporally synchronizing the
highlighting of target words in captions with their auditory onset should enable learners to direct their attention to the auditory form of
words once they have been activated through their highlighted orthographic form.
Previous research on reading-while-listening (Bailly & Barbour, 2011; Gerbier et al., 2018) has shown that visual enhancement
occurring before the word auditory onset effectively promotes the simultaneous processing of orthographic and auditory word forms.
For example, Gerbier et al. (2018) found that highlighting each word in a text 300 ms before its auditory onset triggered fewer but
longer fixations per word and fewer regressive saccades, generating a more fluent reading trajectory that helped readers keep the pace
with the voice reading the text. The 300 ms time-lag interval between enhancement and auditory onset, which increased reading
fluency in Gerbier et al.’s (2018) reading-while-listening study, may be equally as effective in the context of video. However, viewers
may be mostly focused on the moving image at the center of the screen, or reading captions ahead of the audio, until the appearance of
text attracts their gaze to the enhanced word. In this case, a longer time lag of 500 ms would allow them to notice the enhanced word in
the peripheral visual field, plan and execute the saccade towards the word, which takes about 200 ms (Godfroid, 2019), and activate
the stored phonological representation before hearing the word’s auditory onset. In the current study we explore the differential effects
of two time-lag intervals (300 ms vs. 500 ms) between highlighting and auditory word onsets and predict that exposure to syn
chronized L2 captioned video will promote audio-visual synchrony, enhancing the updating of the target words’ phonolexical
representations.
3. Research questions
The present study explores the effects of audio-synchronized textual enhancement in L2 captioned videos on the processing of
auditory word forms and their phonological update, as measured by a lexical decision task requiring accurate and timely rejection of
non-targetlike auditory forms.
The study addresses the following research questions.
1. Which type of textual enhancement (audio-synchronized at 300 ms, 500 ms, unsynchronized or unenhanced) is more effective at
enhancing the simultaneous processing of target visual and auditory word forms in L2 captioned videos?
2. Does the simultaneous processing of visual and auditory word forms in L2 captioned videos promote the updating of phonolexical
representations?
Our predictions are that.
3
1. Textual enhancement right before auditory onset will more effectively enhance the simultaneous processing of visual and auditory
word forms, compared to unsynchronized enhancement and no enhancement. In particular, we expect the 500 ms time-lag interval
between enhancement and auditory onset to promote closer audio-visual synchrony than the other conditions.
2. The simultaneous processing of visual and auditory word forms generated by the synchronized enhancement of words in L2
captions will promote phonolexical update to a larger extent than exposure to captions containing unsynchronized enhancement
and no enhancement. In particular, we expect the 500 ms time-lag interval between enhancement and auditory onset to promote
larger gains in lexical decision accuracy and response times than the other conditions.
4. Methodology
4.1. Participants
Fifty-eight first-year university students (female = 51) pursuing an English degree at a public university in Spain were recruited.
They were bilingual L1 speakers of Spanish and Catalan and reported a B1–B2 English level based on language certificates and/or self-
assessment. As a measure of global L2 proficiency (Kostromitina & Plonsky, 2021), we administered an elicited imitation task (Ortega
et al., 2002). In this task, participants listened to and repeated 30 sentence items of increasing word length and structural complexity.
Following Ortega et al.’s (2002) rubric available in the IRIS digital repository (Marsden et al., 2016), sentence productions were given
0 to 4 points based on repetition accuracy to a maximum score of 120. Participants obtained a mean score of 97 out of 120 (range
64–118, SD = 13.2), indicating an upper intermediate level of proficiency. Participants were randomly assigned to four groups of
comparable proficiency (F (3, 54) = 0.981, p = .41) according to the viewing conditions described in section 4.3. Most participants
were familiar with watching English language TV and videos (an average of 5 h a week) with and without captions (see Table 1 for
participants’ demographics).
4.2. Materials
4.2.1. Clips
Four clips were selected from the first episode of the TV series ‘The Good Place’, which had been previously used in a language
acquisition study with a similar sample of learners (Pattemore & Muñoz, 2020). The genre of TV series was selected for containing
highly contextualized and predictable dialogues on familiar topics (Ghia, 2012) that allow learners to allocate attentional resources to
linguistic form. One clip (1′ 40′′ ) had unenhanced captions and contained 19 target words (TWs), followed by a sample clip (00′ 35′′ ),
not included in the analyses, used to familiarize participants in the experimental groups with caption enhancement. The last two clips
(1′ 50′′ each) contained 9 TWs each that were highlighted in yellow at different time-lags with auditory onset, depending on the
experimental condition (see section 4.3). The captions were edited with Aegisub, a software that allows for manual synchronization
thanks to an in-built spectrum analyzer, and hardcoded as one- or two-line captions in Arial font size 20. An analysis with the program
Vocabprofile Compleat (Cobb 2015) showed that the most frequent 1000-, 2000- and 3000-word families provided 90%, 95% and 96%
coverage of the script, respectively. Following Van Zeeland and Schmitt (2013) and Rodgers (2013), we expected a coverage level of
Table 1
Participants’ demographics by group.
500 ms synchronized 300 ms synchronized Unenhanced captions Uncaptioned
(N = 21) (N = 21) (N = 8) (N = 8)
M SD 95% CI M SD 95% CI M SD 95% CI M SD 95% CI
Age at testing 20.65 3.03 [19.23, 20.26 2.50 [19.09, 19.56 0.85 [18.85, 23.70 10.25 [15.13,
22.07] 21.43 20.27] 32.27]
L2 proficiency (0–120 97.80 13.47 [91.50, 98.50 13.78 [92.05, 89.50 9.55 [81.52, 97.38 15.85 [84.13,
points) 104.10] 104.95] 97.48] 110.62]
Age of onset of L2 learning 4.90 1.37 [4.26, 5.50 1.32 [4.88, 6.25 3.45 [3.36, 4.88 1.96 [3.24,
in school 5.54] 6.12] 9.14] 6.51]
Extracurricular classes 2.95 3.91 [1.12, 2.28 3.70 [0.54, 5.63 4.14 [2.17, 4.13 4.12 [0.68,
(years) 4.78] 4.01] 9.08] 7.57]
Estimated spoken L2 inputa 28.55 16.75 [20.71, 22.35 11.23 [17.09, 18.50 7.95 [11.86, 25.00 10.00 [16.64,
36.39] 27.61] 25.14] 33.36]
b
Estimated L2 output 10.90 12.48 [5.06, 10.48 11.57 [5.06, 5.75 1.98 [4.09, 12.63 12.76 [1.96,
16.74] 15.89] 7.41] 23.29]
Exposure to L2 videos and 8.27 5.99 [5.47, 5.26 3.19 [3.76, 4.21 1.56 [2.90, 5.71 3.81 [2.52,
TV (hours per week) 11.08] 6.75] 5.51] 8.89]
Self-estimated L2 6.41 1.66 [5.63, 6.49 1.42 [5.82, 6.47 0.81 [5.80, 5.40 2.54 [3.27,
proficiency (1 = very 7.19] 7.16] 7.14] 7.53]
poor – 9 = proficient)c
a
English input from L1 and L2 speakers in hours per week.
b
Oral L2 use with L1 and L2 speakers in hours per week.
c
Averaged self-estimated reading, writing, listening, speaking and pronunciation proficiency.
4
95% at 2000-word families to provide sufficient comprehension to intermediate L2 learners.
4.2.2. Target words

The selection of target words was based on a pilot survey with six L1 Spanish/Catalan speakers taking a B2 English course at a
language academy. The students recorded themselves while reading aloud the script of the clips that were later used in the inter
vention, and 40 words were selected for being more consistently mispronounced (e.g., /ϵmba′ rəsɪŋ/ for embarrassing, /aʊ′ streɪljə/ for
Australia). Three words were later eliminated from the analysis because their mispronounced versions were too similar to real words in
English (e.g., /′ noθɪŋ/ for nothing). Out of 37 TWs, a subset of 18 were textually enhanced in the clips (Table 2), whereas the remaining
TWs (control unenhanced subset) were unmodified. The five error categories identified in the final list of 37 TWs were: misplaced word
stress (n = 8), inaccurate realization of vowel (n = 9), diphthong (n = 11) or consonant sounds (n = 4), and insertion of epenthetic
vowel in regular past -ed realization (n = 5). These features are typically problematic among Spanish learners of English and may affect
mutual intelligibility, based on Levis’ (2018) selection of L2 pronunciation features for intelligibility, and Jenkins’ (2002) Lingua
Franca Core. Despite the variability naturally occurring in the materials, an attempt was made to have comparable datasets for
enhanced and unenhanced target words in terms of word class, with the majority of the words being nouns (9 in the enhanced and 7 in
the unenhanced subset, respectively), followed by verbs (4 and 8, respectively), adjectives (2 and 4), and 2 adverbs and a conjunction.
Fisher’s exact tests with Monte Carlo estimations of the p values (two-tailed) were run to determine if there were nonrandom
associations between the TW subset and the words’ presentation properties (Table 3). There was no significant difference between the
two subsets in terms of number of caption lines (1 or 2) on-screen at the time of TW presentation (p = .41), whether the TW occurred in
line 1 or 2 (p = .52), and the TW position (initial, medial or final) (p = .16). The same was true regarding the number of target words
included in each line (p = .18), which was always one for the enhanced subset of words, except for one that appeared in the same line
(although at opposite ends) as a word of the unenhanced subset. A T-test found no differences between the two subsets in presentation
time, i.e., the time-lag between the onset and the offset of the caption containing the TW (t (35) = − 1.67, p = .11). Table 3 reports the
auditory duration of the words, as well as the size and position of the rectangular areas of interest (AOIs) that were drawn manually
around each TW.
4.2.3. Lexical decision task

To avoid the interference of orthography and following previous studies on the updating of L2 phonolexical representations (e.g.,
Darcy & Holliday, 2019; Llompart & Reinisch, 2018), we used a speeded auditory lexical decision task (LDT). The task included 8
practice items and 157 test items, 40 correctly pronounced target words (e.g.,/ə′ laʊ/for allow), 37 corresponding nonwords (L1-biased
versions, e.g.,/ə′ loʊ/for allow), 40 unrelated word distractors and 40 nonword distractors. Word distractors did not appear in any of
the video clips and were chosen to match the same orthographic and phonological length and lexical frequency as TWs. Nonword
distractors were selected from the CLEARPOND database (Marian et al., 2012) so that each distractor would match the orthographic
and phonological length and neighborhood size of a TW. An ANOVA found no significant differences between real and nonword
distractors and TWs in orthographical length (F (3, 116) = 0.385, p = .76) and phonological length (F (3, 116) = 0.524, p = .66). An
English L1 speaker of the same variety of American English spoken by the characters in the clips recorded all stimuli twice in a quiet
room. One recording of each item was saved into a separate sound file and normalized for amplitude using the software Praat (version
6.0.13). The stimuli were presented randomly and once only, with an inter-stimulus-interval of 2000 ms and a time out of 2500 ms via
the software DMDX 6.0.0.1. Participants were instructed to press a key with their right index finger when they believed the stimulus
they heard was an English word, and to press another key with their left index finger when they believed the stimulus was not an
Table 2
Linguistic properties of the target words in the enhanced subset.
Word class Orthographic length Phonological lengtha Occurrences in clips Lexical frequencyb Error category
Actually adverb 8 6 2 322.33 consonant

Adorable adjective 8 8 1 10.53 stress
Arizona noun 7 7 3 11.06 vowel
Basically adverb 9 7 1 26.02 diphthong
Clown noun 5 4 4 15.82 diphthong
Happened verb 8 6 1 490.08 past -ed
Interior noun 8 8 1 5.24 vowel
Language noun 8 7 1 35.1 vowel
Lawyer noun 6 3 3 79.51 diphthong
Mission noun 7 5 1 47.06 consonant
Nigeria noun 7 6 1 0.71 diphthong
Overwhelming adjective 12 9 1 4.92 vowel
Phoenix noun 7 6 2 10.88 vowel
Promise verb 7 6 2 153.12 diphthong
Pursuit noun 7 6 1 7.04 stress
Rescued verb 7 7 1 5.41 stress
Review verb 6 5 1 14.8 stress
Whereas conjunction 7 5 1 3.55 stress
a
IPA notation system for American English.
b
Frequency per million words in the SUBTLEXUS database (Brysbaert & New, 2009).
5
Table 3
Presentation properties of the target words in the enhanced subset.
Caption lines Line with TW TW position Presentation time (ms) AOI positiona (px) AOI size (px) Auditory duration (ms)
Adorable 2 1 Medial 3220 481, 593 184 × 52 580

Basically 2 2 Initial 3820 375, 644 184 × 55 560
Clown 2 2 Medial 4700 602, 647 145 × 50 500
Happened 2 2 Medial 2650 470, 648 207 × 52 350
Interior 2 2 Medial 3100 401, 646 148 × 51 620
Lawyer 2 1 Medial 4500 593, 597 141 × 53 470
Mission 2 2 Initial 3000 409, 648 162 × 50 410
Review 2 1 Medial 2650 549, 594 141 × 52 380
Whereas 2 2 Initial 3220 415, 645 186 × 53 390
Actually 1 1 Medial 2500 415, 643 164 × 55 460
Arizona 2 2 Initial 1600 369, 647 167 × 50 640
Language 2 1 Medial 2250 435, 594 196 × 54 350
Nigeria 2 1 Final 3420 746, 592 151 × 55 670
Overwhelming 2 2 Final 4960 524, 646 296 × 55 680
Phoenix 1 1 Medial 3650 684, 642 174 × 55 590
Promise 2 1 Medial 4950 512, 596 171 × 55 420
Pursuit 2 1 Final 3400 728, 595 142 × 51 520
Rescued 2 2 Medial 3260 505, 647 171 × 50 530
a
Horizontal and vertical coordinates (respectively) of the area of interest containing the target word, with reference to the upper-left corner of the
screen.
English word. Two L1 speakers of English (different from the one who recorded the stimuli) achieved ceiling performance on the task
(92% and 94%). Nonword rejection rates based on the accuracy and response latencies of correctly identifying the mispronounced TWs
as nonwords were used as dependent measures.
4.3. Procedure
All testing and exposure took place individually and in one session of approximately 1 h. Upon entering the research laboratory,
each participant signed a consent form and was randomly assigned to one of the viewing conditions. Participants did the lexical
decision task, then watched the clips as their eye-movements were recorded with a Tobii T120 eye-tracker integrated into a 17′′
monitor, which has a sampling rate of 120 Hz, an accuracy of 0.5◦ and 0.2◦ resolution. With the participant seated at a distance
between 60 and 64 cm from the screen, Tobii T120 was expected to appropriately keep track of fixations within the study’s areas of
interest, as AOI height (50–55 pixel) subtended a visual angle of ~2◦ , and their width was larger. Before the viewing, a 9-point
calibration and validation procedure was performed. After the first clip with unenhanced captions and the sample clip with target
words highlighted at different time intervals, participants in the experimental groups watched one clip under one of the synchronized
conditions (500 ms or 300 ms), and the other clip with TWs highlighting at caption onset (henceforth “unsynchronized enhancement”)
as a within-subject control condition (Fig. 1). To keep the focus on meaning, a multiple choice comprehension question was included
after each clip, and participants were not told that they would repeat the LDT as a post-test. After the LDT post-test, they did the elicited
imitation task. The participants were not informed of the study aim until the end of the session.
Fig. 1. Viewing phase flowchart.
6
4.4. Analyses
The eye-tracking data was extracted from Tobii Studio using the I-VT fixation filter in Tobii Pro. Less than 75% of the data was
available for three participants, so their eye-tracking data were excluded from the analyses. A fourth participant who did not have
fixations in the caption area was excluded. On average, 92.6% of eye-tracking data was available from the participants included in the
analyses (n = 46). Following Godfroid (2019), fixations shorter than 50 ms and longer than 800 ms were removed from the analysis of
fixation duration and fixation distance (but not skipping probability), causing further exclusion of 3.5% of the fixations recorded and
leaving 1043 fixations on a total of 1702 fixated and skipped items. The statistical models were built in RStudio using the glmer and
lmer functions of the lme4 package, and Bonferroni-adjusted significance tests for pairwise contrasts were obtained using the lsmeans
function of the emmeans package. The package performance was used to assess model performance and obtain effect sizes (marginal and
conditional R-squared values) for linear mixed models. The function r.squaredGLMM in the package MuMIn was used to obtain
pseudo-R-squared values based on the delta method for generalized linear mixed models. Non-parametric bootstrapping with
replacement (n = 1e4 simulations) was used to calculate basic confidence intervals from the empirical distribution of the parameter
estimate and independently from model assumptions. Since the assumptions underlying the computation of the asymptotic 95% CIs
(and therefore of the p values) did not hold for some of the models, we decided to use bootstrapping on the regression coefficients of all
the models to provide theoretically valid estimates of the true population parameter, even when there was no evidence against the
validity of the model assumptions.
We first analyzed the subset of enhanced target words. A mixed effects gamma regression with a log-link function was used to
compare the effects of viewing condition on total fixation duration, the sum of the duration for all fixations within an AOI. The effects
of viewing condition on skipping probability (the proportion of unfixated words relative to the total number of words in the subset),
were analyzed with a fixed effect logistic regression based on a binomial distribution and logit link function. As an indicator of the
degree of synchronization between visual and auditory input processing, we built a linear mixed model to assess the effects of viewing
condition on fixation distance, which was defined as the difference between the timestamp of the first fixation on a target word in the
caption and the onset of its auditory form in the soundtrack (Wisniewska & Mora, 2018). First fixations that did not happen within
3000 ms of word auditory onset (n = 12) were replaced by system missing values in the analysis of fixation distance. To control for
possible confounds, all eye-tracking models were run including Presentation Time (see section 4.2.2.) and Frequency of Occurrence (in
the four clips) as fixed effects besides Viewing Condition, and random intercepts for participants and items. If the effects of a covariate
did not reach significance, the model was re-run excluding the covariate.
The lexical decision task analysis was carried out on the responses of all 58 participants. For the subset of enhanced nonwords, a
logistic mixed model based on a binomial distribution and logit link function was run to compare the effects of viewing condition and
testing time (T1-T2) on participants’ accuracy in the LDT. The accuracy gains reported in the descriptive tables were computed by
assigning a gain score for each item eliciting an inaccurate response at T1 and an accurate response at T2. A mixed effects gamma
regression was run to measure the effects of viewing condition and testing time on reaction times (RTs). Both the model targeting
accuracy as a dependent variable and the model targeting reaction times were run including Viewing Condition, Time and the interaction
of Viewing Condition * Time as fixed effects, and random intercepts for participants and items. RT gains were obtained for items eliciting
accurate responses at T2 only, by subtracting each absolute RT at T1 from the correspondent one at T2 and discarding negative gains.
The same models described for the subset of enhanced words were run on the control subset of unenhanced words, with Group (4-
level variable) as a fixed effect instead of Viewing Condition (5-level variable), because words in this subset were watched under the
same unenhanced condition by all participants. This was done to check whether the participants belonging to each group exhibited, in
the absence of enhancement, a different behavior from the other groups (e.g., skipped more words or fixated on words for longer).
Additionally, we ran Pearson or Kendall tau correlations (for continuous and categorical variables, respectively) between participants’
proficiency and their eye gaze behavior, and between proficiency and LDT gains.
5. Results
5.1. Simultaneous visual and auditory processing
RQ1. investigated the effectiveness of different types of textual enhancement to enhance the simultaneous processing of target visual
and auditory word forms in L2 captioned videos.
5.1.1. Preliminary analysis

No areas of interest attracted any fixations under the uncaptioned condition, confirming that, under the captioned conditions,
learners’ attention had been attracted to the areas of interest by the presence of enhanced or unenhanced captions only. Responses to
the comprehension questions as an indicator of the participants’ understanding of the clips were 87% correct on average, indicating
that overall participants were focusing on meaning.
5.1.2. Total fixation duration

Table A.1 in appendix A reports the total fixation duration descriptive data for the enhanced subset of target words by enhancement
condition. Although there was no effect of Viewing Condition or Presentation Time on total fixation duration based on asymptotic CIs, the
bootstrapped confidence intervals did not contain 0 for Presentation Time, as well as the 300 ms synchronized and the unsynchronized
7
enhancement conditions, indicating that these variables had a significant and positive effect on total fixation duration (Table 4).
Pairwise contrasts revealed no difference between any of the conditions (table A.2).
The gamma regression on the control subset of unenhanced words, for which we used Group as a predictor instead of Viewing
Condition because participants in all groups watched these words under unenhanced condition, showed that Presentation Time had a
significant effect on total fixation duration, but Group and Frequency of Occurrence did not (table A.3). Pairwise contrasts on the
unenhanced subset revealed no difference between any of the groups.
5.1.3. Skipping probability

The skipping probability data for the enhanced subset of target words by enhancement condition is available in appendix table A.4.
The logistic mixed-effects model on the subset of enhanced target words yielded a significant effect of Viewing Condition on skipping
probability for the unsynchronized enhancement condition (p = .04), with the unenhanced condition as baseline of the fixed effect
(Table 5). Bootstrapping confirmed that the unsynchronized enhancement condition had a significant and positive effect on skipping
probability (i.e., less skipping compared to baseline). Pairwise contrasts revealed no difference between any of the conditions
(table A.5). There was no effect of Presentation Time or Frequency of Occurrence on skipping probability.
The logistic regression on the unenhanced subset of words yielded a significant effect of Presentation Time on skipping probability,
but not of Frequency of Occurrence and, according to asymptotic confidence intervals, of Group (table A.6). However, the bootstrapped
confidence intervals pointed at a significant effect for the 500 ms synchronized group, i.e., participants in this group may have been
generally more likely to skip words. Pairwise contrasts showed no significant between group differences (ps = 1.00).
5.1.4. Fixation distance

Due to the way fixation distance was computed, positive values indicate pre-fixation on the target words, meaning target words
were generally pre-fixated (see table A.7 and Fig. 2). Regarding the enhanced subset of words, the linear regression performed with the
unenhanced condition as baseline did not find significant effects of Viewing Condition on fixation distance. However, when the 500 ms
synchronized enhancement condition was used as the reference level of the predictor (Table 6), the model yielded a significant effect of
the 300 ms synchronized enhancement and the unsynchronized enhancement condition on fixation distance (p = .02 and p < .001,
respectively), as well as of Presentation Time (p = .05). Pairwise contrasts (table A.8) indicated that both the 500 ms and 300 ms
synchronized conditions were associated with a smaller fixation distance than the unsynchronized condition (p < .001 and p = .03,
respectively). There was no effect of Frequency of Occurrence on fixation distance. The linear model on the unenhanced subset of words
yielded a significant effect of Presentation Time on fixation distance, but not of Frequency of Occurrence or Group (table A.9), and
pairwise contrasts revealed no between group differences (ps = 1.00).
5.1.5. Effects of proficiency on gaze behavior

There were no significant correlations between participants’ proficiency and total fixation duration (r (1702) = − 0.02, p = .35),
and skipping probability (τb (1702) = − 0.01, p = .58). However, the significant correlation found for fixation distance (r (990) =
− 0.07, p = .02) warranted further consideration. We split the dataset by word subset and found that the correlation between profi
ciency and fixation distance was significant for the unenhanced control subset (r (393) = 0.15, p = .002), meaning these words were
fixated earlier by more proficient participants, but this was not the case for the enhanced target subset (r (597) = 0.01, p = .99). Further,
for the enhanced subset, proficiency and fixation distance were actually correlated for words watched under the unenhanced condition
(r (144) = 0.37, p = .001) but not under the 500 ms, 300 ms or unsynchronized condition (p = .16, p = .31 and p = .34, respectively),
indicating that textual enhancement may have mitigated the effects of proficiency on fixation distance.
5.1.6. Summary of RQ1 results

To sum up, the findings suggest that the 300 ms and unsynchronized enhancement conditions led to longer total fixation duration
on the target words compared to the unenhanced condition. However, significance should be interpreted with caution, in light of the
inconsistencies between the asymptotic and bootstrap confidence intervals, and on the small proportion of variance (R2m and R2c)
explained by the predictor variables. The unsynchronized enhancement condition was also the only condition associated with a lower
skipping rate. However, under the unsynchronized enhancement condition, the gap between first fixation and auditory onset was
larger than in the other enhanced conditions, therefore both synchronized enhancement conditions seemed to promote stricter audio-
Table 4
Fixed coefficients for the model examining total fixation duration on the enhanced subset of TWs.
95% Confidence Intervals
Asymptotic Bootstrapped
Estimate SE z p R2m R2c Lower Upper Lower Upper
Intercept 5.72 0.12 46.16 <.001a 0.04 0.22 5.47 5.96 5.63 5.84
500 ms synchronization 0.03 0.13 0.26 .79 − 0.21 0.28 − 0.10 0.16
300 ms synchronization 0.14 0.13 1.07 .29 − 0.11 0.38 0.00 0.26
Unsynchronized enhancement 0.13 0.12 1.10 .27 − 0.10 0.37 0.01 0.24
Presentation time 0.10 0.06 1.81 .07 − 0.01 0.21 0.07 0.14
a
p < .001.
8
Table 5
Fixed coefficients for the logistic regression examining skipping probability (enhanced subset).
Intercept − 0.84 0.46 − 1.81 .07 0.03 0.32 − 1.74 0.07 − 1.15 − 0.29
500 ms synchronization − 0.58 0.54 − 1.07 .29 − 1.64 0.48 − 1.14 0.16
Unsynchronized enhancement − 1.04 0.52 − 2.02 .04a − 2.06 − 0.03 − 1.46 − 0.37
a
p < .05.
Fig. 2. Distribution of pre-fixations (positive values) and post-fixations in relation to word auditory onset.
Table 6
Fixed coefficients for the linear mixed model examining fixation distance.
Estimate SE df t value p R2m R2c Lower Upper Lower Upper
Intercept 93.74 131.71 39.69 0.71 0.48 0.10 0.55 − 164.42 351.90 − 33.33 223.94
300 ms synchronization 220.27 93.38 588.20 2.36 .02* 37.25 403.30 31.73 398.69
Unsynchronized enhancement 410.47 67.89 599.81 6.05 <.001*** 277.40 543.50 266.09 554.42
Unenhanced captions 159.29 174.16 51.02 0.91 .37 − 182.06 500.60 − 30.45 332.04
Presentation time 221.74 102.94 15.96 2.15 .05* 19.98 423.50 165.69 277.96
***p < .001, *p < .05.
visual synchronization compared to the unsynchronized condition, although not compared to the unenhanced condition. In addition,
the participants’ proficiency had a significant effect on fixation distance only under the unenhanced condition. Presentation time
affected total fixation duration and fixation duration, but not skipping probability (for the enhanced subset of TWs), whereas frequency
of occurrence did not have a significant effect on eye gaze behavior. The analysis of the subset of words that were watched under
enhanced condition by all participants showed no between group differences for total fixation duration and fixation distance, con
firming that the results obtained for the subset of enhanced TWs depended on the enhancement condition. However, a higher skipping
probability was found for the 500 ms synchronized group on the unenhanced subset of words, indicating that participants in this group
may have naturally skipped more TWs than the others, regardless of enhancement.
5.2. Phonolexical update
RQ2. asked whether the simultaneous processing of visual and auditory word forms in L2 captioned videos promoted the updating of
9
phonolexical representations.
5.2.1. Preliminary analysis

Participants’ knowledge of the meaning of target words was not tested before the intervention to avoid interference with the re
sults, as all testing was administered in one sitting. However, accurate recognition of the correctly pronounced version of each word at
pre-test indicated that, regardless of any uncertainty regarding the pronunciation of the target words, our participants were familiar
with each item (see table B.1 in Appendix B), to the extent that the lexical decision response accuracy for correctly pronounced words
reached ceiling for the enhanced (M = 93%, SD = 26) and unenhanced subset (M = 90%, SD = 30). The very high accuracy on correctly
pronounced target words was expected and did not invalidate the gains obtained for target nonwords, because L2 speakers who accept a
mispronounced version of a word are believed to have stored “unstable” representations of those words, regardless of whether they can
recognize the correctly pronounced version or not (Cook et al., 2016). Therefore, accepting accurately pronounced words with a 90%
accuracy rate is no guarantee that incorrect mispronunciations will also be rejected (as confirmed by the low accuracy rates reported in
section 5.2.2). The average reaction time gain in responding to distractor items, obtained as an indicator of test practice effect, was
230.88 ms (SD = 209.42, 95% CI [224.58, 237.19]).
5.2.2. Accuracy rate in the lexical decision task

Table B.2 reports the accuracy descriptive data for the enhanced subset of target words by time and word enhancement condition.
The logistic regression failed to converge when the baseline was the uncaptioned condition, but not when it was the 500 ms synchronized
condition, therefore the latter model is reported (table B.3). There was a significant effect of Time (p = .002) but the effect of none of the
conditions and Time × Condition interactions reached significance either considering asymptotic or bootstrapped confidence intervals.
Based on the pairwise comparisons, only the unsynchronized condition achieved significant accuracy gains (table B.4). There was a
significant effect of Time (p < .001) on accuracy for the unenhanced subset of words, but the model yielded no effect of Group or the
interaction between Group and Time (table B.5). T1-T2 pairwise contrasts for the unenhanced subset only found significant gains for the
500 ms synchronized group (table B.6).
5.2.3. Reaction times in the lexical decision task

Descriptives for the enhanced subset are available in the appendix (table B.7). Fig. 3 reports the average RTs and 95% confidence
intervals by condition and testing time, with the minimum set at 1000 ms for clarity. The gamma regression showed an effect of Time (p
= .04) and of the interaction between the 500 ms condition and Time (p = .003) on response times to words in the enhanced subset, and
these results were confirmed through bootstrapping (Table 7). Pairwise contrasts showed that only the enhancement conditions led to
significant RT gains, with ps < .001 (table B.8). The analysis of RT for the unenhanced word subset (table B.9) found no effect of Group
or the interaction between Group and Time, but there was a significant effect of Time (p < .001). T1-T2 pairwise comparisons for the
unenhanced subset found significant RT gains for the 300 ms group and unenhanced group only (table B.10).
5.2.4. Effects of proficiency on LDT gains

There were no significant correlations between participants’ proficiency and accuracy gains (τb (4292) = − 0.01, p = .37) or re
action times gains (r (1386) = − 0.02, p = .38).
5.2.5. Summary of RQ2 results

To sum up, only the unsynchronized condition led to significant accuracy gains in participants’ rejection of mispronounced target
words, and only the viewing conditions involving enhancement were associated with significant gains in reaction time to the mis
pronounced TWs, suggesting that lexical access was facilitated by the previous exposure to enhanced captioned video. However, the
strong effect of time (for both the enhanced TWs and the control subset of unenhanced words) was consistent with the possibility of a
practice effect, and the significant gains achieved by some of the groups on the unenhanced subset represent a confound.
Fig. 3. Average reaction times by time by condition.
10
Table 7
Fixed coefficients for the fixed effects gamma regression examining reaction time.
Intercept 7.45 0.09 85.04 <.001*** 0.10 0.32 7.28 7.62 7.35 7.55
Unsynchronized enhancement − 0.01 0.09 − 0.10 .92 − 0.19 0.17 − 0.13 0.12
Unenhanced captions 0.00 0.12 − 0.02 .98 − 0.23 0.22 − 0.15 0.13
Time − 0.06 0.03 − 2.08 .04* − 0.12 0.00 − 0.12 − 0.01
500 ms synchronization*time − 0.12 0.04 − 2.99 .003** − 0.20 − 0.04 − 0.20 − 0.04
300 ms synchronization*time − 0.05 0.04 − 1.27 .20 − 0.13 0.03 − 0.12 0.03
Unsynchronized enhancement*time − 0.03 0.03 − 0.98 .33 − 0.10 0.03 − 0.10 0.03
Unenhanced captions*time 0.04 0.04 0.87 .39 − 0.05 0.12 − 0.04 0.13
***p < .001, **p < .01, *p < .05.
6. Discussion
This study explored the effects of viewing L2 videos with synchronized and unsynchronized caption enhancement, unenhanced
captions, or no captions on eye-gaze behavior and phonolexical update. Skipping probability, a measure of whether any visual
attention was paid to the area of interest, and total fixation duration, a measure of initial word retrieval and subsequent integration
with the context (Godfroid, 2019) were used, as in other studies on input enhancement (e.g., Montero Pérez et al., 2015; Winke, 2013),
to assess attention allocation. Fixation distance was used as an indicator of the degree of learners’ synchronization between the visual
and the auditory processing of the target words. To assess the effect of synchronized textual input enhancement on the update of
phonolexical representations, we tested learners’ ability to recognize mispronunciations of the TWs in an auditory lexical decision task
before and after viewing the captioned clips. Since reducing the activation of non-targetlike lexical representations facilitates lexical
access (Darcy et al., 2013), learners’ accuracy and reaction times in response to mispronunciations of the TWs were analyzed as an
indicator of phonolexical update. We discuss our two research questions in light of the results obtained.
Our first research question asked which type of textual enhancement (audio-synchronized at 300 ms, 500 ms, unsynchronized or
unenhanced) is more effective at enhancing the simultaneous processing of target visual and auditory word forms in L2 captioned
videos. The 300 ms synchronized and the unsynchronized enhancement conditions were associated with longer total fixation duration
and the unsynchronized enhancement condition with a lower skipping probability, a finding in line with previous research on the use
of input enhancement to draw learners’ attention to specific L2 words and constructions during exposure to multimodal input
(Alsadoon & Heift, 2015; Lee & Révész, 2020). However, the robustness of the results for total fixation duration is weakened by the
discrepancy between the bootstrap and asymptotic confidence intervals and the small amount of variance explained by viewing
condition in the model. In addition, since the 500 ms group skipped words more frequently at baseline (unenhanced subset of words), a
different behavior from the other groups could not be excluded in the analysis of skipping probability data for the enhanced subset of
TWs. Overall, assuming participants paid more attention to the areas they actually fixated and those they fixated for longer (Leow,
2015), we would expect deeper processing of the TWs under unsynchronized enhancement.
However, a greater distance between first fixation on a word and its auditory onset was found under unsynchronized enhancement
than under the synchronized enhancement conditions, regardless of participants’ proficiency. This finding suggests that, under the
unsynchronized enhancement condition, participants looked at TWs as soon as the caption appeared and then either went back to
viewing (the image) while listening, which would not result in audio-visual synchrony, or went back to reading the caption after
having looked at the TWs way earlier than their auditory onset. It is impossible to exclude that they would have managed to read the
rest of the caption (including the TW) in synchrony with its auditory onset. However, due to the dynamic nature of captions, which
remained onscreen for only about 3000 ms on average, and the short duration of the auditory form of the TWs (m = 506.66 ms, 95% CI
[445.34, 559.36]), the effort required to saccade towards the enhanced word, process it, and go back to reading the caption is unlikely
to have promoted audio-visual synchrony.
While unsynchronized textual enhancement provided the advantage of attracting the viewers’ gaze to a larger number of target
words and increasing attention in terms of total duration of fixations on these words, the synchronization of textual enhancement (both
300 ms and 500 ms before auditory onset) seemed to be more effective in terms of audio-visual synchrony in the allocation of attention
to the TWs. The timely visual processing of the orthographic form of the TWs may have generated the activation of correspondent
representations immediately before hearing their spoken form, allowing for a comparison with stored phonolexical representations.
Increased visual attention and audio-visual synchrony were expected to provide an advantage in the lexical decision post-test, as
learners moved from input processing to intake processing. This stage of learning involves hypothesis testing about language properties
and generates an initial product, held in working memory and mainly accessible via receptive testing, that can be later incorporated
into the learner’s internal system (Leow, 2015).
Our second research question asked whether the simultaneous processing of visual and auditory word forms in L2 captioned videos
promotes the updating of phonolexical representations. The significant accuracy gains of the unsynchronized condition, combined with
the lower skipping rate and longer fixation duration on the TWs, provide support to the hypothesis that highlighting target words from
11
the caption onset may direct the viewer’s attention to their auditory form, promoting phonological development. However, the validity
of this finding is limited by the strong effect of time, consistent with a practice effect, and by the significant accuracy gains of the
participants in the 500 ms synchronized group in response to the unenhanced subset of words. The analysis of reaction times showed
that only the viewing conditions involving caption enhancement promoted faster rejection of mispronunciations from pre-to post-test.
Together with the smaller fixation distance observed in the analysis of the eye-tracking data and, for the 300 ms condition, for the
longer total fixation duration on the TWs, the LDT findings provide partial support to our hypothesis that synchronized enhancement
would promote a comparison of the target realization of L2 words with the learners’ stored representations, leading to their update.
The unsynchronized enhancement condition also led to significant reaction time gains, in line with the assumption that less skipping
and longer fixations on the orthographic form of words would increase attention to the correspondent auditory forms.
The lack of reaction times gains on the enhanced subset of words by the unenhanced and uncaptioned control conditions points at a
‘speed–accuracy trade-off’ affecting decision-taking under time pressure (Heitz, 2014). In other words, participants who had watched
the L2 videos with unenhanced captions and no captions may have resorted to more explicit and less easily accessible knowledge, at the
cost of automaticity (Williams & Paciorek, 2016). The significant reaction time gains of the 300 ms group and the unenhanced group on
the unenhanced subset of words represent a confound, as the former group may have employed a different response strategy that
prioritized speed over accuracy compared to the other three groups. For the unenhanced group, it appears that in the absence of
enhancement directing their attention to the target words, participants may have naturally focused on the unenhanced subset of words
during exposure to the video, resulting in quicker reaction times to those words compared to the other groups.
7. Conclusion and limitations
Overall, this study has found evidence that exposure to multimodal input impacts pronunciation positively, and that pronunciation
learning from L2 video can be further stimulated by a planned form-focused intervention. Our results provide initial support for the use
of synchronized and unsynchronized caption enhancement to direct learners’ attention to auditory forms and promote the update of L2
phonolexical representations. In line with our predictions, synchronized textual enhancement led to greater audio-visual synchrony
during the viewing of L2 video than unsynchronized enhancement, and promoted significant gains in word recognition response times,
an indication of phonolexical update. The enhancement of TWs from the onset of the caption line attracted long fixations, reduced
skipping probability and was also associated with significant gains in word recognition accuracy and response times, although this
condition seemed to negatively impact audio-visual synchrony during the viewing.
The lack of a significant language learning advantage of the audio-visually synchronized conditions compared to unsynchronized
enhancement contradicts our hypotheses and represents a limitation of this study. Another limitation is the significant effect of time in
the analysis of accuracy and reaction time data for the unenhanced subset of words, which indicates that practice effect might
represent a confound. In addition, the small number of participants in the unenhanced and uncaptioned groups may have affected the
results, despite the large number of test items and the inclusion of random effects in the statistical models. The short average length of
watching materials (less than 2 min per clip) represents another limitation, as the eye-tracking data collected during this short span
might not faithfully represent learners’ natural viewing behavior. Although a sampling rate of 120 Hz (refresh rate of ~8.33ms) was
deemed appropriate to analyze fixations ranging between 50 ms and 800 ms, the relatively low sampling rate of the eye-tracker may
not have provided enough resolution in the context of viewing, where the text appears onscreen for a short time (but see Lee & Révész,
2020 for a discussion of this limitation). Finally, it would have been useful to collect stimulated recall data, in order to investigate
participants’ level of processing of enhanced words and of the correspondent auditory forms. To address these limitations, a follow up
study should involve a larger number of participants, distributed into balanced groups of equivalent size, and a longer treatment, with
longer time spans between pre- and post-test. Future studies may also combine synchronized enhancement with explicit instruction,
which has been found to promote larger gains than incidental exposure to enhanced input (Han et al., 2008).
Author contributions
The first author designed the study, collected the data and drafted the manuscript. The second author provided support on the
conceptualization and design of the study, data collection and analysis, and revised the manuscript. The third author provided support
on research design and data analysis and revised the manuscript.
Informed consent
The participants signed informed consent forms before data collection.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Acknowledgments
This work was supported by the Spanish Ministry of Science, Innovation and Universities [grant PID 2019-107814 GB-I00] and by
12
the Secretary of Universities and Research of the Government of Catalonia and the European Social Fund [grant FI_2019]. We would
like to thank Giacomo Bizzarrini for their assistance with statistical analyses during the revision of this manuscript.
Appendix A. Research question 1 models
Table A.1
Total fixation duration by viewing condition (enhanced subset).
N M (ms) SD (ms) Lower Upper
500 ms synchronization 129 338.14 193.35 304.46 371.82

Unsynchronized enhancement 281 377.39 195.81 354.40 400.38
Unenhanced captions 94 344.70 215.61 300.54 388.86
Table A.2
Results of pairwise contrasts for total fixation duration (enhanced target word subset).
Contrast estimate SE z ratio p Lower Upper
500 ms synchronization - 300 ms synchronization − 0.10 0.08 − 1.36 1.00 − 0.30 0.10
500 ms synchronization - unsynchronized enhancement − 0.10 0.06 − 1.79 0.44 − 0.24 0.05
300 ms synchronization - unsynchronized enhancement 0.00 0.06 0.06 1.00 − 0.14 0.15
Unenhanced captions - 500 ms synchronization − 0.03 0.13 − 0.26 1.00 − 0.37 0.30
Unenhanced captions - 300 ms synchronization − 0.14 0.13 − 1.07 1.00 − 0.47 0.20
Unenhanced captions - Unsynchronized enhancement − 0.13 0.12 − 1.10 1.00 − 0.45 0.19
Table A.3
Fixed coefficients for the gamma regression examining total fixation duration (unenhanced subset).
Intercept 5.63 0.12 46.08 <.001*** 0.10 0.29 5.39 5.87 5.51 5.79
Presentation time 0.19 0.08 2.46 .01* 0.04 0.35 0.12 0.27
***p < .001, *p < .05.
Table A.4
Skipping probability by viewing condition (enhanced subset).
N M SD Lower Upper
500 ms synchronization 171 24.56% 43.17 18.04 31.08

300 ms synchronization 171 25.15% 43.51 18.58 31.71
Unsynchronized enhancement 342 17.84% 38.34 13.76 21.91
Unenhanced captions 144 34.72% 47.77 26.85 42.59
Table A.5
Results of pairwise contrasts for skipping probability (enhanced subset).
Contrast estimate SE z ratio p Lower Upper
500 ms synchronization - 300 ms synchronization − 0.04 0.36 − 0.11 1.00 − 0.98 0.91
500 ms synchronization - unsynchronized enhancement 0.47 0.27 1.74 .49 − 0.24 1.17
300 ms synchronization - unsynchronized enhancement 0.50 0.27 1.84 .39 − 0.22 1.23
Unenhanced captions - 500 ms synchronization 0.58 0.54 1.07 1.00 − 0.85 2.01
Unenhanced captions - 300 ms synchronization 0.54 0.54 0.99 1.00 − 0.89 1.97
Unenhanced captions - Unsynchronized enhancement 1.04 0.52 2.02 .26 − 0.32 2.41
13
Table A.6
Fixed coefficients for the logistic regression examining skipping probability (unenhanced subset).
Intercept − 0.11 0.54 − 0.20 .84 0.06 0.42 − 1.17 0.95 − 0.52 0.29
500 ms synchronization 0.53 0.65 0.82 .41 − 0.74 1.80 0.03 0.97
Presentation time − 0.57 0.09 − 6.69 <.001*** − 0.74 − 0.40 − 0.71 − 0.33
***
p < .001.
Table A.7
Fixation distance by viewing condition (enhanced subset).
N Mdn1 (ms) IQR2 (ms) Lower Upper

Unsynchronized enhancement 277 368.80 831.90 419.81 609.33
Unenhanced captions 93 307.60 687.60 187.79 506.14
1
Since the standard deviation was larger than the mean values, we report median and IQR instead.
2
Interquartile range, the range of values that lies between the upper quartile and lower quartile.
Table A.8
Results for pairwise contrasts for fixation distance (enhanced subset).
Contrast estimate SE df t ratio p Lower Upper
500 ms synchronization - 300 ms synchronization − 220.00 94 590 − 2.35 .12 − 469.00 28.20
500 ms synchronization - unsynchronized enhancement − 410.00 68 600 − 6.03 <.001*** − 591.00 − 230.30
500 ms synchronization - unenhanced captions − 159.00 174 55 − 0.91 1.00 − 636.00 317.80
300 ms synchronization - unsynchronized enhancement − 190.00 69 600 − 2.77 .03* − 372.00 − 8.60
300 ms synchronization - unenhanced captions 61.00 175 55 0.35 1.00 − 417.00 538.50
unsynchronized enhancement - unenhanced captions 251.00 168 48 1.50 .84 − 210.00 712.40
***p < .001, *p < .05.
Table A.9
Fixed coefficients for the linear model examining fixation distance (unenhanced subset).
Estimate SE df t p R2m R2c Lower Upper Lower Upper
Intercept 17.03 140.83 30.9 0.12 .91 0.11 0.56 − 258.98 293.00 − 87.32 144.60
500 ms synchronization 9.24 98.01 30.28 0.09 .93 − 182.85 201.30 − 154.03 131.70
300 ms synchronization 25.81 97.29 29.88 0.27 .79 − 164.86 216.50 − 120.70 175.50
Presentation time 272.01 119.78 17.73 2.27 .04* 37.25 506.80 217.86 326.60
*
p < .05.
Appendix B. Research question 2 data and models
Table B.1
Average scores at time 1 for accurately pronounced target words.
Item Subset Mean Std. Deviation
Allow Unenhanced .93 .256

Area Unenhanced .90 .307
Come Unenhanced .91 .283
Control Unenhanced .98 .131
Earth Unenhanced .91 .283
Embarrassing Unenhanced .90 .307
Ended Unenhanced .95 .223
Existence Unenhanced 1.00 .000
(continued on next page)
14
Table B.1 (continued )

Item Subset Mean Std. Deviation
Plowed Unenhanced .64 .485

Question Unenhanced .98 .131
Returned Unenhanced 1.00 .000
Rolled Unenhanced .66 .479
Traumatic Unenhanced .95 .223
Cottage Unenhanced .64 .485
Special Unenhanced .98 .131
Australia Unenhanced .97 .184
Betray Unenhanced .93 .256
Fundamental Unenhanced .95 .223
Raised Unenhanced .88 .329
Adorable Enhanced .98 .131
Basically Enhanced .98 .131
Clown Enhanced .93 .256
Happened Enhanced 1.00 .000
Interior Enhanced .91 .283
Lawyer Enhanced .97 .184
Mission Enhanced 1.00 .000
Review Enhanced .97 .184
Whereas Enhanced .83 .381
Actually Enhanced .95 .223
Arizona Enhanced .88 .329
Language Enhanced .98 .131
Nigeria Enhanced .72 .451
Overwhelming Enhanced .93 .256
Phoenix Enhanced .72 .451
Promise Enhanced 1.00 .000
Pursuit Enhanced .95 .223
Rescued Enhanced .98 .131
Table B.2
Accuracy averaged scores (max 1) and gains for enhanced target nonwords.
Time N M SD Lower Upper
500 ms synchronization 1 189 0.37 0.48 0.30 0.44

500 ms synchronization Gains 378 0.21 0.41 0.17 0.25
Unsynchronized enhancement 1 378 0.37 0.48 0.32 0.41
Unsynchronized enhancement Gains 756 0.21 0.40 0.18 0.24
Unenhanced captions 1 144 0.31 0.46 0.23 0.38
Unenhanced captions Gains 288 0.22 0.41 0.17 0.26
Uncaptioned 1 144 0.42 0.49 0.34 0.50
Uncaptioned 2 144 0.47 0.50 0.39 0.55
Uncaptioned Gains 288 0.17 0.38 0.13 0.22
Table B.3
Fixed coefficients for the logistic regression examining accuracy (enhanced subset).
Intercept − 1.33 0.52 − 2.54 .01* 0.03 0.46 − 2.36 − 0.30 − 2.16 − 0.37
Unsynchronized enhancement − 0.32 0.50 − 0.63 .53 − 1.31 0.67 − 1.38 0.76
Unenhanced captions − 0.67 0.78 − 0.86 .39 − 2.20 0.86 − 1.92 0.67
Uncaptioned 0.61 0.76 0.80 .42 − 0.88 2.09 − 0.80 1.87
Time 0.78 0.25 3.05 .002** 0.28 1.28 0.19 1.24
Unsynchronized enhancement*time 0.00 0.31 0.01 .99 − 0.61 0.61 − 0.62 0.66
Uncaptioned*time − 0.47 0.38 − 1.25 .21 − 1.20 0.27 − 1.27 0.39
**p < .01, *p < .05.
15
Table B.4
Results of pairwise contrasts for accuracy (enhanced subset).
Testing time Contrast estimate SE df z p Lower Upper
500 ms synchronization 1–2 − 0.78 0.25 Inf − 3.05 .10 − 1.61 0.05
300 ms synchronization 1–2 − 0.63 0.26 Inf − 2.45 .65 − 1.47 0.21
Unsynchronized enhancement 1–2 − 0.78 0.18 Inf − 4.35 <.001*** − 1.36 − 0.20
Unenhanced captions 1–2 − 0.86 0.29 Inf − 2.91 .16 − 1.81 0.10
Uncaptioned 1–2 − 0.31 0.28 Inf − 1.12 1.00 − 1.21 0.59
***
p < .001.
Table B.5
Fixed coefficients for the logistic regression examining accuracy (unenhanced subset).
Intercept 0.91 − 0.42 2.15 .03* 0.02 0.41 0.08 1.74 − 1.42 − 0.25
300 ms synchronization − 0.62 − 0.48 − 1.30 .19 − 1.56 0.31 − 0.21 1.39
Unenhanced captions 0.33 − 0.65 0.51 .61 − 0.94 1.59 − 1.41 0.73
Uncaptioned*time − 0.16 − 0.63 − 0.26 .80 − 1.41 1.08 − 0.97 1.15
Time − 0.55 − 0.17 − 3.24 <.001*** − 0.88 − 0.22 0.15 0.89
300 ms synchronization*time − 0.24 0.53 .59 − 0.34 0.60 − 0.62 0.37
Unenhanced captions*time − 0.09 − 0.32 − 0.28 .78 − 0.72 0.54 − 0.57 0.77
Uncaptioned*time − 0.02 − 0.31 − 0.05 .96 − 0.63 0.60 − 0.61 0.67
***p < .001, *p < .05.
Table B.6
Results of pairwise contrasts for accuracy (unenhanced subset).
500 ms synchronization 1–2 − 0.55 0.17 Inf − 3.24 0.03* − 1.08 − 0.02
300 ms synchronization 1–2 − 0.42 0.17 Inf − 2.48 0.37 − 0.95 0.11
Unenhanced captions 1–2 − 0.64 0.28 Inf − 2.34 0.55 − 1.50 0.22
Uncaptioned 1–2 − 0.57 0.27 Inf − 2.13 0.92 − 1.40 0.26
*
p < .05.
Table B.7
Reaction time averages and gains for enhanced target nonwords.
Time N M SD Lower Upper

Unsynchronized enhancement Gains 236 314.39 278.59 278.67 350.12
Unenhanced captions Gains 74 243.74 214.83 193.96 293.51
Uncaptioned 1 60 1466.55 356.43 1374.47 1558.62
Uncaptioned 2 68 1373.97 290.32 1303.70 1444.24
Uncaptioned Gains 76 278.78 248.12 222.09 335.48
Table B.8
Results of pairwise contrasts for reaction time (enhanced subset).
500 ms synchronization 1–2 0.18 0.03 Inf 6.80 <.001*** 0.09 0.27
(continued on next page)
16
Table B.8 (continued )

Unsynchronized enhancement 1–2 0.10 0.02 Inf 5.07 <.001*** 0.03 0.16
Unenhanced captions 1–2 0.02 0.03 Inf 0.72 1.00 − 0.08 0.13
Uncaptioned 1–2 0.06 0.03 Inf 2.08 1.00 − 0.03 0.16
***
p < .001.
Table B.9
Fixed coefficients for the RT gamma regression (unenhanced subset).
Intercept 7.49 0.09 86.85 <.001*** 0.08 0.28 7.32 7.66 7.40 7.58
Unenhanced captions 0.03 0.11 0.29 .77 − 0.18 0.24 − 0.10 0.16
Time − 0.08 0.03 − 2.90 <.001*** − 0.14 − 0.03 − 0.14 − 0.03
***
p < .001.
Table B.10
Results of pairwise contrasts for reaction time (unenhanced subset).
500 ms synchronization 1–2 0.08 0.03 Inf 2.90 .10 − 0.01 0.17
Unenhanced captions 1–2 0.09 0.02 Inf 5.17 <.001*** 0.03 0.14
Uncaptioned 1–2 0.08 0.03 Inf 2.55 .30 − 0.02 0.17
***
p < .001.
References
Alsadoon, R., & Heift, T. (2015). Textual input enhancement for vowel blindness: A study with Arabic ESL learners. The Modern Language Journal, 99(1), 57–79.
https://doi.org/10.1111/modl.12188
Bailly, G., & Barbour, W. (2011). Synchronous reading: Learning French orthography by audiovisual training. Proceedings of the Annual Conference of the International
Speech Communication Association, Interspeech, (May), 1153–1156.
Bird, S., & Williams, J. N. (2002). The effect of bimodal input on implicit and explicit memory: An investigation into the benefits of within-language subtitling. Applied
PsychoLinguistics, 23, 509–533.
Birulés-Muntané, J., & Soto-Faraco, S. (2016). Watching subtitled films can help learning foreign languages. PLoS One, 11(6), Article e0158409.
Broersma, M., & Cutler, A. (2008). Phantom word activation in L2. System, 36, 22–34. https://doi.org/10.1016/j.system.2007.11.003
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and
improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977
Charles, T., & Trenkic, D. (2015). The effect of bi-modal input presentation on second language listening: The focus on speech segmentation. In Y. Gambier, A. Caimi,
& C. Mariotti (Eds.), Subtitles and language learning (pp. 173–197) (Bern: Peter Lang).
Cobb, T.. Compleat Web VP v.2.1 [computer program] Accessed March 2019 at https://www.lextutor.ca/vp/comp/.
Conklin, K., Alotaibi, S., Pellicer-Sánchez, A., & Vilkaitė-Lozdienė, L. (2020). What eye-tracking tells us about reading-only and reading-while-listening in a first and
second language. Second Language Research, 36(3), 257–276. https://doi.org/10.1177/0267658320921496
Conklin, K., Pellicer-Sánchez, A., & Carrol, G. (2018). Eye-tracking: A guide for applied linguistics research. Cambridge: Cambridge University Press. https://doi.org/
10.1017/9781108233279
Cook, S. V., Pandža, N. B., Lancaster, A. K., & Gor, K. (2016). Fuzzy nonnative phonolexical representations lead to fuzzy form-to-meaning mappings. Frontiers in
Psychology, 7(Sep), 1–17. https://doi.org/10.3389/fpsyg.2016.01345
Darcy, I., Daidone, D., & Kojima, C. (2013). Asymmetric lexical access and fuzzy lexical representations in second language learners. The Mental Lexicon, 8(3),
372–420. https://doi.org/10.1075/ml.8.3.06dar
Darcy, I., & Holliday, J. J. (2019). Teaching an old work new tricks: Phonological updates in the l2 mental lexicon. In J. Levis, C. Nagle, & E. Todey (Eds.), Proceedings
of the 10th pronunciation in second language learning and teaching conference (pp. 10–26). Ames: Iowa State University.
Gerbier, E., Bailly, G., & Bosse, M. L. (2018). Audio–visual synchronization in reading while listening to texts: Effects on visual behavior and verbal learning. Computer
Speech & Language, 47, 74–92.
Ghia, E. (2012). Subtitling matters. New perspectives on subtitling and foreign language learning. Oxford: Peter Lang.
Godfroid, A. (2019). Eye-tracking in second language acquisition and bilingualism: A research synthesis and methodological guide. https://doi.org/10.4324/9781315775616
Han, Z., Park, E. S., & Combs, C. (2008). Textual enhancement of input: Issues and possibilities. Applied Linguistics, 29(4), 597–618. https://doi.org/10.1093/applin/
amn010
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical proficiency. EUROSLA Yearbook, 6, 147–168. https://doi.org/10.1075/eurosla.6.10har
17
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodology, and behavior. Frontiers in Neuroscience, 8, 150. https://doi.org/10.3389/
fnins.2014.00150
Jenkins, J. (2002). A sociolinguistically based, empirically researched pronunciation syllabus for English as an international language. Applied Linguistics, 23(1),
83–103. https://doi.org/10.1093/applin/23.1.83
Kam, E. F., Liu, Y. T., & Tseng, W. T. (2020). Effects of modality preference and working memory capacity on captioned videos in enhancing L2 listening outcomes.
ReCALL, 32(2), 213–230. https://doi.org/10.1017/S0958344020000014
Kostromitina, M., & Plonsky, L. (2021). Elicited imitation tasks as a measure of l2 proficiency: A meta-analysis. Studies in Second Language Acquisition, 1–26. https://
doi.org/10.1017/S0272263121000395
Kruger, J. L., & Steyn, F. (2014). Subtitles and eye tracking: Reading and performance. Reading Research Quarterly, 49(1), 105–120.
Lee, M., & Révész, A. (2020). Promoting grammatical development through captions and textual enhancement in multimodal input-based tasks. Studies in Second
Language Acquisition, 1–27. https://doi.org/10.1017/S0272263120000108
Leow, R. P. (2015). Explicit learning in the L2 classroom. New York: Routledge.
Levis, J. (2018). Intelligibility, oral communication, and the teaching of pronunciation. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108241564
Lindgren, E., & Muñoz, C. (2013). The influence of exposure, parents, and linguistic distance on young European learners’ foreign language comprehension.
International Journal of Multilingualism, 10(1), 105–129. https://doi.org/10.1080/14790718.2012.679275
Llompart, M., & Reinisch, E. (2018). Robustness of phonolexical representations relates to phonetic flexibility for difficult second language sound contrasts.
Bilingualism: Language and Cognition, 1, 16. https://doi.org/10.1017/S1366728918000925
Marian, V., Bartolotti, J., Chabal, S., & Shook, A. (2012). Clearpond: Cross-linguistic easy-access resource for phonological and orthographic neighborhood densities.
PLoS One, 7(8). https://doi.org/10.1371/journal.pone.0043230
Marsden, E., Mackey, A., & Plonsky, L. (2016). The IRIS Repository: Advancing research practice and methodology. In A. Mackey, & E. Marsden (Eds.), Advancing
methodology and practice: The IRIS repository of instruments for research into second languages (pp. 1–21). New York: Routledge.
Mayer, R. (2014). Cognitive theory of multimedia learning. In R. Mayer (Ed.), The cambridge Handbook of multimedia learning(cambridge handbooks in psychology (pp.
43–71). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139547369.005.
Mitterer, H., & McQueen, J. M. (2009). Foreign subtitles help but native-language subtitles harm foreign speech perception. PLoS One, 4(11), Article e7785.
Montero Pérez, M., Peters, E., & Desmet, P. (2015). Enhancing vocabulary learning through captioned video: An eye-tracking study. The Modern Language Journal, 99
(2), 308–328. https://doi.org/10.1111/modl.12215
Muñoz, C. (2008). Symmetries and asymmetries of age effects in naturalistic and instructed L2 learning. Applied Linguistics, 29(4), 578–596. https://doi.org/10.1093/
applin/amm056
Muñoz, C. (2014). Contrasting effects of starting age and input on the oral performance of foreign language learners. Applied Linguistics, 35(4), 463–482.
Ortega, L., Iwashita, N., Norris, J. M., & Rabie, S. (2002). An investigation of elicited imitation tasks in crosslinguistic SLA research (Presentation given at Second Language
Research Forum, Toronto).
Pattemore, A., & Muñoz, C. (2020). Learning L2 constructions from captioned audio-visual exposure: The effect of learner-related factors. System, 93, Article 102303.
https://doi.org/10.1016/j.system.2020.102303
Pellicer-Sánchez, A. (2015). Developing automaticity and speed of lexical access: The effects of incidental and explicit teaching approaches. Journal of Spanish
Language Teaching, 2(2), 126–139. https://doi.org/10.1080/23247797.2015.1104029
Ramus, F., Peperkamp, S., Christophe, A., Jacquemot, C., Kouider, S., & Dupoux, E. (2010). A psycholinguistic perspective on the acquisition of phonology. Laboratory
Phonology 10: Variation, Phonetic Detail and Phonological Representation, 311–340. Retrieved from papers3://publication/uuid/DBE40588-7CFD-4A08-A723-
CC16C3497081.
Rodgers, M. (2013). English language learning through viewing television: An investigation of comprehension, incidental vocabulary acquisition, lexical coverage, attitudes, and
captions [Doctoral dissertation, Victoria University of Wellington] http://hdl.handle.net/10063/2870.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129–158. https://doi.org/10.1093/applin/11.2.129
Sharwood Smith, M. (1991). Speaking to many minds: On the relevance of different types of language information for the L2 learner. Second Language Research, 7(2),
118–132. https://doi.org/10.1177/026765839100700204
Sharwood Smith, M. (1993). Input enhancement in instructed SLA. Studies in Second Language Acquisition, 15(2), 165–179. https://doi.org/10.1017/
S0272263100011943
Showalter, C. (2019). Russian phonolexical acquisition and orthographic input: Naïve learners, experienced learners, and interventions. Studies in Second Language
Acquisition, 1–23. https://doi.org/10.1017/S0272263119000585
Van Zeeland, H., & Schmitt, N. (2013). Incidental vocabulary acquisition through L2 listening: A dimensions approach. System, 41(3), 609–624.
Williams, J. N., & Paciorek, A. (2016). Indirect tests of implicit linguistic knowledge. In Advancing methodology and practice: The IRIS repository of instruments for
research into second languages (pp. 23–42). Taylor and Francis. https://doi-org.sire.ub.edu/10.4324/9780203489666.
Winke, P. M. (2013). The effects of input enhancement on grammar learning and comprehension. Studies in Second Language Acquisition, 35(2), 323–352. https://doi.
org/10.1017/S0272263112000903
Wisniewska, N., & Mora, J. C. (2018). Pronunciation learning through captioned videos. In J. Levis (Ed.), Proceedings of the 9th annual pronunciation in second language
learning and teaching conference (pp. 204–215). Ames: Iowa State University.
Wisniewska, N., & Mora, J. C. (2020). Can captioned video benefit second language pronunciation? Studies in Second Language Acquisition, 42(3), 599–624.
Further reading
Bisson, M. J., Van Heuven, W. J. B., Conklin, K., & Tunney, R. J. (2014). Processing of native and foreign language subtitles in films: An eye tracking study. Applied
PsychoLinguistics, 35(2), 399–418. https://doi.org/10.1017/S0142716412000434
18

Article Research Hanisa Insyarani

Uploaded by

Copyright:

Available Formats

Article Research Hanisa Insyarani

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Article Research Hanisa Insyarani

Uploaded by

Copyright:

Available Formats

System 116 (2023) 103078

Contents lists available at ScienceDirect

Audio-synchronized textual enhancement in foreign language

2.1. Phonolexical representations in language processing

2.2. Multimodal input processing and pronunciation learning

2.3. Multimodal input enhancement and synchronization

Our predictions are that.

M SD 95% CI M SD 95% CI M SD 95% CI M SD 95% CI

95% at 2000-word families to provide sufficient comprehension to intermediate L2 learners.

4.2.2. Target words

4.2.3. Lexical decision task

Actually adverb 8 6 2 322.33 consonant

Adorable 2 1 Medial 3220 481, 593 184 × 52 580

Fig. 1. Viewing phase flowchart.

5.1. Simultaneous visual and auditory processing

5.1.1. Preliminary analysis

5.1.2. Total fixation duration

5.1.3. Skipping probability

5.1.4. Fixation distance

5.1.5. Effects of proficiency on gaze behavior

5.1.6. Summary of RQ1 results

Estimate SE z p R2m R2c Lower Upper Lower Upper

Estimate SE z p R2m R2c Lower Upper Lower Upper

Estimate SE df t value p R2m R2c Lower Upper Lower Upper

***p < .001, *p < .05.

5.2. Phonolexical update

5.2.1. Preliminary analysis

5.2.2. Accuracy rate in the lexical decision task

5.2.3. Reaction times in the lexical decision task

5.2.4. Effects of proficiency on LDT gains

5.2.5. Summary of RQ2 results

Fig. 3. Average reaction times by time by condition.

Estimate SE z p R2m R2c Lower Upper Lower Upper

***p < .001, **p < .01, *p < .05.

7. Conclusion and limitations

The participants signed informed consent forms before data collection.

Declaration of competing interest

Appendix A. Research question 1 models

95% Confidence Intervals

N M (ms) SD (ms) Lower Upper

500 ms synchronization 129 338.14 193.35 304.46 371.82

95% Confidence Intervals

Contrast estimate SE z ratio p Lower Upper

95% Confidence Intervals

Estimate SE z p R2m R2c Lower Upper Lower Upper

95% Confidence Intervals

500 ms synchronization 171 24.56% 43.17 18.04 31.08

95% Confidence Intervals

Contrast estimate SE z ratio p Lower Upper

95% Confidence Intervals

Estimate SE z p R2m R2c Lower Upper Lower Upper

95% Confidence Intervals

N Mdn1 (ms) IQR2 (ms) Lower Upper

500 ms synchronization 128 136.90 615.95 4.48 264.11

95% Confidence Intervals

Contrast estimate SE df t ratio p Lower Upper

95% Confidence Intervals

Estimate SE df t p R2m R2c Lower Upper Lower Upper

Appendix B. Research question 2 data and models

p < .001, p < .01, p < .05.