Talrobot
Academic Dissertation which, with due permission of the KTH Royal Institute of Technology,
is submitted for public defence for the Degree of Doctor of Philosophy on Friday the 22nd
March 2024, at 10:00 a.m. in F3, Lindstedtsvägen 26, Stockholm.
ISBN 978-91-8040-858-5
TRITA-EECS-AVL-2024:23
Abstract
This thesis investigates how social robots can support adult second language
(L2) learners in improving conversational skills. It recognizes the challenges
inherent in adult L2 learning, including increased cognitive demands and the
unique motivations driving adult education. While social robots hold potential
for natural interactions and language education, conversational skill practice
with adult learners remains underexplored. Thus, the thesis contributes to
understanding these conversational dynamics, enhancing speaking practice, and
examining cultural perspectives in this context.
To begin, this thesis investigates robot-led conversations with L2 learners,
examining how learners respond to moments of uncertainty. The research re-
veals that when faced with uncertainty, learners frequently seek clarification,
yet many remain unresponsive. As a result, effective strategies are required
from robot conversational partners to address this challenge. These interac-
tions are then used to evaluate the performance of off-the-shelf Automatic
Speech Recognition (ASR) systems. The assessment highlights that speech
recognition for L2 speakers is not as effective as for L1 speakers, with perfor-
mance deteriorating for both groups during social conversations. Addressing
these challenges is imperative for the successful integration of robots in con-
versational practice with L2 learners.
The thesis then explores the potential advantages of employing social
robots in collaborative learning environments with multi-party interactions.
It delves into strategies for improving speaking practice, including the use of
non-verbal behaviors to encourage learners to speak. For instance, a robot’s
adaptive gazing behavior is used to effectively balance speaking contributions
between L1 and L2 pairs of participants. Moreover, an adaptive use of encour-
aging backchannels significantly increases the speaking time of L2 learners.
Finally, the thesis highlights the importance of further research on cultural
aspects in human-robot interactions. One study reveals distinct responses
among various socio-cultural groups in interaction between L1 and L2 partic-
ipants. For example, factors such as gender, age, extroversion, and familiarity
with robots influence conversational engagement of L2 speakers. Addition-
ally, another study investigates preconceptions related to the appearance and
accents of nationality-encoded (virtual and physical) social robots. The re-
sults indicate that initial perceptions may lead to negative preconceptions,
but that these perceptions diminish after actual interactions.
Despite technical limitations, social robots provide distinct benefits in
supporting educational endeavors. This thesis emphasizes the potential of
social robots as effective facilitators of spoken language practice for adult
learners, advocating for continued exploration at the intersection of language
education, human-robot interaction, and technology.
Sammanfattning
List of Papers
Other contributions by the author that are not included in the thesis:
Contents
Acknowledgement
Acronyms
Contents
I Overview
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Outline
4.4 Conclusions
5 Technical Framework
5.1 Robot Furhat
5.2 Initial Dialogue System
5.3 Taboo Game System
5.4 General Interactive Components
6 Understanding Conversations
6.1 Uncertainty, Confusion or Doubts
6.2 Speech Recognition with L2 Speakers
6.3 Pathways to Explore
8 Cultural Perspectives
8.1 Cultural Effects on Social Robots
8.2 Cultural Stereotypes and Social Robots
9 Paper Contributions
9.1 Paper A
9.2 Paper B
9.3 Paper C
9.4 Paper D
9.5 Paper E
9.6 Paper F
9.7 Paper G
9.8 Paper H
References
II Included Papers
Part I
Overview
Chapter 1
Introduction
1.1 Motivation
In the domain of adult education, whether at the initial stages or for periods
of extended learning, there are complexities that make this educational journey
different from that of younger learners. One of the most notable differences lies
in the motivation that drives adults to embark on this journey. Adults often
enroll in new educational phases because of intricate life events, such as changes
in their professional trajectories, the desire for career advancement, or, more
significantly, migration. Over the past years, there has been a steady increase
in the number of individuals relocating to different countries, signaling a shift
in population dynamics and societal structures (McAuliffe and Khadria, 2019).
Within this evolving landscape, one of the most common educational objectives
for adults is learning a second language (L2), as the ability to learn or speak
the language of the destination country is one of the crucial components for
successful integration into a new society (Adsera and Pytlikova, 2012; McAuliffe
and Khadria, 2019). However, learning a second language in adulthood does not
come without complications. Primarily, this process presents a significant
cognitive challenge for adults, given the considerable reduction in language
learning rate that occurs after adolescence (Steber and Rossi, 2021).
Additionally, migrant L2 learners may face limited social interaction within the
local community (Barraja-Rohan, 2011; Baynham, 2006; Li and Kaye, 1998), which
can hinder their learning process. Consequently, providing support for adult
learners, especially those who are learning a second language while navigating
migration, becomes particularly important for the prosperity and well-being of
modern societies (Kloubert and Hoggan, 2021).
In this context, it is essential to identify the areas where adult language learn-
ers could benefit most from additional assistance and, in particular, whether
technology could offer this support. For this population, there’s a greater empha-
sis on practicing speaking skills to achieve interactive competence —the ability to
Figure 1.1: Areas of research within robot assisted language learning. Values sum-
marized from the survey of robot-assisted language learning by Randall (2019).
3 https://www.duolingo.com
4 https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data
2006), even in situations involving L2 learners (Engwall et al., 2022; Johnson and
Valente, 2009). However, errors in recognition might still place learners in an
unfavorable position, particularly when feedback is expected. This becomes im-
portant as they often need sufficient assistance to tackle challenges throughout a
practice conversation. The effectiveness of these qualities in social robots, hence,
may still fall short of convincingly reproducing the characteristics of L1 speak-
ers, or educators, who possess the ability to evaluate conversational speaking or
provide adequate feedback.
Nevertheless, this thesis argues that social robots can assume a supportive
role in enhancing settings where learners practice conversational skills. This idea
advocates for:
These propositions serve as the foundation for the thesis and are formalized
in the contributions outlined in the following section.
1.2 Contributions
It seems evident, then, that social robots have distinct characteristics that can
substantially support the development of communicative skills in second language
practice. However, this proposition raises several questions. This thesis, hence,
aims to embark on an initial exploration of:
Figure 1.2: Visual representation of the contributions described in this thesis. For
detailed contributions of each published academic paper, refer to Chapter 9.
Cultural Perspectives
This process then presents another aspect that this thesis aims to address: the
cultural richness of second language learning. Language functions as a window
into one’s cultural background, transforming the process of learning a new
language into a rich exchange of cultural intricacies. While this phenomenon is
well explored within the social sciences, especially in linguistic studies (Genc
and Bada, 2005; Kuo and Lai, 2006), the understanding remains somewhat limited
when it comes to the involvement of social robots. The final phase of this thesis
thus offers a look into how cultural characteristics shape the interaction in
linguistically diverse settings. The objective is to offer insights into how
cultural nuances come into play in language practice with a social robot, as
evidenced in Paper F. Additionally, the thesis delves into the preconceptions
that individuals may develop regarding culturally-encoded social robots, as
illustrated in Paper G.
1.3 Outline
In the chapters that follow, we will present the three main theoretical frameworks
that have played a crucial role in providing a conceptual structure for understand-
ing how social robots can enhance the practice of a second language with adult
learners. The first framework, presented in Chapter 2, focuses on the elements
surrounding the development and utilization of social robots. This framework ex-
Chapter 2
Background: Robot-assisted L2 learning
This chapter explores the evolution of Robot Assisted Language Learning (RALL),
tracing its origins back to studies within Computer-Assisted Language Learning
(CALL). The chapter examines the inherent advantages of social robots, empha-
sizing their role in multi-party interaction. Towards the conclusion, a summary
is provided regarding people’s attitudes and perceptions towards social robots.
These insights will serve as a foundation to inform the suggested role that so-
cial robots can take in enhancing spoken language practice for second language
learners.
From the review conducted by Randall (2019), the overall benefits of having
a robot involved in a language learning process include: (1) robots can aid in
learning when used as an accompaniment to human instruction, (2) they have a
positive effect on learners’ affective states (e.g. confidence, anxiety, and
motivation) and (3) they may offer advantages when used to foster speaking
ability. To optimize these benefits, human-robot interactions must be carefully
designed, as suggested by Belpaeme et al. (2018b). Their recommendations include
focusing on meaningful interactions to actively engage the learner, adapting
interactions to the learner and domain, and considering the duration and
intensity of the intervention. Furthermore, Belpaeme et al. (2018b) highlight the
importance of factors such as the robot’s role, type of feedback, and verbal and
non-verbal behaviors. Engwall and Lopes (2022) reaffirm these principles for
adult learners. While the embodiment of the robot may be assumed to inherently
benefit learning, research suggests that evaluating only the effects of
embodiment may not necessarily lead to substantial learning outcomes (Gordon et
al., 2015; Westlund et al., 2015). These principles can be extended to all
learning contexts, despite being primarily derived from research focused on
younger participants.
Multi-party Conversations
Given the benefits that social robots offer for group conversation, in comparison
to other technologies, research within RALL has been devoted to exploring this
setting. For instance, Khalifa et al. (2017) presented a joining-in RALL
system with two humanoid robots playing a teacher and an “advanced” peer
role. The interaction between the robots and the learner was designed to smoothly
switch between tutoring and implicit learning. The results revealed that
repetitive queries of specific grammatical expressions prompted by the robot
consistently improved correct usage, with substantially greater improvement when
the peer learner robot provided assistance for implicit learning, compared to
scenarios without robot assistance. In a follow-up work, Khalifa et al. (2019)
applied the same approach to improve practical communication skills. The authors
found that repetitive implicit learning sessions increased appropriate
grammatical pattern usage. Post-presentation of a reference proved as effective
as pre-presentation, especially for retention, functioning as corrective feedback
for learners.
However, robots can take a much more influential role in group interactions.
Sebo et al. (2020) highlight the profound impact that robots can have on group
dynamics through their behaviors, their assigned roles within the groups, and
their appearance and capabilities. For example, a robot’s verbal and nonverbal
behaviors can actively shape interactions among group members, extending beyond
the direct interaction with the robot itself. Importantly, Sebo et al. (2020)
note that the effectiveness of these results is strongly related to how most of
these robots (highly anthropomorphic and with human-like modalities of
interaction) fulfill the role of a human member of the group. These
considerations, nonetheless, are beneficial for the purpose of supporting L2
practice conversations in group settings.
2.4 Conclusions
Existing computer-assisted and robot-assisted language learning systems often
offer learners constrained speaking interactions which may not fully address the
development of broader conversational skills. Technical challenges, including re-
duced speech recognition performance for L2 speakers, can further impede the
introduction of more interactive activities. Despite these obstacles, the reported
reduction in anxiety levels and the positive impact on learners’ emotional states
highlight the potential benefits of these interactive technologies. In particular,
considering the advantages associated with social robots in facilitating insightful
real-world interactions, they emerge as a promising option for advancing language
learning, with a focus on enhancing speaking and conversational skills.
Chapter 3
Background: Conversation and L2 learning
Since practice conversations are a central focus of this work, it’s crucial to begin by
establishing foundational concepts for analyzing human conversations. In doing
so, this thesis adopts the perspective of Conversation Analysis (CA) (Seedhouse,
2005), using it not as a methodology, but rather as a framework for interpreting
social interactions. Within this framework, this chapter highlights key factors
contributing to fluent spoken communication and how these elements are used
to understand and support the process of conversation practice with L2 learners.
Finally, in line with the underlying motivation of this work, this chapter provides
a comprehensive background on communicative and interactive competence as
the central objective in learning a second language for adult (migrant) learners.
3.1 Conversations
Building upon the background presented in Chapter 2, which discusses how robots
can improve conversational practice, particularly in multi-party settings, this sec-
tion introduces the concepts of common ground, turn-taking, and conversational
cues. A special focus is directed towards potential challenges arising in the form
of speaking or understanding during spoken interactions, as these challenges are
expected to manifest themselves in conversations with L2 learners.
Common Ground
Paul Grice, in his 1967 William James lectures, introduced the idea of “common
ground” —without explicitly using this term— that would become central to
the field of pragmatics (Geurts, 2019). As originally proposed, common ground
refers to the shared knowledge, beliefs, and assumptions between interlocutors
that facilitate successful communication. Grice emphasized that speakers convey
meaning beyond the literal interpretation of their words, noting the importance
of cooperative principles and conversational maxims to communicate efficiently
(Neale, 1992). Over time, this concept has been explored from various angles, such
as Lewis’s “common knowledge” (Lewis, 1969) and Schiffer’s “mutual knowledge”
(Schiffer, 1972). Clark and Schaefer (1989) defined it as the mutual agreement
among conversation participants that they have understood each other “to a
criterion sufficient for current purposes”.
Stalnaker (2002) coined the term common ground, describing it as the sum
of interlocutors’ mutual, common, or joint beliefs and knowledge. Once informa-
tion achieves common ground status, participants need not invest further efforts
into redefining or clarifying it (Knutsen and Le Bigot, 2012). Nonetheless, com-
plications may occur in the form of misunderstandings and non-understandings
(Hirst et al., 1994), where the former denotes an incorrect interpretation of the
speaker’s intention, while the latter signifies a complete absence or minimal con-
fidence in any interpretation (Skantze, 2007). In such instances, problems are
usually corrected through repair mechanisms initiated by the participants of the
conversation as the dialogue unfolds. Preferences regarding which interlocutor is
expected to manage these repairs vary (Skantze, 2005), although there is a slight
inclination towards self-repair. Self-repairs occur when a speaker spontaneously
corrects or revises their own utterance during a conversation, while other-repairs,
on the other hand, involve one participant in a conversation initiating a correction
or clarification for something said by another participant. For example, Kendrick
(2015) showed that the time interval before other-initiated repairs is longer than
the typical delay in turn-taking, suggesting a deliberate communicative act aimed
at prompting the speaker to engage in self-repair.
Understanding common ground and the mechanisms for resolving miscommunications
is essential in the proposed context of L2 speaking practice. As will be
elaborated in Chapter 6, based on the results of Paper A and Paper B, learners
may request clarification from a robot but could also choose to remain silent,
either because they are trying to decode what the robot said or because this
reaction could signal an implicit request for clarification.
Turn-taking
Moving forward, the concept of turn-taking encompasses the dynamics surrounding
the organization of speaking turns in a conversation. Among the earliest
models proposed, Sacks et al. (1974) suggested that the coordination of turns
is not pre-planned but evolves dynamically during the dialogue. Their model
primarily adheres to the principle of “one party speaks at a time”, even though
transitions can easily, and frequently, occur without a gap, or even with overlap.
Alternative models have instead emphasized the overlapping nature of
conversations, whereby interlocutors can develop “more than one floor” and take
turns as deemed functionally appropriate (Edelsky, 1981; Schegloff, 2000).
Notably, participants in a conversation can use various cues and signals to
indicate when a
Backchannels
The utilization of conversational cues to shape an interaction is not limited to
the participant holding the speaking turn. In what is usually referred to as
active listening, participants in a conversation often use verbal (e.g. “yeah”,
“uh-huh”) and non-verbal (e.g. head nods or smiles) cues to demonstrate attention
to the speaker. The terminology for these signals tends to differ across the
literature, including “listener responses” (Dittmann and Llewellyn, 1968),
“accompaniment signals” (Kendon, 1967) and “backchannels” (Yngve, 1970). In this
work, we use the last of these terms: backchannel. As described by Wolf (2008),
early research
—focused primarily on American English— proposed a discrete classification of
backchannels, starting as short messages to indicate interest and attention (Yn-
gve, 1970) and further developed to include clarification requests, sentence com-
pletions, brief restatements, and nonverbal messages (Duncan, 1974; Duncan and
Fiske, 1977). Schegloff (1982) further suggested that backchannels “must instead
be analyzed in view of their interactive functions within discourse”, proposing
them to be “continuers” and having a regulative function (Wolf, 2008). However,
in practice, their interpretation, from the perspective of the listener, can be eas-
ily modified due to differences in timing and form (Kawahara et al., 2015). For
instance, the English expressions “oh” or “okay” used at the end or toward the
end of a turn might be interpreted as an attempt to take the floor or even signal
the end of the sequence (Goodwin, 1986; Schegloff, 1982). Conversational cues,
in this manner, assume a multifaceted role, impacting not just the flow of the
dialogue but also molding the dynamics of participation and engagement within
the interaction. These cues are the focus of Paper E that will be discussed in
Chapter 7.
Gaze
Expanding on non-verbal cues, gaze occupies a distinct role in shaping
conversations. As summarized by Mutlu et al. (2012), gaze serves as a valuable
signal for defining conversation roles (Goodwin, 1981), facilitating turn-taking,
and providing information on the speaker’s discourse (primarily through gaze
shifts). This thesis explores the first two of these functions. In determining
participants’ roles within a conversation, especially in those involving more
than two individuals, interlocutors direct their gaze towards specific
participants to clarify who is being addressed (Goodwin, 1981; Sacks et al.,
1974; Schegloff, 1982). The absence of this mechanism, i.e. not gazing towards an
intended addressee, may lead to breakdowns in the organization of the
conversation (Schegloff, 1982). Gaze also serves as a crucial cue in aiding
turn-taking, providing clarity on which speaker holds the turn (Kendon, 1967) or
facilitating turn exchanges, whether through simple single-floor turns (Goodwin,
1981; Goodwin et al., 1980; Sacks et al., 1974) or with overlapping speech
(Schegloff, 2000). Chapter 7 will describe the exploration of an adaptive robot
gaze behavior presented in Paper D.
rized by Ferreira et al. (2007), there are several factors affecting the effectiveness
of these strategies, including the “specific aspects of language being corrected,
conditions relating to the provision of teacher correction, and characteristics of
the students” (e.g., considering the difference among proficiency levels). Further-
more, educators employ various strategies within corrective feedback, including
the repetition of errors, reformulation of all or part of the student’s answer (re-
cast), explicit correction, or providing the correct answer when uncertain (Ferreira
and Atkinson, 2008). When teachers aim to solicit a response from students, they
may question the correctness of the student’s utterance, requesting clarification
when an utterance is ill-formed (and soliciting a repetition or reformulation), or
directly eliciting a correction by pausing to allow the student to complete the
reformulation (Ferreira and Atkinson, 2008). These findings underscore the in-
tricate nature of corrective feedback in L2 learning and highlight the importance
of considering various factors when implementing feedback strategies in language
education.
Implementing appropriate feedback from the robot poses a challenge due to
these complex requirements. In the context of this thesis work, this challenge
is amplified as the robot leads the practice conversations. Therefore, we suggest
refraining from providing specific feedback in this setting. Instead, using a
multi-party setting may present a potential solution, allowing peers to fulfill
the role of providing some (communicative) feedback.
3.5 Conclusions
The theoretical frameworks that underlie the learning of communicative skills
in a second language create a compelling need for interactions that allow
participants to engage in practice conversations. However, this requirement
encounters challenges when translated into interactive technologies. As
demonstrated in the preceding sections, the use of practice conversations in a
second language involves navigating various dimensions, encompassing
comprehension and production challenges faced by language learners.
An additional aspect that arises is that of socio-cultural factors within the
domain of L2 learning. Therefore, it becomes crucial to incorporate this contex-
tual dimension into the development of the thesis work. These topics are further
discussed in the next chapter.
Chapter 4
Background: Culture and L2 learning
In the preceding chapters, this thesis has described the role of language as a
path for exploring the cultural dimensions inherent in L2 learning. As Baldwin
et al. (2006) highlight, the term “culture” embodies a complex and multifaceted
concept, encompassing a variety of collective beliefs, values, customs, practices,
behaviors, and social institutions defining a specific group of people. Culture
should, furthermore, be recognized as a dynamic concept, evolving over time in
response to internal and external factors. Building on this foundation, the current
chapter explores critical elements that clarify how cultural backgrounds impact
the dynamics shaping spoken interactions and the subsequent development of
social robots within this context.
interactions with computerized systems, where accents can serve as strong cues
in how users perceive computers as social actors (Dahlbäck et al., 2007; Nass and
Brave, 2005; Reeves and Nass, 1996). Accents in speech, furthermore, contribute
to the categorization of a speaker’s background, affective state and identity, with
accent often prevailing over physical appearance (Krenn et al., 2017; McGinn and
Torre, 2019). For example, in a study involving virtual agents, Khooshabeh et al.
(2017) found that accents in American English were perceived as foreign by indi-
viduals who did not share the agent’s simulated mixed background but increased
perceived shared social identity among those who shared a mixed background.
Indeed, the design of culturally-diverse social agents requires expertise from
various disciplines, including cross-cultural psychology and computer science
(Degens et al., 2017). While stereotypes based on features like voice and
appearance influence people’s perceptions of social agents, often aligning with
their own social beliefs, it’s essential to acknowledge that stereotypes can
carry negative connotations. At times, researchers may inadvertently introduce
their own biases into the design process, making it challenging to recognize this
phenomenon (see Yin et al., 2010). Hence, the design of culturally rich agents
should be subject to strict evaluation and comprehensive analysis.
4.4 Conclusions
The exploration of culture in the context of second language learning and social
robotics reveals the intricate interplay of linguistic, social, and cultural dimen-
1 Many international bodies, including the UN General Assembly and the European Union,
dismiss the concept of distinct human races. Here it is only used to exemplify erroneous terms in
various academic works.
Chapter 5
Technical Framework
The contributions presented in this thesis work have all required substantial de-
velopment of technical frameworks that allowed for the generation of multi-party
interactions with a robot. This chapter presents the main components used to
develop these frameworks.
1 https://furhatrobotics.com/
Figure 5.1: Photos taken during experiments and demos with the robot Furhat
(left) and virtual version (right).
Figure 5.2: A reduced FSM state tree from a social conversation used in the initial
dialogue system employed in Paper A and B.
The human in charge of controlling the robot, known as the wizard 4, used an interface
that received the list of possible robot responses at each state, as shown in the
lower-left segment of Figure 5.3. Additionally, the wizard always had the option
4 The term originates from the metaphor of the Wizard of Oz, where a human operator
simulates advanced functionalities of a computer-based system, much like the title
character of the story The Wonderful Wizard of Oz.
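The combination of a finite-state dialogue tree and a wizard interface can be illustrated with a minimal sketch. It is not the system used in the thesis: the state names, prompts, shortcut keys and the rule that static default responses leave the current state unchanged are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    name: str
    prompt: str  # what the robot says when this state is entered
    options: dict = field(default_factory=dict)  # shortcut key -> next state name


# Hypothetical three-state conversation tree (illustrative content only).
TREE = {
    "greet": State("greet", "Hi! What is your name?", {"1": "hobby", "2": "clarify"}),
    "clarify": State("clarify", "Sorry, could you say that again?", {"1": "hobby"}),
    "hobby": State("hobby", "What do you like to do in your free time?"),
}

# Static default responses, available in every state (assumed examples).
DEFAULTS = {"r": "Could you repeat that?", "y": "Yes.", "n": "No."}


class WizardDialogue:
    def __init__(self, tree, start):
        self.tree = tree
        self.current = tree[start]

    def dynamic_options(self):
        """Robot responses offered to the wizard in the current state."""
        return {key: self.tree[target].prompt
                for key, target in self.current.options.items()}

    def press(self, key):
        """Return the robot utterance for a shortcut key.

        Dynamic options move the dialogue to the next state; static
        defaults are spoken without changing state (an assumption).
        """
        if key in self.current.options:
            self.current = self.tree[self.current.options[key]]
            return self.current.prompt
        if key in DEFAULTS:
            return DEFAULTS[key]
        raise KeyError(f"no response bound to key {key!r}")
```

In this sketch, the list returned by `dynamic_options()` plays the role of the state-dependent response list in the interface, while `DEFAULTS` stands in for the static list that is always available.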
5 https://www.ros.org/
6 https://github.com/ronaldcumbal
Figure 5.3: Wizard control interface. Top left shows the list of (static) default
responses. Bottom left has the list of (dynamic) robot responses updated on every
state of the dialogue tree. The corresponding shortcut keys for these responses are
shown at the right side of each option (middle of the image) in grey circles. Bottom
right presents the buttons to control the recording of the interaction.
ROS Master plays a crucial role by providing naming and registration services
to nodes, as well as tracking publishers and subscribers to topics.
The overall framework used in this doctoral work is illustrated in Figure 5.4.
Figure 5.4: General structure used for experimental setup within the ROS frame-
work. Nodes are shown with dashed-line boxes, while arrows represent Topics.
Dotted boxes denote simple configuration or logging files.
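The registration and topic-tracking role described above can be illustrated with a toy publish/subscribe registry in plain Python. This is a conceptual sketch only and does not use the ROS API: the class `ToyMaster`, the node names and the topic name are hypothetical. In ROS itself, the master only brokers the connection and messages then travel directly between nodes; the toy below routes messages through the registry for brevity.

```python
from collections import defaultdict


class ToyMaster:
    """Tracks which nodes publish and subscribe to each topic (toy model)."""

    def __init__(self):
        self.publishers = defaultdict(set)    # topic -> names of publishing nodes
        self.subscribers = defaultdict(dict)  # topic -> {node name: callback}

    def register_publisher(self, node, topic):
        self.publishers[topic].add(node)

    def register_subscriber(self, node, topic, callback):
        self.subscribers[topic][node] = callback

    def publish(self, node, topic, message):
        """Deliver a message to every subscriber of the topic.

        Returns the number of subscribers that received the message.
        """
        if node not in self.publishers[topic]:
            raise ValueError(f"{node} is not registered as a publisher on {topic}")
        for callback in self.subscribers[topic].values():
            callback(message)
        return len(self.subscribers[topic])


# Usage: a hypothetical VAD node publishes, a dialogue node listens.
master = ToyMaster()
received = []
master.register_publisher("vad_node", "/vad")
master.register_subscriber("dialogue_node", "/vad", received.append)
master.publish("vad_node", "/vad", {"speaking": True})
```

The point of the pattern is decoupling: the publishing node does not need to know which components consume its output, so logging or control nodes can be added without changing the publisher.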
Speech Recognition Following the work presented in Paper C, the main Speech
Recognition service used throughout these experiments was Microsoft’s Azure
Speech-to-Text 7. Typically, this system was employed with its default
settings, modifying the language code when required.
Voice Activity A Voice Activity Detector (VAD) is a model trained or designed
to identify periods of speech and silence in an audio signal. In Paper D
and Paper E, a version of the py-webrtcvad 8 VAD was employed. The settings
for this model were not changed.
Microphones Paper A and B used USB headsets to capture participants’ voice
signals during interactions with the robot. However, this approach introduced
undesirable levels of environmental noise and cross-talking. In Paper D a
different strategy was adopted by employing two USB headsets to handle the
VAD and ASR modules with separate inputs. Paper E instead used head-mounted
Shure Model WH20 professional microphones, achieving improved
speaker diarization. Paper H used a microphone array and professional
microphones to enhance both speaker diarization and speech recognition.
7 https://azure.microsoft.com/en-us/products/ai-services/speech-to-text
8 https://github.com/wiseman/py-webrtcvad
5.4. GENERAL INTERACTIVE COMPONENTS 35
Figure 5.5: Scheme of the system architecture, showing the components that gen-
erate confidence scores for their respective outputs.
Chapter 6
This chapter describes the initial attempts to explore robot-led practice
conversations with learners of a second language. The information presented here is
extracted from Paper A, Paper B and Paper C, supplemented with additional
insights into the conducted studies.
In particular, this chapter argues that, while it is feasible to carefully develop
(constrained) practice conversations between L2 learners and autonomous social
robots, engaging with L2 learners introduces greater complexities
that demand thorough evaluation. These challenges, which arise especially at lower
proficiency levels, require specific assistance, and current technical constraints may
hinder this objective.
38 CHAPTER 6. UNDERSTANDING CONVERSATIONS
Figure 6.1: Dyad social conversation between L2 learners and the robot Furhat.
6.1. UNCERTAINTY, CONFUSION OR DOUBTS 39
Figure 6.2: Video replay with self-reporting questions focused on uncertainty. The
faces were intentionally blurred solely for publication purposes and were not ob-
scured in the actual system.
Figure 6.3: Variation of uncertainty across four different types of learners’ reactions.
This occurred because the video clips presented to the participants displayed
incorrect moments of the conversation, failing to show the instances where the
robot intentionally tried to elicit moments of uncertainty. Hence, the self-reported
values of confusion were attached to random moments of the conversation.
Paper A presents findings from a visual analysis conducted on the recordings
of robot-led conversations. This section focuses instead on the
results pertaining to the overall outcomes of our intended manipulation. Among
all the events manipulated to induce uncertainty in the learners, it was observed
that two-thirds of the more challenging robot utterances, characterized by higher
speed or complexity, resulted in confused reactions. Interestingly, many learners
were able to comprehend the modified output of the robot by grounding the
interaction in the recent dialogue context. Out of the thirty-six instances where
uncertainty was successfully triggered, thirty-two led to a clarification request.
However, in four instances, the learners did not provide a response. This
non-response hints at potential challenges or complexities in the learners’
interactions with the robot.
Based on these discoveries, our objective was to comprehensively examine the
entire collection of recordings to enhance our understanding of how moments of
uncertainty manifest themselves in conversation practice. The analysis started by
identifying instances of uncertainty that were effectively generated through mod-
ifications in the robot’s spoken output and progressed through the entire range
of interactions. This analysis led to a clear realization that the concept of uncer-
tainty should not be simplified as a binary phenomenon. Therefore, (un)certainty
was reinterpreted as a continuum ranging from absolute confidence to complete
unresponsiveness, as shown in Figure 6.3. In the initial analysis, and with a few
refinements thereafter, four specific reactions were identified that corresponded
appropriately to the range of confidence displayed in learners’ responses to the
robot’s input during the conversation. These reactions included direct responses,
thoughtful responses, clarification requests, and instances where no response was
provided. A summary of the annotation scheme outlining these reactions is pre-
sented next (reflecting the illustration of Figure 6.3):
• No Response: The participant does not reply to the robot, and the robot
continues the conversation.
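The four reactions named in this section can be written down as an ordered scale, which makes the continuum from absolute confidence to complete unresponsiveness explicit. The sketch below is only an illustration: the class and function names are hypothetical, and the evenly spaced numeric score is an assumption made for the example, not a value taken from the annotation scheme.

```python
from enum import IntEnum


class Reaction(IntEnum):
    """Learner reactions, ordered from most to least confident."""
    DIRECT_RESPONSE = 0
    THOUGHTFUL_RESPONSE = 1
    CLARIFICATION_REQUEST = 2
    NO_RESPONSE = 3


def certainty_score(reaction):
    """Map a reaction to [0, 1]; 1.0 is absolute confidence, 0.0 is unresponsiveness.

    The equal spacing between reactions is an assumption of this sketch.
    """
    return 1.0 - reaction / (len(Reaction) - 1)


# Example: rank a few annotated turns from most to least certain.
turns = [Reaction.NO_RESPONSE, Reaction.DIRECT_RESPONSE, Reaction.CLARIFICATION_REQUEST]
ranked = sorted(turns)
```

Using an ordered enumeration rather than a binary label keeps the intermediate reactions (thoughtful responses and clarification requests) distinguishable in any later analysis.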
At this point, it was also interesting to evaluate whether these same interpretations
could be applied to multi-party interactions. Using the data collected
in the initial exploratory experiments, we extended the annotation scheme for
learners’ reactions to uncertainty, incorporating the following:
Figure 6.5: Normalized Confusion Matrix results for (un)certainty detection (Ran-
dom Forest models). BPS: Data before participant’s speech starts, CT: Data of the
complete turn.
6.2. SPEECH RECOGNITION WITH L2 SPEAKERS 43
or certain would be unfair. In particular, aligning with the theory that encapsulates
the Zone of Proximal Development (ZPD) (Vygotsky and Cole, 1978),
effective learning occurs when learners receive guidance and support to tackle
tasks slightly beyond their current independent capability. Consequently,
providing assistance only when a learner is excessively confused or entirely certain
might not be optimal. It is argued then that for a robot to lead conversation
practice effectively, it should be aware of all of these complexities and capable of
fluidly reacting to various degrees of (un)certainty. If this requirement cannot be
fully guaranteed, then different forms of support should be examined.
transcription performance, with a lower Word Error Rate (WER), than those
from second language speakers. However, this difference becomes less obvious
when dealing with utterances in spontaneous speech. In the results from
the CORALL dataset, the only statistically significant difference is found in the
transcriptions generated by Microsoft ASR (L1: 0.36 vs. L2: 0.51, p < 0.05), while
the other ASRs perform equally poorly for both groups (Google L1: 0.41 vs. L2: 0.42 and
Huggingface L1: 0.64 vs. L2: 0.66). These results highlight that WER increases, nearly
doubling, with L2 speakers for read sentences, but that for the spontaneous speech
dataset, the performance of all ASRs deteriorates. The observation that Microsoft
Azure ASR performs better for L1 speakers in conversations compared to Google
and Huggingface may be attributed to its system development, specifically
tailored for conversations. Moreover, upon analyzing word errors, it was discovered
that among the most frequently misrecognized utterances for L2 speakers there
were specific words that signal important requests for assistance from the user,
e.g. “understand” and “repeat”.
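For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch under that standard definition; the function name, the lowercasing and whitespace tokenization, and the example sentences are assumptions for illustration and do not reproduce the evaluation pipeline of Paper C.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a misrecognized request for help.
wer = word_error_rate("can you repeat that", "can you read that")
```

Because the denominator is the reference length, a single misrecognized word in a short utterance such as a clarification request has a large effect on the score.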
In recent years, there has been a substantial advancement in state-of-the-art
speech recognition, notably demonstrated by OpenAI’s Whisper model (Radford
et al., 2023). This model has gained recognition for its substantial
improvements across various benchmarks and its performance in multiple languages.
Google (Zhang et al., 2023) and Meta (Pratap et al., 2023) have also
contributed large models, emphasizing their capability to transcribe diverse
languages as well. Despite these achievements, a notable concern persists regarding
how well these models generalize to data beyond their training
distribution. This raises a dual challenge: languages not fully represented in the
training data may exhibit disparate recognition performance, and the phonetic
variability of these languages could further impact the model’s efficacy. With this
in mind, the performance of speech recognition for second language learners,
particularly with less-resourced languages, is likely to drop below an adequate
level for effective conversation practice.
As a final remark, though a direct one-to-one comparison may not be entirely
applicable, it is intriguing to note the performance metrics for the Swedish lan-
guage using the Whisper model. In Common Voice 9 (Ardila et al., 2019) it
obtains a WER of 10.6% and in FLEURS (Conneau et al., 2023) it stands at
8.5%. Both of these datasets contain only read speech data. Not too far away,
6.3. PATHWAYS TO EXPLORE 45
the best result we obtained in our experiments was 11.1% WER, with the read
speech dataset as well.
48 CHAPTER 7. ENHANCING SPEAKING PRACTICE
Our focus, hence, centers on how social robots could efficiently and naturally
motivate learners to increase the amount of speaking in L2 practice. An important
consideration in motivating students to engage in speaking tasks is whether they
could feel rushed or surprised, particularly when called on unexpectedly during
discussions, a practice usually referred to as “cold-calling” or “random-calling”.
While these methods have been shown
to boost participation among typically quiet students (Dallimore et al., 2013),
some studies highlight concerns about their impact on learners and their potential to
cause anxiety or discomfort (Cooper et al., 2018; Ishino, 2022). Therefore, a key
characteristic of our approach involved using nonverbal cues, powerful in interaction
but subtle enough to avoid inducing anxiety in L2 speakers. Through this
process, different alternatives were evaluated, including backchannels to indicate
when the robot is listening (Skantze et al., 2015), gazing to persuade participants
to consider the robot’s suggestions (Chidambaram et al., 2012), or gaze mixed with
other cues to manage turn-taking (Skantze et al., 2015). Changes in speech,
including intonation (Chidambaram et al., 2012; Kory Westlund et al., 2017), were
also examined, as studies on persuasive vocal tone demonstrated its impact on
compliance (Wainer et al., 2010). Gestures like head nods and facial expressions
were also considered (Saerbeck et al., 2010b).
Due to the strength of their effect in group interactions, and to limit the
analysis to only one cue per study, gaze shifting and backchanneling were selected
to shape interactive dynamics of participants in an L2 group practice activity.
Studies have demonstrated that a robot’s gaze can shape the roles of participants
in a conversation (Mutlu et al., 2012) and different backchannels can contribute
to balanced participation in turn-taking behavior (Skantze, 2017).
of speaking aloud, whispering to oneself, or explaining tasks to someone else, among other forms.
7.2. DIFFERENT PAIRINGS 49
Figure 7.1: Photographs captured during the studies in Paper E and Paper D
depicting pairs of participants engaging in the game Taboo with the robot Furhat.
Both L1 and L2 speakers collaborate to describe a word presented on a table or
on the screen, as illustrated in the right bottom corner. The robot Furhat employs
non-verbal behaviors to balance or encourage participation.
Gazing Behavior: During instances when the participant with the majority
of speaking time was active, the robot consistently distributed its gaze equally
between both participants. On the other hand, when the participant with the
lower speaking time assumed the speaking role, the robot adjusted its gaze
proportionally based on the participants’ relative speaking duration. This
adjustment resulted in allocating more gaze time to the participant with the
The detailed results of these studies can be found in Paper D and Paper E.
This section provides an overview of the most important results. As illustrated
in Figure 7.4a, the imbalance in participation between an L1 and L2 speaker
was notably diminished through the implementation of an adaptive gaze behav-
ior. Remarkably, these positive outcomes persisted even as the game’s difficulty
increased. We further observed that part of the reason for a more balanced inter-
action was the simultaneous decrease in participation from L1 speakers, coupled
with an increase in speaking time from L2 speakers. As a result, there was an
interest in exploring whether additional actions from the robot could exclusively
boost the participation of L2 speakers. Turning to the results from employing an
adaptive backchannel strategy, depicted in Figure 7.4b, it became evident that
the amount of speaking time for L2 speakers did, indeed, significantly increase.
Additionally, there was a slight decrease in speaking time for L1 speakers, which
aligns with the context of the game. The game’s setup imposes limits on total
speaking times due to a maximum game time per word and a semi-fixed number
of game words.
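To make the notion of participation imbalance concrete, one simple way to quantify it is the absolute difference in speaking time between the two players, normalized by their combined speaking time. The sketch below uses this measure purely as an illustration: the function name, the formula and the example durations are assumptions, and Paper D and Paper E may define imbalance differently.

```python
def participation_imbalance(l1_seconds, l2_seconds):
    """Return a value in [0, 1]: 0 means equal speaking time, 1 means only one person spoke.

    Hypothetical measure for illustration; not necessarily the one used in the papers.
    """
    total = l1_seconds + l2_seconds
    if total <= 0:
        raise ValueError("total speaking time must be positive")
    return abs(l1_seconds - l2_seconds) / total


# Made-up durations: the L1 speaker talks less and the L2 speaker talks more
# in the second condition, so the imbalance shrinks.
baseline = participation_imbalance(l1_seconds=90.0, l2_seconds=30.0)
adaptive = participation_imbalance(l1_seconds=70.0, l2_seconds=50.0)
```

A normalized measure of this kind drops both when the dominant speaker talks less and when the quieter speaker talks more, which is why the two effects described above need to be inspected separately.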
Importantly, these results present a highly encouraging outlook on the po-
tential role of social robots, portraying them as a promising force in cultivating
positive (pro-social) environments for human interactions. Especially within the
area of second language learning, the findings suggest that despite the complex-
ities inherent in L2 practice and potential technical limitations that may not
currently offer a robust solution for open practice conversations, there is clear
feasibility in a robot’s ability to orchestrate multi-party interactions to enhance
the practice of a second language.
Furthermore, while exploring the cultural aspects of second language learning,
the interaction between L1 and L2 speakers not only showcased the effectiveness
of social robots in language practice but also revealed a nuanced perspective on
socio-cultural dynamics intricately intertwined with the evolution of social robots.
This intricate interplay accentuates the challenges and opportunities that emerge
in the development of human-robot interactions and underscores their potential
impact on societal development. These considerations are consequently evaluated
more deeply in Chapter 8.
Chapter 8
Cultural Perspectives
In this chapter it is argued that cultural aspects have not been adequately em-
phasized within social robots research, including certain sections of RALL. While
the preceding discussions have primarily addressed the challenges and nuances
associated with implementing robot-led practice conversations and explored the
distinct characteristics that make robots optimal for supporting language prac-
tice in unconstrained settings, there remains a notable gap in the consideration
of cultural perspectives within this discourse. The content of this discussion is
based on Paper F and Paper G.
56 CHAPTER 8. CULTURAL PERSPECTIVES
the robot’s backchannels, although limited subject numbers per category result
in few significant differences.
The demonstrated results highlight that the effectiveness of participation-
adjustment, i.e., tailoring robot backchannels based on the participants’ speaking
contributions, varies across different socio-cultural groups. Consequently, the
formulation, timing, and frequency of backchannels may need to be tailored for
specific socio-cultural groups to achieve the same intended function across diverse
participants. This underscores the importance of adapting robot interactions
based on the dynamic cultural nuances inherent in human interactions.
Figure 8.2: Mean scores for the results from the online survey measuring ease
of Understanding, Naturalness, perceived Competence, Likeability and perceived
English Proficiency in different English accented voices. The ∧ marker indicates voices
selected for the following physical robot study.
possess traits such as being young adults, higher educational level, exposure to
different cultures, and multilingual competence, that have been associated with
cultural open-mindedness and higher acceptance of accented speech (Boduch-
Grabka and Lev-Ari, 2021; Dekker et al., 2021).
Overall, these findings suggest the necessity of considering cultural nuances
in designing social robots and highlight the potential impact of socio-cultural
factors on human-robot interactions. Further research is warranted to explore
these dynamics in diverse settings and populations, ensuring that social robots
are designed and deployed in a culturally sensitive manner.
Chapter 9
Paper Contributions
This chapter outlines the key contributions in the appended papers of the thesis.
Additionally, it describes the authors’ role in each paper.
9.1 Paper A
Uncertainty in Robot Assisted Second Language Conversation Prac-
tice - Ronald Cumbal, José Lopes and Olov Engwall
9.2 Paper B
Detection of Listener Uncertainty in Robot-Led Second Language Con-
versation Practice - Ronald Cumbal, José Lopes and Olov Engwall
60 CHAPTER 9. PAPER CONTRIBUTIONS
9.3 Paper C
“You don’t understand me!”: Comparing ASR results for L1 and L2
speakers of Swedish - Ronald Cumbal, Birger Moell, José Lopes and Olov En-
gwall
9.4 Paper D
Robot Gaze Can Mediate Participation Imbalance in Groups with Dif-
ferent Skill Levels - Sarah Gillet, Ronald Cumbal1 , André Pereira, José Lopes,
Olov Engwall and Iolanda Leite
9.5 Paper E
Shaping Unbalanced Multi-Party Interactions through Adaptive Robot
Backchannels - Ronald Cumbal, Daniel Alexander Kazzi, Vincent Winberg and
Olov Engwall
9.6 Paper F
Socio-cultural perception of robot backchannels - Olov Engwall, Ronald
Cumbal and Ali Reza Majlesi
9.7 Paper G
Stereotypical Nationality Representations in HRI: Perspectives from
International Young Adults - Ronald Cumbal, Agnes Axelsson, Shivam Mehta
and Olov Engwall
9.8 Paper H
Speaking Transparently: Social Robots in Educational Settings - Ronald
Cumbal and Olov Engwall
In order to answer this question, three different elements were explored and
evaluated throughout the course of this thesis. The following discussion outlines
these elements.
66 CHAPTER 10. DISCUSSION AND CONCLUSIONS
Cultural Perspectives
The third point emphasized in this thesis is the imperative for additional research
specifically addressing cultural aspects in human-robot interactions. This work
not only showcased diverse reactions to the same robot behavior based on indi-
viduals’ cultural backgrounds, but also explored the perspectives that a group of
people could hold regarding social robots encoded with different cultural traits.
While the original results indicated that adapting the robot’s backchanneling
strategy could influence participants’ speaking time, a deeper examination re-
vealed distinctive responses among various socio-cultural groups, with a focus on
L1 and L2 aspects. The findings emphasize that among L2 speakers, factors such
as gender, age, extroversion, and familiarity with robots influence how the behavior
of the robot is received when they are encouraged to speak more through backchannels.
It becomes evident that tailoring robot behaviors, such as backchannels, based on
integrating into new societies. However, two significant challenges arose that war-
rant reevaluation of our experimental approach: firstly, the tendency to sample
L2 learners predominantly from WEIRD (Western, educated, industrialized, rich,
and democratic (Henrich et al., 2010)) countries, and secondly, the need to align
research objectives more closely with the needs of the intended end-users.
First, while initial studies in this thesis involved L2 learners from the SFI
program1, representing a less-privileged segment of the immigrant population, over
time, participants were mainly recruited from demographics that to a large extent
fit the WEIRD profile. While the findings remain relevant for studying robot
interactions with L2 learners, the exclusion of immigrants from non-WEIRD
societies may weaken the potential support for sectors of society requiring more
assistance. Notably, significant differences exist among immigrant
populations. For instance, less-privileged immigrants will have varying degrees of
literacy, with some even lacking formal education as children, which should be considered
in research focused on robots supporting education (Blommaert, 2010).
Secondly, although efforts were made to collaborate with educators in designing
robot interactions, insufficient attention was given to involving less-privileged
immigrant communities directly. Given the emphasis on the value of conversations
promoted through this work, it was crucial to engage these communities in
defining the research problem, gathering data, and preparing technical designs.
For instance, it is essential to recognize that issues like racism, discrimination,
equality, and diversity are inherent aspects of the immigrant experience. In this
context, Doyle (2015) highlights significant work demonstrating how group dis-
cussions among students can be used to combat negative comments in the life
and workplace of immigrants. Consequently, it is not unexpected that these top-
ics might surface in social interactions, although the extent to which they would
emerge in conversations involving robots remains uncertain. Here, HRI research
could benefit from recent pedagogical efforts aimed at training teachers to avoid
delegitimizing students with limited or no formal education (Santos and Shandor,
2012; Simpson and Whiteside, 2015), and explore how these strategies could be
applied in human-robot interaction contexts.
Addressing these topics comprehensively could merit a separate thesis,
but they should be integral considerations in studies aiming to support
immigrant communities. HRI research stands to benefit from a thorough
reassessment of the roles social robots can play in society, including a critical
examination of potential power imbalances and their implications for fairness in
research outcomes (Winkle et al., 2023).
Data Collection
During the data collection process for this thesis, careful attention was dedicated
to the handling of video and audio recordings, as well as their storage and
presentation. In every study conducted, subjects were provided with an informed consent
form that clearly communicated the purpose of the study, the nature of the data to be
collected, and the intended use of the recorded materials. Participants were also
granted the option to opt out of allowing the use of their data, including
decisions regarding the utilization of their image or voice in academic publications and
presentations. Moreover, participants were informed about the voluntary nature
of their participation and were assured the option to discontinue the experiment
at any point.
In terms of data storage and sharing, robust measures were implemented to
prevent any improper use. All collected data was securely stored within the
university’s official cloud servers. Access to these servers is exclusively granted
through permissions provided by the research team, and there is a designated
time-frame for other collaborators to access the data. A sample of this consent
form is shown in Figure 10.1.
types associated with gender. One comment suggested that altering terms to
eliminate gender identity might be unnecessary, as our focus is solely on robot
research, implying that such changes would have no significant impact (in soci-
ety). I would like to contest this notion. Expanding upon the sentiment expressed
by Turkle (2007) regarding the development of social robots, where researchers
“are not only building robots, but a robot culture”, I argue that we are indeed
influencing broader societal changes, albeit to varying degrees. Consequently,
throughout my thesis work, I have attempted to reevaluate the role of robots in
education and second language learning, with particular attention to carefully
analyzing how robots can be introduced into these crucial roles. In light of this,
I firmly believe that the way we plan, execute, and present our research holds
considerable importance for society and should not be underestimated.
Zsuzsanna Ittzes Abrams. The effect of synchronous and asynchronous cmc on oral
performance in german. The Modern Language Journal, 87(2):157–167, 2003.
Alicia Adsera and Mariola Pytlikova. The role of language in shaping interna-
tional migration. CReAM Discussion Paper Series 1206, Centre for Research and
Analysis of Migration (CReAM), Department of Economics, University College
London, Feb 2012. URL https://ideas.repec.org/p/crm/wpaper/1206.html.
Nese Alyüz, Eda Okur, Ece Oktay, Utku Genc, Sinem Aslan, Sinem Emine Mete,
David Stanhill, Bert Arnrich, and Asli Arslan Esme. Towards an emotional
engagement model: Can affective states of a learner be automatically detected
in a 1: 1 learning scenario? In UMAP (Extended Proceedings), 2016.
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler,
Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor
Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint
arXiv:1912.06670, 2019.
John Baldwin, Sandra Faulkner, Michael Hecht, and Sheryl Lindsley. Redefin-
ing culture: Perspectives across the disciplines. LEA’s communication series.
Lawrence Erlbaum Associates Publishers, Mahwah, New Jersey, 01 2006.
Muzakki Bashori, Roeland van Hout, Helmer Strik, and Catia Cucchiarini. Web-
based language learning and speaking anxiety. Computer Assisted Language
Learning, 35(5-6):1058–1089, 2022.
Mike Baynham. Agency and contingency in the language learning of refugees and
asylum seekers. Linguistics and education, 17(1):24–39, 2006.
74 REFERENCES
Tony Belpaeme, James Kennedy, Paul Baxter, Paul Vogt, Emiel EJ Krahmer,
Stefan Kopp, Kirsten Bergmann, Paul Leseman, Aylin C Küntay, Tilbe Göksun,
et al. L2tor-second language tutoring using social robots. In Proceedings of the
ICSR 2015 WONDER Workshop, 2015.
Tony Belpaeme, James Kennedy, Aditi Ramachandran, Brian Scassellati, and Fu-
mihide Tanaka. Social robots for education: A review. Science robotics, 3(21):
eaat5954, 2018a.
Tony Belpaeme, Paul Vogt, Rianne Van den Berghe, Kirsten Bergmann, Tilbe
Göksun, Mirjam De Haas, Junko Kanero, James Kennedy, Aylin C Küntay, Ora
Oudgenoeg-Paz, et al. Guidelines for designing social robots as second language
tutors. International Journal of Social Robotics, 10:325–341, 2018b.
Nigel Bosch, Yuxuan Chen, and Sidney D’Mello. It’s written on your face: detect-
ing affective states from facial expressions while learning computer programming.
In Intelligent Tutoring Systems: 12th International Conference, ITS 2014, Hon-
olulu, HI, USA, June 5-9, 2014. Proceedings 12, pages 39–44. Springer, 2014.
Nigel Bosch, Sidney D’Mello, Ryan Baker, Jaclyn Ocumpaugh, Valerie Shute,
Matthew Ventura, Lubin Wang, and Weinan Zhao. Automatic detection of
learning-centered affective states in the wild. In Proceedings of the 20th interna-
tional conference on intelligent user interfaces, pages 379–388, 2015.
S.E. Brennan and M. Williams. The feeling of another’s knowing: Prosody and filled
pauses as cues to listeners about the metacognitive states of speakers. Journal of
Memory and Language, 34(3):383–398, 1995. ISSN 0749-596X. doi: https://doi.org/10.1006/jmla.1995.1017.
URL http://www.sciencedirect.com/science/article/pii/S0749596X85710170.
Joost Broekens, Marcel Heerink, Henk Rosendal, et al. Assistive social robots in
elderly care: a review. Gerontechnology, 8(2):94–103, 2009.
Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy dispar-
ities in commercial gender classification. In Conference on fairness, accountability
and transparency, pages 77–91. PMLR, 2018.
Donn Byrne, William Griffitt, and Daniel Stefaniak. Attraction and similarity of
personality characteristics. Journal of Personality and Social Psychology, 5(1):
82, 1967.
Keith Cameron. Computer assisted language learning (CALL): media, design, and
applications. Swets & Zeitlinger, 1999.
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth
Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning
evaluation of universal representations of speech. In 2022 IEEE Spoken Language
Technology Workshop (SLT), pages 798–805. IEEE, 2023.
Ronald Cumbal, José Lopes, and Olov Engwall. Detection of listener uncertainty
in robot-led second language conversation practice. In Proceedings of the 2020
International Conference on Multimodal Interaction, ICMI ’20, page 625–629,
New York, NY, USA, 2020a. Association for Computing Machinery. ISBN
9781450375818. doi: 10.1145/3382507.3418873. URL
https://doi.org/10.1145/3382507.3418873.
Ronald Cumbal, José Lopes, and Olov Engwall. Uncertainty in robot assisted
second language conversation practice. In Companion of the 2020 ACM/IEEE
International Conference on Human-Robot Interaction, HRI ’20, page 171–173,
New York, NY, USA, 2020b. Association for Computing Machinery. ISBN
9781450370578. doi: 10.1145/3371382.3378306. URL
https://doi.org/10.1145/3371382.3378306.
Nils Dahlbäck, QianYing Wang, Clifford Nass, and Jenny Alwin. Similarity is more
important than expertise: Accent effects in speech interfaces. In Proceedings of
the SIGCHI conference on Human factors in computing systems, pages 1553–
1556, 2007.
Nick Degens, Birgit Endrass, Gert Jan Hofstede, Adrie Beulens, and Elisabeth
André. ‘what i see is not what you get’: why culture-specific behaviours for
virtual characters should be user-tested across cultures. AI & society, 32:37–49,
2017.
S. V. Dekker, J. Duarte, and H. Loerts. ‘who really speaks like that?’ – children’s
implicit and explicit attitudes towards multilingual speakers of dutch. International
Journal of Multilingualism, 18(4):551–569, 2021. doi: 10.1080/14790718.2021.1908297.
URL https://doi.org/10.1080/14790718.2021.1908297.
Ivana Di Leo, Krista R Muis, Cara A Singh, and Cynthia Psaradellis. Curiosity. . .
confusion? frustration! the role and sequencing of emotions during mathematics
problem solving. Contemporary educational psychology, 58:121–137, 2019.
Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shus-
ter, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al.
The second conversational intelligence challenge (convai2). In The NeurIPS’18
Competition, pages 187–208. Springer, 2020.
Sidney K D’Mello, Art Graesser, and Brandon King. Toward spoken human–
computer tutorial dialogues. Human–Computer Interaction, 25(4):289–323, 2010.
Melissa Donnermann, Philipp Schaper, and Birgit Lugrin. Social robots in ap-
plied settings: A long-term study on adaptive robotic tutors in higher education.
Frontiers in Robotics and AI, 9:831633, 2022.
Sandra Doyle. Getting to grips with the english language. In Adult Language
Education and Migration, pages 162–172. Routledge, 2015.
Starkey Duncan. Some signals and rules for taking speaking turns in conversations.
Journal of personality and social psychology, 23(2):283, 1972.
Starkey Duncan. On the structure of speaker–auditor interaction during speaking
turns. Language in society, 3(2):161–180, 1974.
Starkey Duncan and Donald W Fiske. Face-to-face interaction: Research, methods,
and theory. Routledge, 1977. doi: https://doi.org/10.4324/9781315660998.
Sidney D’Mello, Blair Lehman, Reinhard Pekrun, and Art Graesser. Confusion can
be beneficial for learning. Learning and Instruction, 29:153–170, 2014.
Carole Edelsky. Who’s got the floor? Language in society, 10(3):383–421, 1981.
Rod Ellis. The study of second language acquisition. Oxford University, 1994.
Olov Engwall and José Lopes. Interaction and collaboration in robot-assisted lan-
guage learning for adults. Computer Assisted Language Learning, 35(5-6):1273–
1309, 2022.
Olov Engwall, José Lopes, and Anna Åhlund. Robot interaction styles for con-
versation practice in second language learning. International Journal of Social
Robotics, 13(2):251–276, 2021.
Olov Engwall, José Lopes, and Ronald Cumbal. Is a wizard-of-oz required for
robot-led conversation practice in a second language? International Journal of
Social Robotics, 14(4):1067–1085, 2022.
Søren W Eskildsen and Johannes Wagner. Embodied l2 construction learning.
Language Learning, 65(2):268–297, 2015.
Anita Ferreira and John Atkinson. Designing a feedback component of an intelligent
tutoring system for foreign language. In International Conference on Innovative
Techniques and Applications of Artificial Intelligence, pages 277–290. Springer,
2008.
Anita Ferreira, Johanna D Moore, and Chris Mellish. A study of feedback strate-
gies in foreign language classrooms and tutorials with implications for intelligent
computer-assisted language learning systems. Int. J. Artif. Intell. Educ., 17(4):
389–422, 2007.
Samantha Finkelstein, Evelyn Yarzebinski, Callie Vaughn, Amy Ogan, and Justine
Cassell. The effects of culturally congruent educational technologies on student
achievement. In Artificial Intelligence in Education: 16th International Confer-
ence, AIED 2013, Memphis, TN, USA, July 9-13, 2013. Proceedings 16, pages
493–502. Springer, 2013.
Alan Firth. The discursive accomplishment of normality: On ‘lingua franca’ English
and conversation analysis. Journal of pragmatics, 26(2):237–259, 1996.
Bilal Genc and Erdogan Bada. Culture in language learning and teaching. The
reading matrix, 5(1), 2005.
Bart Geurts. Communication as commitment sharing: speech acts, implicatures,
common ground. Theoretical linguistics, 45(1-2):1–30, 2019.
Mary M Gill. Accent and stereotypes: Their effect on perceptions of teachers and
lecture comprehension. Journal of Applied Communication Research, 1994. URL
https://doi.org/10.1080/00909889409365409.
Arthur M Glenberg. Embodiment for education. In Handbook of cognitive science,
pages 355–372. Elsevier, 2008.
Ewa M Golonka, Anita R Bowles, Victor M Frank, Dorna L Richardson, and
Suzanne Freynik. Technologies for foreign language learning: A review of tech-
nology types and their effectiveness. Computer assisted language learning, 27(1):
70–105, 2014.
Charles Goodwin. Conversational organization. Interaction between speakers and
hearers, 1981.
Charles Goodwin. Between and within: Alternative sequential treatments of con-
tinuers and assessments. Human studies, 9(2):205–217, 1986.
Charles Goodwin. Co-operative action. Cambridge University Press, 2018.
Charles Goodwin et al. Restarts, pauses, and the achievement of a state of mutual
gaze at turn-beginning. Sociological inquiry, 50(3-4):272–302, 1980.
Goren Gordon, Cynthia Breazeal, and Susan Engel. Can children catch curiosity
from a social robot? In Proceedings of the tenth annual ACM/IEEE international
conference on human-robot interaction, pages 91–98, 2015.
Goren Gordon, Samuel Spaulding, Jacqueline Kory Westlund, Jin Joo Lee, Luke
Plummer, Marayna Martinez, Madhurima Das, and Cynthia Breazeal. Affective
personalization of a social robot tutor for children’s second language skills. In
Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
Paul Gruba. Computer assisted language learning (call). The handbook of applied
linguistics, pages 623–648, 2004.
Joan Kelly Hall. “Aw, man, where you goin’?”: Classroom interaction and the
development of l2 interactional competence. Issues in Applied linguistics, 6(2),
1995.
Luke Harding. Communicative language testing: Current issues and future research.
Language assessment quarterly, 11(2):186–197, 2014.
J. T. Hart. Memory and the feeling-of-knowing experience. Journal of Educational
Psychology, 56:208–216, 1965.
Heidi Hautopp and Thorkild Hanghøj. Game based language learning for bilingual
adults. In Proceedings of the 8th European Conference on Game-Based Learning.
Reading: Academic Conferences and Publishing International, pages 191–198,
2014.
Joseph Henrich, Steven J Heine, and Ara Norenzayan. Most people are not WEIRD.
Nature, 466(7302):29–29, 2010.
Anna Henschel, Guy Laban, and Emily S Cross. What makes a robot social?
a review of social robots from science fiction to a home or hospital near you.
Current Robotics Reports, 2:9–19, 2021.
Graeme Hirst, Susan McRoy, Peter Heeman, Philip Edmonds, and Diane Hor-
ton. Repairing conversational misunderstandings and non-understandings. Speech
Communication, 15(3):213–229, 1994. ISSN 0167-6393. doi: https://doi.org/10.1016/0167-6393(94)90073-6.
URL http://www.sciencedirect.com/science/article/pii/0167639394900736.
Special issue on Spoken dialogue.
Anna Hjalmarsson. The additive effect of turn-taking cues in human and synthetic
voice. Speech Communication, 53(1):23–35, 2011.
Anna Hjalmarsson, Preben Wik, and Jenny Brusk. Dealing with deal: a dialogue
system for conversation training. In Proceedings of the 8th SIGdial Workshop on
Discourse and Dialogue, pages 132–135, 2007.
Stephen A Hockema and Linda B Smith. Learning your language, outside-in and
inside-out. Linguistics, 47, 03 2009. doi: 10.1515/LING.2009.016.
Zeng-Wei Hong, Yueh-Min Huang, Marie Hsu, and Wei-Wei Shen. Authoring
robot-assisted instructional materials for improving learning performance and
motivation in efl classrooms. J. Educ. Technol. Soc., 19:337–349, 01 2016. URL
https://api.semanticscholar.org/CorpusID:17667686.
Dell Hymes et al. On communicative competence. Sociolinguistics, pages 269–293,
1972.
W Lewis Johnson and Andre Valente. Tactical language and culture training sys-
tems: Using ai to teach foreign languages and cultures. AI magazine, 30(2):
72–72, 2009.
Tatsuya Kawahara, Takashi Yamaguchi, Miki Uesato, Koichiro Yoshino, and Kat-
suya Takanashi. Synchrony in prosodic and linguistic features between backchan-
nels and preceding utterances in attentive listening. In 2015 Asia-Pacific Signal
and Information Processing Association Annual Summit and Conference (AP-
SIPA), pages 392–395, 2015. doi: 10.1109/APSIPA.2015.7415301.
Kobin H Kendrick. The intersection of turn-taking and repair: the timing of
other-initiations of repair in conversation. Frontiers in psychology, 6:250,
2015.
AlBara Khalifa, Tsuneo Kato, and Seiichi Yamamoto. Measuring effect of repetitive
queries and implicit learning with joining-in-type robot assisted language learning
system. In SLaTE, pages 13–17, 2017.
AlBara Khalifa, Tsuneo Kato, and Seiichi Yamamoto. Learning effect of implicit
learning in joining-in-type robot-assisted language learning system. International
Journal of Emerging Technologies in Learning, 14(2), 2019.
Chandra Khatri, Behnam Hedayatnia, Anu Venkatesh, Jeff Nunn, Yi Pan, Qing Liu,
Han Song, Anna Gottardi, Sanjeev Kwatra, Sanju Pancholi, et al. Advancing the
state of the art in open domain dialog systems through the alexa prize. arXiv
preprint arXiv:1812.10757, 2018.
YouJin Kim and Kim McDonough. The effect of interlocutor proficiency on the
collaborative dialogue between korean as a second language learners. Language
teaching research, 12(2):211–234, 2008.
Tetyana Kloubert and Chad Hoggan. Migrants and the labor market: The role and
tasks of adult education. Adult Learning, 32(1):29–39, 2021.
Hanae Koiso, Yasuo Horiuchi, Syun Tutiya, Akira Ichikawa, and Yasuharu Den.
An analysis of turn-taking and backchannels based on prosodic and syntactic
features in japanese map task dialogs. Language and speech, 41(3-4):295–321,
1998.
Jacqueline M Kory Westlund, Sooyeon Jeong, Hae W Park, Samuel Ronfard, Arad-
hana Adhikari, Paul L Harris, David DeSteno, and Cynthia L Breazeal. Flat vs.
expressive storytelling: Young children’s learning and retention of a social robot’s
narrative. Frontiers in human neuroscience, 11:295, 2017.
Maria Kowal and Merrill Swain. Using collaborative language production tasks to
promote students’ language awareness. Language awareness, 3(2):73–93, 1994.
Ming-Mu Kuo and Cheng-Chieh Lai. Linguistics across cultures: The impact of
culture on second language learning. Online Submission, 1(1), 2006.
Sungjin Lee, Hyungjong Noh, Jonghoon Lee, Kyusong Lee, Gary Geunbae Lee,
Seongdae Sagong, and Munsang Kim. On the effectiveness of robot-assisted
language learning. ReCALL, 23(1):25–58, 2011b.
Michael J Leeser. Learner proficiency and focus on form during collaborative dia-
logue. Language teaching research, 8(1):55–81, 2004.
Blair Lehman, Sidney D’Mello, Amber Strain, Caitlin Mills, Melissa Gross, Allyson
Dobbins, Patricia Wallace, Keith Millis, and Art Graesser. Inducing and tracking
confusion with contradictions during complex learning. International Journal of
Artificial Intelligence in Education, 22(1-2):85–105, 2013.
David Kellogg Lewis. Convention: A Philosophical Study. Harvard University
Press, Cambridge, MA, USA, 1969.
Rose Yanhong Li and Mike Kaye. Understanding overseas students’ concerns and
problems. Journal of Higher Education Policy and Management, 20(1):41–50,
1998.
Mei Hui Lim and Vahid Aryadoust. A scientometric review of research trends in
computer-assisted language learning (1977–2020). Computer Assisted Language
Learning, 35(9):2675–2700, 2022. doi: 10.1080/09588221.2021.1892768. URL
https://doi.org/10.1080/09588221.2021.1892768.
Diane Litman, Helmer Strik, and Gad S Lim. Speech technologies and the assess-
ment of second language speaking: Approaches, challenges, and opportunities.
Language Assessment Quarterly, 15(3):294–309, 2018.
Diane J Litman, Carolyn P Rosé, Kate Forbes-Riley, Kurt VanLehn, Dumisizwe
Bhembe, and Scott Silliman. Spoken versus typed human and computer dialogue
tutoring. International Journal of Artificial Intelligence in Education, 16(2):145–
170, 2006.
Zhongxiu Liu, Visit Pataranutaporn, Jaclyn Ocumpaugh, and Ryan Baker. Se-
quences of frustration and confusion, and learning. In Educational data mining
2013, 2013.
Birgit Lugrin, Benjamin Eckstein, Kirsten Bergmann, and Corinna Heindl. Adapted
foreigner-directed communication towards virtual agents. In Proceedings of the
18th International Conference on Intelligent Virtual Agents, pages 59–64, 2018.
Roy Lyster and Leila Ranta. Corrective feedback and learner uptake: Negotiation
of form in communicative classrooms. Studies in second language acquisition, 19
(1):37–66, 1997.
Ali Reza Majlesi, Ronald Cumbal, Olov Engwall, Sarah Gillet, Silvia Kunitz, Gus-
tav Lymer, Catrin Norrby, and Sylvaine Tuncer. Managing turn-taking in human-
robot interactions: The case of projections and overlaps, and the anticipation of
turn design by human participants. Social Interaction. Video-based Studies of
Human Sociality, 6(1), 2023.
Matthew Marge, Carol Espy-Wilson, Nigel G Ward, Abeer Alwan, Yoav Artzi,
Mohit Bansal, Gil Blankenship, Joyce Chai, Hal Daumé III, Debadeepta Dey,
et al. Spoken language interaction with robots: Recommendations for future
research. Computer Speech & Language, 71:101255, 2022.
Marie McAuliffe and Binod Khadria. World migration report 2020. 2019.
Conor McGinn and Ilaria Torre. Can you tell the robot by the voice? an ex-
ploratory study on the role of voice in the perception of robots. In 2019 14th
ACM/IEEE international Conference on human-robot interaction (HRI), pages
211–221. IEEE, 2019.
Hazel Morton and Mervyn A Jack. Scenario-based spoken interaction with virtual
agents. Computer Assisted Language Learning, 18(3):171–191, 2005.
Bilge Mutlu, Takayuki Kanda, Jodi Forlizzi, Jessica Hodgins, and Hiroshi Ishiguro.
Conversational gaze mechanisms for humanlike robots. ACM Transactions on
Interactive Intelligent Systems (TiiS), 1(2):1–33, 2012.
Stanislava Naneva, Marina Sarda Gou, Thomas L Webb, and Tony J Prescott.
A systematic review of attitudes, anxiety, acceptance, and trust towards social
robots. International Journal of Social Robotics, 12(6):1179–1201, 2020.
Clifford Nass, Jonathan Steuer, and Ellen R Tauber. Computers are social actors. In
Proceedings of the SIGCHI conference on Human factors in computing systems,
pages 72–78, 1994.
Clifford Ivar Nass and Scott Brave. Wired for speech: How voice activates and
advances the human-computer relationship. MIT press Cambridge, 2005.
Stephen Neale. Paul grice and the philosophy of language. Linguistics and philos-
ophy, pages 509–559, 1992.
E Oksaar. Language contacts within the scope of culture contacts: Behavioral and
structural models. Philippine journal of linguistics, 14(15):246–252, 1983.
Els Oksaar. Language contact and culture contact: Towards an integrative approach
in second language acquisition research. Current Trends in European Second
Language Acquisition Research. Multilingual Matters, Clevedon, pages 10–20,
1990.
Patricia O’Neill-Brown. Setting the stage for the culturally adaptive agent. In
Proceedings of the 1997 AAAI fall symposium on socially intelligent agents, pages
93–97. AAAI Press Menlo Park, CA, 1997.
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani
Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al.
Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516,
2023.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey,
and Ilya Sutskever. Robust speech recognition via large-scale weak supervision.
In International Conference on Machine Learning, pages 28492–28518. PMLR,
2023.
Natasha Randall. A survey of robot-assisted language learning (rall). ACM Trans-
actions on Human-Robot Interaction (THRI), 9(1):1–36, 2019.
Byron Reeves and Clifford Nass. The media equation: How people treat computers,
television, and new media like real people. Cambridge University Press, 1996.
Natalia Reich-Stiebert, Friederike Eyssel, and Charlotte Hohnemann. Involve the
user! changing attitudes toward robots by user participation in a robot proto-
typing process. Computers in Human Behavior, 91:290–296, 2019.
J Elizabeth Richey, Jiayi Zhang, Rohini Das, Juan Miguel Andres-Bray, Richard
Scruggs, Michael Mogessie, Ryan S Baker, and Bruce M McLaren. Gaming and
confrustion explain learning advantages for a math digital learning game. In
International conference on artificial intelligence in education, pages 342–355.
Springer, 2021.
Celia Roberts and Melanie Cooke. Authenticity in the adult esol classroom and
beyond. Tesol Quarterly, 43(4):620–642, 2009.
Ma Mercedes T Rodrigo, Ryan S Baker, Matthew C Jadud, Anna Christine M
Amarra, Thomas Dy, Maria Beatriz V Espejo-Lahoz, Sheryl Ann L Lim,
Sheila AMS Pascua, Jessica O Sugay, and Emily S Tabanao. Affective and be-
havioral predictors of novice programmer achievement. In Proceedings of the 14th
annual ACM SIGCSE conference on Innovation and technology in computer sci-
ence education, pages 156–160, 2009.
Ma Mercedes T Rodrigo, Ryan SJd Baker, and Julieta Q Nabos. The relationships
between sequences of affective states and learner achievement. In Proceedings
of the 18th international conference on computers in education, pages 56–60.
Universiti Putra Malaysia Malaysia, 2010.
Astrid M Rosenthal-von der Pütten, Carolin Straßmann, and Nicole C Krämer.
Robots or agents–neither helps you more or less during second language acqui-
sition: Experimental study on the effects of embodiment and type of speech
Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. A simplest systematics for
the organization of turn-taking for conversation. Language, 50(4):696–735, 1974.
ISSN 00978507, 15350665. URL http://www.jstor.org/stable/412243.
Martin Saerbeck, Tom Schut, Christoph Bartneck, and Maddy Janse. Expres-
sive robots in education - varying the degree of social supportive behavior of a
robotic tutor. In 28th ACM Conference on Human Factors in Computing Sys-
tems (CHI2010), pages 1613–1622, Atlanta, 2010a. ACM. doi: 10.1145/1753326.
1753567.
Martin Saerbeck, Tom Schut, Christoph Bartneck, and Maddy D Janse. Expressive
robots in education: varying the degree of social supportive behavior of a robotic
tutor. In Proceedings of the SIGCHI conference on human factors in computing
systems, pages 1613–1622, 2010b.
Maricel G Santos and April Shandor. The role of classroom talk in the creation
of “safe spaces” in adult esl classrooms. In LESLLA Symposium Proceedings,
volume 7, pages 110–134, 2012.
Thorsten Schodde, Kirsten Bergmann, and Stefan Kopp. Adaptive robot language
tutoring based on bayesian knowledge tracing and predictive decision-making. In
Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot
Interaction, pages 128–136, 2017.
Sarah Sebo, Brett Stoll, Brian Scassellati, and Malte F Jung. Robots in groups
and teams: a literature review. Proceedings of the ACM on Human-Computer
Interaction, 4(CSCW2):1–36, 2020.
Paul Seedhouse. The case of the missing “no”: The relationship between pedagogy
and interaction. Language learning, 47(3):547–583, 1997.
Paul Seedhouse. Conversation analysis and language learning. Language teaching,
38(4):165–187, 2005.
Margret Selting. On the interplay of syntax and prosody in the constitution of turn-
constructional units and turns in conversation. Pragmatics. Quarterly Publication
of the International Pragmatics Association (IPrA), 6(3):371–388, 1996.
Sofia Serholt, Wolmet Barendregt, Asimina Vasalou, Patrícia Alves-Oliveira, Aidan
Jones, Sofia Petisca, and Ana Paiva. The case of classroom robots: teachers’
deliberations on the ethical tensions. AI & Society, 32:613–631, 2017.
Michihiro Shimada, Takayuki Kanda, and Satoshi Koizumi. How can a social
robot facilitate children’s collaboration? In Shuzhi Sam Ge, Oussama Khatib,
John-John Cabibihan, Reid Simmons, and Mary-Anne Williams, editors, Social
Robotics, pages 98–107, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
ISBN 978-3-642-34103-8.
James Simpson and Anne Whiteside. Adult language education and migration:
Challenging agendas in policy and practice. Taylor & Francis, 2015.
Gabriel Skantze. Exploring human error recovery strategies: Implications for
spoken dialogue systems. Speech Communication, 45(3):325–341, 2005. ISSN
0167-6393. doi: https://doi.org/10.1016/j.specom.2004.11.005. URL
http://www.sciencedirect.com/science/article/pii/S0167639304001256.
Special Issue on Error Handling in Spoken Dialogue Systems.
Gabriel Skantze. Error Handling in Spoken Dialogue Systems : Managing Uncer-
tainty, Grounding and Miscommunication. PhD thesis, KTH, Speech, Music and
Hearing, TMH, 2007. QC 20100812.
Gabriel Skantze. Predicting and regulating participation equality in human-robot
conversations: Effects of age and gender. In Proceedings of the 2017 ACM/IEEE
International Conference on Human-robot Interaction, pages 196–204, 2017.
Gabriel Skantze, Anna Hjalmarsson, and Catharine Oertel. Turn-taking, feedback
and joint attention in situated human–robot interaction. Speech Communication,
65:50–66, 2014.
Gabriel Skantze, Martin Johansson, and Jonas Beskow. Exploring turn-taking cues
in multi-party human-robot discussions about objects. In Proceedings of the 2015
ACM on international conference on multimodal interaction, pages 67–74, 2015.
Matthijs Smakman and Elly A Konijn. Robot tutors: Welcome or ethically ques-
tionable? In Robotics in Education: Current Research and Innovations 10, pages
376–386. Springer, 2020.
Matthijs Smakman, Paul Vogt, and Elly A Konijn. Moral considerations on social
robots in education: A multi-stakeholder perspective. Computers & Education,
174:104317, 2021.
Bernard Spolsky. Communicative competence, language proficiency, and beyond.
Applied Linguistics, 10(2):138–156, 1989.
Robert Stalnaker. Common ground. Linguistics and philosophy, 25(5/6):701–721,
2002.
Sarah Steber and Sonja Rossi. The challenge of learning a new language in adult-
hood: Evidence from a multi-methodological neuroscientific approach. PLOS
ONE, 16:1–23, 02 2021. doi: 10.1371/journal.pone.0246421. URL
https://doi.org/10.1371/journal.pone.0246421.
Neomy Storch and Ali Aldosari. Pairing learners in pair work activity. Language
teaching research, 17(1):31–48, 2013.
Merrill Swain, Sharon Lapkin, Ibtissem Knouzi, Wataru Suzuki, and Lindsay
Brooks. Languaging: University students learn the grammatical concept of voice
in french. The Modern Language Journal, 93(1):5–29, 2009.
Sherry Turkle. Authenticity in the age of digital companions. Interaction studies,
8(3):501–517, 2007.
Rianne Van den Berghe, Josje Verhagen, Ora Oudgenoeg-Paz, Sanne Van der Ven,
and Paul Leseman. Social robots for language learning: A review. Review of
Educational Research, 89(2):259–295, 2019.
Alistair Van Moere. A psycholinguistic approach to oral language assessment. Lan-
guage Testing, 29(3):325–344, 2012.
Paul Vogt, Rianne van den Berghe, Mirjam De Haas, Laura Hoffman, Junko
Kanero, Ezgi Mamus, Jean-Marc Montanier, Cansu Oranç, Ora Oudgenoeg-Paz,
Daniel Hernández García, et al. Second language tutoring using social robots: a
large-scale study. In 2019 14th ACM/IEEE International Conference on
Human-Robot Interaction (HRI), pages 497–505. IEEE, 2019.
Lev Semenovich Vygotsky and Michael Cole. Mind in society: Development of
higher psychological processes. Harvard university press, 1978.
Joshua Wainer, Kerstin Dautenhahn, Ben Robins, and Farshid Amirabdollahian.
Collaborating with kaspar: Using an autonomous humanoid robot to foster co-
operative dyadic play among children with autism. In 2010 10th IEEE-RAS
International Conference on Humanoid Robots, pages 631–638. IEEE, 2010.
Yi Hsuan Wang, Shelley S-C Young, and Jyh-Shing Roger Jang. Using tangible
companions for enhancing learning english conversation. Journal of Educational
Technology & Society, 16(2):296–309, 2013.
Yuko Watanabe and Merrill Swain. Effects of proficiency differences and patterns
of pair interaction on second language learning: Collaborative dialogue between
adult esl learners. Language teaching research, 11(2):121–142, 2007.
J Kory Westlund, Leah Dickens, Sooyeon Jeong, Paul Harris, David DeSteno, and
Cynthia Breazeal. A comparison of children learning new words from robots,
tablets, & people. In Proceedings of the 1st international conference on social
robots in therapy and education, 2015.
Jacqueline Kory Westlund, Goren Gordon, Samuel Spaulding, Jin Joo Lee, Luke
Plummer, Marayna Martinez, Madhurima Das, and Cynthia Breazeal. Lessons
from teachers on performing hri studies with young children in schools. In 2016
11th ACM/IEEE International Conference on Human-Robot Interaction (HRI),
pages 383–390. IEEE, 2016.
Preben Wik, Rebecca Hincks, and Julia Bell Hirschberg. Responses to ville: A
virtual language teacher for swedish. 2009.
Eiko Yasui. Repair and language proficiency: Differences of advanced and beginning
language learners in an english-japanese conversation group. Texas Papers in
Foreign Language Education, 15(1), 2011.
Langxuan Yin, Timothy Bickmore, and Dharma E Cortés. The impact of linguistic
and cultural congruity on persuasion by conversational agents. In Intelligent Vir-
tual Agents: 10th International Conference, IVA 2010, Philadelphia, PA, USA,
September 20-22, 2010. Proceedings 10, pages 343–349. Springer, 2010.
Victor H Yngve. On getting a word in edgewise. In Papers from the sixth re-
gional meeting Chicago Linguistic Society, April 16-18, 1970, Chicago Linguistic
Society, Chicago, pages 567–578, 1970.
Simin Zeng. Second language learners’ strong preference for self-initiated self-repair:
Implications for theory and pedagogy. Journal of Language Teaching and Re-
search, 10(3):541–548, 2019.
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai
Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al. Google usm:
Scaling automatic speech recognition beyond 100 languages. arXiv preprint
arXiv:2303.01037, 2023.
Zheng Zhang, Ryuichi Takanobu, Qi Zhu, MinLie Huang, and XiaoYan Zhu. Re-
cent advances and challenges in task-oriented dialog systems. Science China
Technological Sciences, pages 1–17, 2020.
Part II
Included Papers