
KTH Royal Institute of Technology

Doctoral Thesis in Speech Communication

Robots Beyond Borders


The Role of Social Robots in Spoken
Second Language Practice
RONALD CUMBAL

Stockholm, Sweden 2024


Academic Dissertation which, with due permission of the KTH Royal Institute of Technology,
is submitted for public defence for the Degree of Doctor of Philosophy on Friday the 22nd
March 2024, at 10:00 a.m. in F3, Lindstedtsvägen 26, Stockholm.

© Ronald Cumbal

ISBN 978-91-8040-858-5
TRITA-EECS-AVL-2024:23

Printed by: Universitetsservice US-AB, Sweden 2024


For my Cholo.

Abstract

This thesis investigates how social robots can support adult second language (L2) learners in improving conversational skills. It recognizes the challenges inherent in adult L2 learning, including increased cognitive demands and the unique motivations driving adult education. While social robots hold potential for natural interactions and language education, research into conversational skill practice with adult learners remains underexplored. Thus, the thesis contributes to understanding these conversational dynamics, enhancing speaking practice, and examining cultural perspectives in this context.

To begin, this thesis investigates robot-led conversations with L2 learners, examining how learners respond to moments of uncertainty. The research reveals that when faced with uncertainty, learners frequently seek clarification, yet many remain unresponsive. As a result, effective strategies are required from robot conversational partners to address this challenge. These interactions are then used to evaluate the performance of off-the-shelf Automatic Speech Recognition (ASR) systems. The assessment highlights that speech recognition for L2 speakers is not as effective as for L1 speakers, with performance deteriorating for both groups during social conversations. Addressing these challenges is imperative for the successful integration of robots in conversational practice with L2 learners.

The thesis then explores the potential advantages of employing social robots in collaborative learning environments with multi-party interactions. It delves into strategies for improving speaking practice, including the use of non-verbal behaviors to encourage learners to speak. For instance, a robot's adaptive gazing behavior is used to effectively balance speaking contributions between L1 and L2 pairs of participants. Moreover, an adaptive use of encouraging backchannels significantly increases the speaking time of L2 learners.

Finally, the thesis highlights the importance of further research on cultural aspects in human-robot interactions. One study reveals distinct responses among various socio-cultural groups in interaction between L1 and L2 participants. For example, factors such as gender, age, extroversion, and familiarity with robots influence conversational engagement of L2 speakers. Additionally, another study investigates preconceptions related to the appearance and accents of nationality-encoded (virtual and physical) social robots. The results indicate that initial perceptions may lead to negative preconceptions, but that these perceptions diminish after actual interactions.

Despite technical limitations, social robots provide distinct benefits in supporting educational endeavors. This thesis emphasizes the potential of social robots as effective facilitators of spoken language practice for adult learners, advocating for continued exploration at the intersection of language education, human-robot interaction, and technology.

Keywords: Conversations, gaze, backchannels, multi-party, accent, culture



Sammanfattning (Summary in Swedish, translated)

This thesis investigates how social robots can support adult second language learners in improving their conversational skills in Swedish. Second language learning for adults, particularly in a migration context, is more complex than for children, partly because the conditions for language learning deteriorate with age and partly because the motivations are often different. Social robots have great potential in language education for practicing natural conversation, but little research has so far been carried out on how robots can practice conversation with adult learners. The thesis therefore contributes to understanding conversations between second language learners and robots, improving these conversation exercises, and examining how cultural factors affect the interaction.

To begin with, the thesis investigates how second language learners react when they become puzzled or uncertain in robot-led conversation exercises. The results show that learners often try to get the robot to clarify when they are uncertain, but that they sometimes simply do not respond at all, which means that the robot needs to be able to handle such situations. The conversations between second language learners and a robot have also been used to examine how well leading speech recognition systems can interpret what second language speakers say. The systems have substantially greater difficulty recognizing second language speakers than people with a Swedish background, and they struggle to interpret both Swedish speakers and second language learners in freer social conversations. This must be addressed when robots are to be used in conversation practice with second language learners.

The thesis then investigates strategies for encouraging second language learners to speak more and for distributing the floor more evenly in three-party exercises where two people converse with the robot. The strategies consist of modifying how the robot looks at the two people or gives non-verbal feedback (hums) to signal understanding of and interest in what the learners say.

Finally, the thesis highlights the importance of further research on cultural aspects in interactions between humans and robots. One study shows that factors such as gender, age, previous experience with robots, and how extroverted the learner is affect both how much different people speak and how they respond to the robot's attempts to encourage them to speak more through non-verbal signals.

A second study investigates whether and how preconceptions related to appearance and pronunciation affect how people perceive (virtual and physical) social robots that have been given characteristics (voice and face) that can be associated with different national backgrounds. The results show that people's first impression of a culturally colored robot reflects preconceptions, but that this perception has far less impact once people have actually interacted with the robot in a realistic setting.

A main conclusion of the thesis is that social robots, although technical limitations remain, have clear advantages that can be exploited in education. Specifically, the thesis emphasizes the potential of social robots to lead conversation exercises for adult second language learners and advocates continued research at the intersection of language education, human-robot interaction, and technology.
List of Papers

This dissertation is based on the following published contributions:

A Uncertainty in Robot Assisted Second Language Conversation Practice
Ronald Cumbal, José Lopes and Olov Engwall
Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (2020)

B Detection of Listener Uncertainty in Robot-Led Second Language Conversation Practice
Ronald Cumbal, José Lopes and Olov Engwall
Proceedings of the 2020 International Conference on Multimodal Interaction (2020)

C “You don’t understand me!”: Comparing ASR results for L1 and L2 speakers of Swedish
Ronald Cumbal, Birger Moell, José Lopes and Olov Engwall
Interspeech (2021)

D Robot Gaze Can Mediate Participation Imbalance in Groups with Different Skill Levels
Sarah Gillet, Ronald Cumbal¹, André Pereira, José Lopes, Olov Engwall and Iolanda Leite
Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction (2021)
¹ Shared first authorship

E Shaping Unbalanced Multi-Party Interactions through Adaptive Robot Backchannels
Ronald Cumbal, Daniel Alexander Kazzi, Vincent Winberg and Olov Engwall
Proceedings of the 22nd ACM International Conference on Intelligent Virtual Agents (2022)

F Socio-cultural perception of robot backchannels
Olov Engwall, Ronald Cumbal and Ali Reza Majlesi
Frontiers 2023

G Stereotypical Nationality Representations in HRI: Perspectives from International Young Adults
Ronald Cumbal, Agnes Axelsson, Shivam Mehta and Olov Engwall
Frontiers 2024

H Speaking Transparently: Social Robots in Educational Settings
Ronald Cumbal and Olov Engwall
Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (2024)

Other contributions by the author that are not included in the thesis:

I Is a wizard-of-oz required for robot-led conversation practice in a second language?
Olov Engwall, José Lopes and Ronald Cumbal
International Journal of Social Robotics (2022)

II Identification of low-engaged learners in robot-led second language conversations with adults
Olov Engwall, Ronald Cumbal, José Lopes, Mikael Ljung, and Linnea Månsson
ACM Transactions on Human-Robot Interaction (THRI) (2022)

III Adaptive robot discourse for language acquisition in adulthood
Ronald Cumbal
Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction (2022)

IV Managing turn-taking in human-robot interactions: The case of projections and overlaps, and the anticipation of turn design by human participants
Ali Reza Majlesi, Ronald Cumbal, Olov Engwall, Sarah Gillet, Silvia Kunitz, Gustav Lymer, Catrin Norrby and Sylvaine Tuncer
Social Interaction. Video-based Studies of Human Sociality (2023)

V Let me finish first - The effect of interruption-handling strategy on the perceived personality of a social agent
Ronald Cumbal, Reshmashree B. Kantharaju, Maike Paetzel-Prüsmann and James Kennedy
[Manuscript submitted for publication]

Acknowledgement

I would like to begin by expressing my deepest gratitude to my supervisor, Olov Engwall. I am immensely thankful for the unwavering support over the years, especially during moments when I found it hard to trust in myself. His willingness to support and engage critically with all of my ideas, and his dedication to see life beyond academia, have made a lasting impression on me. Few people have the capacity to influence someone’s life as profoundly as he has influenced mine, and I cannot overstate how grateful I am to have shared this process with him.

During the early stages of this journey, I also received close supervision from José Lopes. His feedback and supportive critical approach were instrumental in shaping the start of my PhD and I cannot thank him enough for his contributions. In the later phases, Ali Reza Majlesi played an important role as well and I am grateful for the opportunities he facilitated and for believing in my work. I would like to thank Iolanda Leite for her insightful feedback and steadfast support, always available regardless of time or place.

I am grateful for the input and rigorous feedback provided by Jens Edlund, especially considering the arduous task of reviewing this thesis. His constructive and critical feedback has been invaluable through various steps in this process.

I would additionally like to thank my family. To my parents, Luis Cumbal and Nancy Guerrón: know that not a day goes by without you in my thoughts. Your love and support have been fundamental at every stage of my life. Likewise, I am deeply grateful to my aunt Iralda Guerrón and my sister Nadia Cumbal for making my life a happy space full of love. Without you, nothing. All the good that I am and all the good that I can aspire to be, I owe to you.

I also want to thank Lorena Gallardo, for sharing the road and growing on it as two. There is no smile fuller than the one I feel and find in you. I cannot forget to thank my friends, Daniel Terán, Javier Llumiquinga, Jonathan Enríquez and Andrés Villareal. Every hug, whether of welcome or farewell, has always stayed in my heart. I give thanks to life for having all of you present.

A big thank you to Sanne van Waveren, Agnes Axelsson, Bram Willemsen, Dmytro Kalpakchi, and Bahar Irfan for the time and experiences we have shared. From countless chats, the highs and lows of life, to the joy of battling this journey together, your friendship has always been a comforting embrace, always bringing a sense of home closer.

I also extend my heartfelt gratitude to James Kennedy, Maike Paetzel-Prüsmann, Reshmashree Kantharaju and the Disney Research team. I cannot fully express my appreciation for the invaluable opportunity provided and the profound impact of collaborating with your amazing group. Additionally, I want to express my sincere thanks to Catharine Oertel, Maria Tsfasman and Morita Tarvirdians for their warm hospitality and for considering me as a member of their group at TU Delft. My appreciation also goes to Patrik Jonell, Dimos Kontogiorgos, Per Fallgren and Mattias Bystedt for their support during the initial steps into the world of research.

Thanks to Jura Miniotaité, Ekaterina Torubarova, and, most recently, David Cabrera, for their wonderful company in the office, making the most out of a short but friendly space. Thanks to Jim O’Regan, Siyang Wang, Shivam Mehta, Harm Lameris, Ambika Kirkland, Charlotte Stinkeste and Anna Deichler for filling this road with many moments of joy and laughter, embracing all aspects of the academic life. Likewise, my deep gratitude to Gabriel Calderon, Katty Gonzáles, Roberto Heredia, Gladys Carrion, Eduardo Álvarez and Nathaly Rea, who have always turned our time into something warmer, like that of our homeland.

I also extend my sincere appreciation to André Pereira, Jonas Beskow, Joakim Gustafson, Giampero Salvi, Gabriel Skantze, Gustav Henter, Bob Sturm, Sten Ternström and Johan Boye for every opportunity used to share their expertise and valuable conversations at various stages of this journey. A very special thanks to Bo Schenkman and David House for always sharing delightful chats with me.

Ronald Cumbal, Stockholm 2024


Acronyms

List of commonly used acronyms:

ASR Automatic Speech Recognition


CA Conversation Analysis
CALL Computer Assisted Language Learning
FSM Finite State Machine
HRI Human Robot Interaction
IVA Intelligent Virtual Agent
L1 First Language
L2 Second Language
NLP Natural Language Processing
RALL Robot Assisted Language Learning
ROS Robot Operating System
SDS Spoken Dialogue System
SLA Second Language Acquisition
STT Speech-to-Text
WER Word Error Rate

Contents

List of Papers
Acknowledgement
Acronyms
Contents

I Overview

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Outline

2 Background: Robot-assisted L2 learning
2.1 Computer Assisted Language Learning
2.2 Robot Assisted Language Learning
2.3 Attitudes, Perceptions and Concerns
2.4 Conclusions

3 Background: Conversation and L2 Learning
3.1 Conversations
3.2 Conversation Practice for L2 Learning
3.3 Interactive Conversational Skills for L2 Learning
3.4 Corrective Feedback in L2 Learning
3.5 Conclusions

4 Background: Culture and L2 Learning
4.1 Interculturality in Conversations
4.2 Cultural Aspects in Social Agents
4.3 Terminology in Cultural Studies
4.4 Conclusions

5 Technical Framework
5.1 Robot Furhat
5.2 Initial Dialogue System
5.3 Taboo Game System
5.4 General Interactive Components

6 Understanding Conversations
6.1 Uncertainty, Confusion or Doubts
6.2 Speech Recognition with L2 Speakers
6.3 Pathways to Explore

7 Enhancing Speaking Practice
7.1 Leveraging Group Dynamics
7.2 Different Pairings
7.3 Balancing and Encouraging Participation

8 Cultural Perspectives
8.1 Cultural Effects on Social Robots
8.2 Cultural Stereotypes and Social Robots

9 Paper Contributions
9.1 Paper A
9.2 Paper B
9.3 Paper C
9.4 Paper D
9.5 Paper E
9.6 Paper F
9.7 Paper G
9.8 Paper H

10 Discussion and Conclusions
10.1 Research Questions and Findings
10.2 Additional Factors for Reflection
10.3 Ethical Concerns
10.4 Personal Reflections

References

II Included Papers

Part I

Overview

Chapter 1

Introduction

1.1 Motivation

In the domain of adult education, whether at the initial stages or for periods of extended learning, there are complexities that make this educational journey different from that of younger learners. One of the most notable differences lies in the motivation that drives adults to embark on this journey. Adults often enroll in new educational phases because of intricate life events, such as changes in their professional trajectories, the desire for career advancement, or, more significantly, evolving migration patterns. Over the past years, there has been a steady increase in the number of individuals relocating to different countries, signaling a shift in population dynamics and societal structures (McAuliffe and Khadria, 2019). Within this evolving landscape, one of the most common educational objectives for adults is learning a second language (L2), as the ability to learn or speak the language of the destination country is one of the crucial components for successful integration into a new society (Adsera and Pytlikova, 2012; McAuliffe and Khadria, 2019). However, learning a second language in adulthood does not come without complications. Primarily, this process presents a significant cognitive challenge for adults, given the considerable reduction in language learning rate that occurs after adolescence (Steber and Rossi, 2021). Additionally, migrant L2 learners may face limited social interaction within the local community (Barraja-Rohan, 2011; Baynham, 2006; Li and Kaye, 1998), which can hinder their learning process. Consequently, providing support for adult learners, especially those who are learning a second language while navigating migration, becomes particularly important for the prosperity and well-being of modern societies (Kloubert and Hoggan, 2021).
In this context, it is essential to identify the areas where adult language learners could benefit most from additional assistance and, in particular, whether technology could offer this support. For this population, there is a greater emphasis on practicing speaking skills to achieve interactive competence, that is, the ability to communicate adequately in social contexts. Traditionally, achieving this goal involves interactions with proficient or first language (L1) speakers, often facilitated through informal activities like language cafés¹ or tandem meetings². However, as mentioned earlier, opportunities to engage in social conversations may not always be readily available, even for L2 learners living in the L2 country (Roberts and Cooke, 2009). Social robots, therefore, emerge as one possible alternative for practicing speaking skills in socially driven conversations.
Social Robotics, a contemporary branch of Human-Robot Interaction (HRI), has significantly advanced our understanding of how robots embedded with social attributes, such as the ability to communicate in natural ways through gaze or gestures, can facilitate seamless communication with their human counterparts. These characteristics have motivated the use of social robots across various applications. In particular, when the form and function of these robots align appropriately with the interaction context (Henschel et al., 2021), evidence demonstrates their potential to strengthen interactions and contribute to the goals of the human-robot relationship, particularly in areas such as healthcare or education (Belpaeme et al., 2018a; Broekens et al., 2009).
In the area of education, whether taking on the roles of peers, assistants, or tutors, social robots are envisioned as promising tools to support the human learning process. Recent findings have provided substantial evidence indicating that educational robots, when designed appropriately for the context of the interaction, can indeed enhance cognitive performance and offer socio-emotional support when interacting with students (Belpaeme et al., 2018a). However, when analysing these contributions, one can easily notice that these findings are skewed toward a mostly young population of students. This trend is also observed in research focused on robot assisted language learning (RALL), as shown in Figure 1.1. This emphasis on children’s language acquisition aligns with the broader scientific interest in early human development and the societal imperative to support early education. Moreover, adults usually expect more technical sophistication in robots’ appearance and interaction (Belpaeme and Tanaka, 2021), a concern less relevant for children. Notably, initiatives like the L2TOR project (Second Language Tutoring using Social Robots), funded by the European Commission’s Horizon 2020 program (Belpaeme et al., 2015; Vogt et al., 2019), have played a crucial role in advancing this field of research.
While the focus on young learners is justified, it leaves a significant gap in our understanding of how social robots can benefit individuals of different ages, particularly adult learners. The limited studies on this population suggest positive outcomes from the use of social robots in education (Donnermann et al., 2022; Saerbeck et al., 2010b; Schodde et al., 2017). Furthermore, certain aspects, such as pronunciation and speaking skills, remain relatively underexplored in RALL compared to more traditional objectives like grammar, vocabulary, and reading practice, as also illustrated in Figure 1.1. The exploration of these less-studied areas partially motivates the work in this thesis; however, as previously described, a more compelling motivation is to support adults in achieving interactive competence by practicing their speaking and conversational skills.

Figure 1.1: Areas of research within robot assisted language learning. Values summarized from the survey of robot-assisted language learning by Randall (2019).

¹ Informal conversations where one or many proficient or native speakers engage with students of a second language, often providing feedback or support during the interaction.
² Sessions in which two students learning each other’s first language collaborate by teaching and practicing their respective target languages.

Nonetheless, the benefits of using social robots in contrast to other platforms, such as computer software (Beatty, 2013) or mobile applications like Duolingo³, may not be immediately clear. Recent advancements in dialogue systems (Dinan et al., 2020; Khatri et al., 2018; Zhang et al., 2020) have certainly improved the capability of different platforms to generate (constrained) spoken interactions. However, social robots maintain a unique advantage compared to other technologies. As analyzed by Van den Berghe et al. (2019), social robots enable interaction within a real physical environment, and their appearance, which is often human-like, allows for a more natural interaction through nonverbal cues like hand gestures. Furthermore, computer- or mobile-based interfaces lack the flexibility to interact with multiple participants without compromising some of their interactive qualities. Robotic platforms, even with a reduced number of degrees of freedom, can effectively reproduce directional communication, including gaze or head motions, which is essential for sustaining group interactions.
Certainly, there are technical constraints that limit the extent to which conversational practice can be replicated with social robots. One notable limitation originates from the performance of Automatic Speech Recognition (ASR) systems. While there is a generally accepted minimal performance level suggested for the proper functioning of these systems (e.g. Microsoft defines a maximum Word Error Rate of 30%⁴), research indicates that reduced performance may not necessarily impede the development of dialogues (D’Mello et al., 2010; Litman et al., 2006), even in situations involving L2 learners (Engwall et al., 2022; Johnson and Valente, 2009). However, errors in recognition might still place learners in an unfavorable position, particularly when feedback is expected. This becomes important as they often need sufficient assistance to tackle challenges throughout a practice conversation. Hence, social robots may still fall short of convincingly reproducing the characteristics of L1 speakers, or educators, who possess the ability to evaluate conversational speaking or provide adequate feedback.

³ https://www.duolingo.com
⁴ https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data
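
As a point of reference for the Word Error Rate (WER) figure mentioned above, WER is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The Python sketch below is illustrative only and is not the evaluation code used in this thesis; the function name `word_error_rate`, the whitespace tokenization and the Swedish example sentence are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which real evaluations usually refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is missing from the recognizer output.
    wer = word_error_rate("jag vill ha kaffe", "jag vill kaffe")
    print(f"WER = {wer:.2f}")
```

Under this definition, a transcript that drops one word out of four has a WER of 25%, which would fall below the 30% threshold cited above. Note that WER can exceed 100% when the recognizer inserts many extra words.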
Nevertheless, this thesis argues that social robots can assume a supportive
role in enhancing settings where learners practice conversational skills. This idea
advocates for:

• Multi-party interactions between L2 learners and a robot: This scenario not only serves as a space for learners to engage in insightful interactions but also allows learners to collaborate in resolving technical limitations, such as incorrect interpretations of a learner’s utterance.

• Interactions among learners of different proficiency levels: These interactions can more authentically replicate real-world conversations, in which the robot can help create a safe environment for L2 learners to participate more actively, for instance, by managing turn-taking.

• The robot as a source of encouragement: Social robots can inspire learners to actively participate in second language practice by motivating them to speak more. For instance, robots can effectively employ verbal and non-verbal cues to encourage participants in the conversation.

• Interactions with proficient speakers, including L1 speakers: This setting allows L2 learners of diverse backgrounds, or L2 learners and L1 speakers, to naturally exchange cultural perspectives during speaking practice. To fulfil its role, the robot must then gain a better understanding of socio-cultural nuances, such as how learners manage non-verbal cues in a conversation.

These propositions serve as the foundation for the thesis and are formalized
in the contributions outlined in the following section.

1.2 Contributions
It seems evident, then, that social robots have distinct characteristics that can substantially support the development of communicative skills in second language practice. However, this proposition raises several questions. This thesis, hence, aims to embark on an initial exploration of the following question:

How can social robots effectively support conversations involving learners practicing a second language?


Figure 1.2: Visual representation of the contributions described in this thesis. For
detailed contributions of each published academic paper, refer to Chapter 9.

Throughout this undertaking, the thesis highlights the theoretical frameworks and technical challenges inherent in supporting dialogues with second language learners. Additionally, it addresses the evolving cultural dynamics inherent in interactions of this nature. The contributions are separated into three main components, as described below, and visually represented in Figure 1.2.

Understanding Conversations with L2 Learners


This thesis starts with an assessment of the behaviors that may emerge in dyadic and multi-party interactions between L2 learners and a social robot. The objective is to understand the capabilities and behaviors that social robots are expected to possess and demonstrate in the context of second language practice. This exploration starts by identifying the types of communication breakdowns that arise in dialogues with second language learners, a topic examined in Paper A and Paper B. These insights are then extended to evaluate how known technological limitations in social robots should be considered when leading these interactions. Here, the work described in Paper C examines the technical constraints of speech recognition. Of particular significance is the recognition that the speech characteristics of second language learners often diverge from a “standard” language distribution, potentially affecting system performance. As a result, the thesis discusses how the autonomy of the robot and its capacity to support conversational skills practice should be evaluated to guarantee optimal conditions for learners.


Enhancing Speaking Practice


Subsequently, considering the advantages that robots present as embodied agents,
this thesis explores how these characteristics can improve the form in which
speaking skills are practiced in a second language setting. The objective is to
motivate learners to participate in second language practice by simply —although
not easily— encouraging them to extend the time they speak. Notably, prompt-
ing L2 learners to speak may pose its own set of challenges. Therefore, the design
of interactions includes two learners engaging with a robot, not only to promote
collaboration but also to offer additional support in navigating uncertainties dur-
ing practice conversations. As robots excel in mediating interactions, shaping
conversational patterns, and inducing group dynamics (Sebo et al., 2020), the
behavior of the robot is designed with non-verbal characteristics that stimulate
the speaking participation of the learners, as evidenced through Paper D and
Paper E. Within this framework, we further explore the advantages of pairing
learners with individuals of differing speaking proficiency, including pairing them
with native speakers. This strategy ensures the authenticity of spoken interactions
and presents an opportunity to foster the development of interpersonal
dynamics among students from different cultural backgrounds.

Cultural Perspectives
This process then presents another aspect that this thesis aims to address: the
cultural richness of second language learning. Undoubtedly, language functions as
a transparent window into one’s cultural background, transforming the process
of learning a new language into a rich exchange of cultural intricacies. While
this phenomenon is well-explored within social sciences, especially in linguistic
studies (Genc and Bada, 2005; Kuo and Lai, 2006), the understanding remains
somewhat limited when it comes to the involvement of social robots. The final
phase of this thesis thus offers a look into how cultural characteristics shape
the interaction in linguistically diverse settings. Through an exploration of this
phenomenon, the objective is to offer insights into the
interplay of cultural nuances in language practice with a social robot, as
evidenced in Paper F. Additionally, the thesis delves into the preconceptions that
individuals may develop regarding culturally-encoded social robots, as illustrated
in Paper G.

1.3 Outline
In the chapters that follow, we will present the three main theoretical frameworks
that have played a crucial role in providing a conceptual structure for understand-
ing how social robots can enhance the practice of a second language with adult
learners. The first framework, presented in Chapter 2, focuses on the elements
surrounding the development and utilization of social robots. This framework explores
the foundations behind human-robot interactions, with a specific emphasis
on multi-party interactions, and examines the development of spoken conversa-
tions involving embodied agents. Moving on to Chapter 3, the second theoretical
perspective describes fundamental concepts of human conversation. It defines
conversational interactions not only as the ultimate goal but also as the pathway
to achieving interactive competence in the process of L2 learning. In Chapter
4, the third framework introduces a structured approach to understanding the
social and cultural dynamics inherent in human-robot interactions. Chapter 5
outlines the main technical frameworks used in the development of the studies
incorporated in this thesis work. Subsequently, Chapters 6, 7, and 8 detail the
evolution of the studies driving this thesis, highlighting key results and providing
additional insights that influenced the planning and execution of these studies.
Chapter 9 details the scientific contributions made through the academic publications
included in this thesis. Lastly, Chapter 10 presents an overview of how
these frameworks and individual contributions collectively reveal the potential
of social robots as effective facilitators of spoken language practice, together with an ethical
assessment of these propositions and the next steps to follow at the intersection
of language education, human-robot interaction, and transparent/interpretable
technology.
Chapter 2

Background: Robot-assisted L2
learning

This chapter explores the evolution of Robot Assisted Language Learning (RALL),
tracing its origins back to studies within Computer-Assisted Language Learning
(CALL). The chapter examines the inherent advantages of social robots, empha-
sizing their role in multi-party interaction. Towards the conclusion, a summary
is provided regarding people’s attitudes and perceptions towards social robots.
These insights will serve as a foundation to inform the suggested role that so-
cial robots can take in enhancing spoken language practice for second language
learners.

2.1 Computer Assisted Language Learning


Early research on computer-mediated language teaching and learning can be
traced back to the late 1970s (Lim and Aryadoust, 2022). Since then, the field has
evolved along diverse paths to support language learning, with contemporary ef-
forts extending to mobile-based applications and web-enhanced computer-assisted
language learning (WELL) (Gruba, 2004). Currently, these diverse efforts fall un-
der the umbrella of computer-assisted language learning (CALL) (Gruba, 2004).
As defined by Cameron (1999), the field seeks to “improve the learning capacity
of those who are being taught a language through computerized means”. As such,
research studies have explored various learning tasks to support learners includ-
ing vocabulary acquisition, pronunciation training, grammar practice, listening
and reading comprehension, development of writing skills, and the fostering of
speaking abilities (Chun, 2011). However, it is worth noting that the latter, in both
the training and assessment of speaking skills, tends to receive less attention in
research; and if these tasks are integrated into applications, the focus often leans
towards constrained tests involving elicited or read speech (Litman et al., 2018).


In the area of conversational practice, spoken CALL systems typically offer
learners more freedom to speak, although with certain limitations on the settings
and scenarios. For example, Morton and Jack (2005) devised a 3D virtual reality
system named SPELL where beginner learners can interact in scenarios like order-
ing food at a restaurant. The system DEAL (Hjalmarsson et al., 2007; Wik and
Hjalmarsson, 2009) used a role-play setting for conversation training in Swedish,
featuring a virtual shopkeeper in a flea market. Notably, the Tactical Language
and Culture Training System (Johnson and Valente, 2009), used with US army
recruits, combined skill-building lessons with free-play dialogue games, allowing
trainees to interact with animated agents representing Iraqi citizens. The system
was extended by Hautopp and Hanghøj (2014) to develop a 3D computer game
for Danish spoken language practice and cultural integration among adult im-
migrants. Despite the progress in providing interactive alternatives for speaking
practice, the efficacy of these solutions remains unclear. In fact, within CALL,
only pronunciation training or chat-based conversations have shown robust sup-
port in demonstrating their efficacy in learning (Golonka et al., 2014). However,
pronunciation training often lacks the interactive quality necessary for the de-
velopment of conversational skills, and chat interactions, through voice or text,
rely on synchronous human-human conversations (Abrams, 2003), thus removing
some of the autonomy benefits from computerized applications.
Furthermore, technical challenges, such as reduced speech recognition per-
formance for L2 speakers or unconstrained environments, pose substantial im-
pediments to the introduction of more interactive activities (Litman et al., 2018).
Nevertheless, there is compelling evidence indicating that the use of technological
devices offers a clear advantage in reducing reported anxiety levels among learners
(Bashori et al., 2022). This interactive quality therefore makes it important to
further explore the application of social robots in language learning.

2.2 Robot Assisted Language Learning


The field of robot assisted language learning (RALL) serves the purpose of supporting
the learning of language expression or comprehension skills through the
use of robotic platforms. As highlighted by Van den Berghe et al. (2019), robots
supporting language learning are believed to possess two key advantages over
other technological forms. First, social robots can simulate social interactions
more realistically, which enhances the practical dimension of learning a (second)
language in real-world contexts. For example, recreating everyday situations like
buying groceries or engaging in casual conversations in a real scenario provides
learners with more immersive experiences. Second, the physical presence of a
robot, often characterized by human-like features and behaviors, contributes to
learning. This is because language is linked to real-life sensorimotor interactions
(Hockema and Smith, 2009), and thus, interactions involving tangible objects can
significantly help vocabulary acquisition (Glenberg, 2008).

From the review conducted by Randall (2019), the overall benefits of having
a robot involved in a language learning process include: (1) robots can aid in
learning when used as an accompaniment to human instruction, (2) they have a
positive effect on learners’ affective states (e.g. confidence, anxiety, and motivation),
and (3) they may offer advantages when used to foster speaking ability. To
optimize these benefits, human-robot interactions must be carefully designed, as
suggested by Belpaeme et al. (2018b). These recommendations include focusing
on meaningful interactions to actively engage the learner, adapting interactions
to the learner and domain, and considering the duration and intensity of the
intervention. Furthermore, Belpaeme et al. (2018b) highlight the importance
of factors such as the robot’s role, type of feedback, and verbal and non-verbal
behaviors. Engwall and Lopes (2022) reaffirm these principles for adult learners.
While the embodiment of the robot may be assumed to inherently benefit learn-
ing, research suggests that evaluating only the effects of embodiment may not
necessarily lead to substantial learning outcomes (Gordon et al., 2015; Westlund
et al., 2015). These principles can be extended to all learning contexts, despite
being primarily derived from research focused on younger participants.

Language Learning and Conversations


In the case of speaking and conversation training, including pronunciation skills,
the research landscape is somewhat limited, as originally illustrated in Figure 1.1
in Chapter 1. Among the few studies in this subarea, the work of Lee et al. (2011b)
showed the effects of elementary students interacting with a robot twice a week for
eight weeks, targeting listening and speaking skill development. Although there
was no impact on listening acquisition, cognitive learning in speaking skills did
increase. Similarly, Wang et al. (2013) investigated how robot companions could
enhance English conversational skills with elementary students. The results indi-
cated improved speaking skills compared to traditional classroom settings, with
learners who used the robot companion showing higher confidence, motivation,
and engagement. On the contrary, Rosenthal-von der Pütten et al. (2016) found
no positive impact on language alignment while examining verbal patterns during
interactions with a Nao robot in virtual and physical forms. The study, however,
acknowledged limitations, including participant fatigue.

Multi-party Conversations
Given the benefits that social robots offer for group conversation, in comparison
to other technologies, there has been research devoted to exploring this setting
within RALL. For instance, Khalifa et al. (2017) presented a joining-in RALL
system with two humanoid robots playing a teacher and an “advanced” peer
role. The interaction between the robots and the learner was designed to smoothly
switch between tutoring and implicit learning. The results revealed that repetitive
queries of specific grammatical expressions prompted by the robot consistently
improved correct usage, with substantially greater improvement when the peer
learner robot provided assistance for implicit learning, compared to scenarios
without robot assistance. In a follow-up work, Khalifa et al. (2019) applied the
same approach to improve practical communication skills. The authors found that
repetitive implicit learning sessions increased appropriate grammatical pattern
usage. Post-presentation of a reference proved as effective as pre-presentation,
especially for retention, functioning as corrective feedback for learners.
However, robots can take a much more influential role in group interactions.
Sebo et al. (2020) highlight the profound impact that robots can have on group
dynamics through their behaviors, their assigned roles within the groups, and
their appearance and capabilities. For example, a robot’s verbal and nonverbal
behaviors can actively shape interactions among group members, extending be-
yond the direct interaction with the robot itself. Importantly, Sebo et al. (2020)
note that the effectiveness of these results is strongly related to how most of these
robots —highly anthropomorphic and with human-like modalities of interaction—
fulfill the role of a human member of the group. These considerations, however,
are beneficial for the purpose of supporting L2 practice conversations in group
settings.

2.3 Attitudes, Perceptions and Concerns


An aspect that may not always receive primary focus in research on social robots
is people’s attitudes or perceptions towards these robots. Naneva et al. (2020)
provide a thorough analysis of these factors by considering various categories of
users’ perceptions, including the type of human-robot interaction, application,
design, and geographical location. As the authors discuss, results tend to vary
considerably across categories, partly because of the small number of studies
in each category, but also because of the inherent differences in domains and
settings within studies. Moreover, the majority of these studies focus on children’s
education, highlighting a need for further research in adult education.
When considering the role of robots in education, it is essential to prioritize
attention towards students and teachers as the main end-users, along with other
stakeholders like parents, policymakers, and representatives from the robot in-
dustry (Smakman and Konijn, 2020; Smakman et al., 2021). While there is
general agreement among these parties that social robots can be engaging and
have the potential to motivate children, teachers express concerns about the latter
in the long term (Smakman et al., 2021). Research further indicates that both
students and teachers generally maintain neutral attitudes towards the integra-
tion of robots into learning and teaching processes (Reich-Stiebert et al., 2019).
However, it is particularly important to address teachers’ concerns, given the sub-
stantial influence of their positive beliefs about technology and its implementation
in classrooms (Blackwell et al., 2013). Concerns about data privacy and manage-
ment are also shared by most stakeholders (Serholt et al., 2017; Smakman et al.,
2021), which pose a challenge, particularly in light of research demonstrating the
benefits of personalized robot interaction for learning processes (Gordon et al.,
2016). Although teachers see value in using robots as an additional teaching tool,
albeit one less suited for novel concepts (Serholt et al., 2017), they often voice worries
regarding potential disruptions to their teaching methods, increased workload,
and the fear of diminishing interpersonal connections due to robot involvement
(Reich-Stiebert et al., 2019). Prolonged exposure to robots in the classroom may
lead to slight modifications in these views (Westlund et al., 2016). Finally, the
perception of the role robots should play also varies among stakeholders. Children
exhibit diverse preferences, viewing robots as potential friends, tutors, rivals, or
even servants, while parents may see them as companions or simply as mechanical
tools (Smakman et al., 2021).

2.4 Conclusions
Existing computer-assisted and robot-assisted language learning systems often
offer learners constrained speaking interactions which may not fully address the
development of broader conversational skills. Technical challenges, including re-
duced speech recognition performance for L2 speakers, can further impede the
introduction of more interactive activities. Despite these obstacles, the reported
reduction in anxiety levels and the positive impact on learners’ emotional states
highlight the potential benefits of these interactive technologies. Particularly,
considering the advantages associated with social robots in facilitating insightful
real-world interactions, they emerge as a promising option for advancing language
learning, with a focus on enhancing speaking and conversational skills.
Chapter 3

Background: Conversation and L2
Learning

Since practice conversations are a central focus of this work, it is crucial to begin by
establishing foundational concepts for analyzing human conversations. In doing
so, this thesis adopts the perspective of Conversation Analysis (CA) (Seedhouse,
2005), using it not as a methodology, but rather as a framework for interpreting
social interactions. Within this framework, this chapter highlights key factors
contributing to fluent spoken communication and how these elements are used
to understand and support the process of conversation practice with L2 learners.
Finally, in line with the underlying motivation of this work, this chapter provides
a comprehensive background on communicative and interactive competence as
the central objective in learning a second language for adult (migrant) learners.

3.1 Conversations
Building upon the background presented in Chapter 2, which discusses how robots
can improve conversational practice, particularly in multi-party settings, this sec-
tion introduces the concepts of common ground, turn-taking, and conversational
cues. A special focus is directed towards potential challenges arising in the form
of speaking or understanding during spoken interactions, as these challenges are
expected to manifest themselves in conversations with L2 learners.

Common Ground
Paul Grice, in his 1967 William James lectures, introduced the idea of “common
ground” —without explicitly using this term— that would become central to
the field of pragmatics (Geurts, 2019). As originally proposed, common ground
refers to the shared knowledge, beliefs, and assumptions between interlocutors
that facilitate successful communication. Grice emphasized that speakers convey
meaning beyond the literal interpretation of their words, noting the importance
of cooperative principles and conversational maxims to communicate efficiently
(Neale, 1992). Over time, this concept has been explored from various angles, such
as Lewis’s “common knowledge” (Lewis, 1969) and Schiffer’s “mutual knowledge”
(Schiffer, 1972). Clark and Schaefer (1989) defined it as the mutual agreement
among conversation participants that they have understood each other “to a
criterion sufficient for current purposes”.
Stalnaker (2002) coined the term common ground, describing it as the sum
of interlocutors’ mutual, common, or joint beliefs and knowledge. Once informa-
tion achieves common ground status, participants need not invest further efforts
into redefining or clarifying it (Knutsen and Le Bigot, 2012). Nonetheless, com-
plications may occur in the form of misunderstandings and non-understandings
(Hirst et al., 1994), where the former denotes an incorrect interpretation of the
speaker’s intention, while the latter signifies a complete absence of, or minimal confidence
in, any interpretation (Skantze, 2007). In such instances, problems are
usually corrected through repair mechanisms initiated by the participants of the
conversation as the dialogue unfolds. Preferences regarding which interlocutor is
expected to manage these repairs vary (Skantze, 2005), although there is a slight
inclination towards self-repair. Self-repairs occur when a speaker spontaneously
corrects or revises their own utterance during a conversation, while other-repairs,
on the other hand, involve one participant in a conversation initiating a correction
or clarification for something said by another participant. For example, Kendrick
(2015) showed that the time interval before other-initiated repairs is longer than
the typical delay in turn-taking, suggesting a deliberate communicative act aimed
at prompting the speaker to engage in self-repair.
Understanding common ground and the mechanisms for resolving miscommunications
is essential in the proposed context of L2 speaking practice. As will be
elaborated in Chapter 6, based on the results of Paper A and Paper B, learners may
request clarification from a robot but could also choose to remain silent, either
because they are trying to decode what the robot said or because this reaction
could signal an implicit request for clarification.

Turn-taking
Moving forward, the concept of turn-taking encompasses the dynamics sur-
rounding the organization of speaking turns in a conversation. Among the earli-
est models proposed, Sacks et al. (1974) suggested that the coordination of turns
is not pre-planned but evolves dynamically during the dialogue. Their model
primarily adheres to the principle of “one party speaks at a time” even though
transitions can easily, and frequently, occur without a gap, or even with overlap.
Alternative models have instead emphasized the overlapping nature of conversa-
tions, whereby interlocutors can develop “more than one floor” and take turns
as deemed functionally appropriate (Edelsky, 1981; Schegloff, 2000). Notably,
participants in a conversation can use various cues and signals to indicate when a
transition to a different speaker is suitable or contextually justified. For instance,
Duncan (1972) demonstrated how speakers use a combination of prosody, syntax,
and gestures to indicate whether they intend to keep or yield the conversational
turn. Cues associated with yielding the turn include shifts in pitch, a decrease
in loudness, and the termination of hand gestures (Hjalmarsson, 2011; Skantze et al.,
2014). In contrast, a flat or sustained pitch contour suggests a desire to retain the
floor (Koiso et al., 1998; Selting, 1996). While there are conflicting findings on
the effect of some of these cues, these inconsistencies might stem from variations
among dialects, languages, and dialogue contexts (Hjalmarsson, 2011).
The exchange of speaking turns is not just crucial for keeping conversations
flowing smoothly; these mechanisms can also be leveraged by a robot participating
in a conversation to steer the direction of the overall interaction. Non-verbal
signals, in particular, are an important factor in the behavior that this thesis
proposes for the robot. As will be detailed in Chapter 7, non-verbal behaviors
of the robot have the power to shape the dynamics of group interaction, and this
quality can be exploited to enhance conversation practice with L2 learners.

Backchannels
The utilization of conversational cues to shape an interaction is not limited to
the participant holding the speaking turn. In what is usually referred to as active listening,
participants in a conversation often use verbal (e.g. “yeah”, “uh-huh”)
and non-verbal (e.g. head nods or smiles) cues to demonstrate attention to the
speaker. The terminology for these signals tends to differ across literature, in-
cluding “listener responses” (Dittmann and Llewellyn, 1968), “accompaniment
signals” (Kendon, 1967) and “backchannels” (Yngve, 1970). In this work, we
use the latter term: backchannel. As described by Wolf (2008), early research
—focused primarily on American English— proposed a discrete classification of
backchannels, starting as short messages to indicate interest and attention (Yn-
gve, 1970) and further developed to include clarification requests, sentence com-
pletions, brief restatements, and nonverbal messages (Duncan, 1974; Duncan and
Fiske, 1977). Schegloff (1982) further suggested that backchannels “must instead
be analyzed in view of their interactive functions within discourse”, proposing
them to be “continuers” and having a regulative function (Wolf, 2008). However,
in practice, their interpretation, from the perspective of the listener, can be eas-
ily modified due to differences in timing and form (Kawahara et al., 2015). For
instance, the English expressions “oh” or “okay” used at the end or toward the
end of a turn might be interpreted as an attempt to take the floor or even signal
the end of the sequence (Goodwin, 1986; Schegloff, 1982). Conversational cues,
in this manner, assume a multifaceted role, impacting not just the flow of the
dialogue but also molding the dynamics of participation and engagement within
the interaction. These cues are the focus of Paper E, which will be discussed in
Chapter 7.

Gaze
Expanding on non-verbal cues, gaze occupies a distinct role in shaping conver-
sations further. As eloquently summarized by Mutlu et al. (2012), gaze serves
as a valuable signal for defining conversation roles (Goodwin, 1981), facilitat-
ing turn-taking, and providing information on the speaker’s discourse (through
gaze shifts primarily). This thesis explores the first two of these functions. In
the determination of participants’ roles within a conversation, especially in those
involving more than two individuals, interlocutors use gaze directed towards spe-
cific participants to clarify who is being addressed (Goodwin, 1981; Sacks et al.,
1974; Schegloff, 1982). The absence of this mechanism, i.e. not gazing towards an
intended addressee, may potentially lead to breakdowns in the organization of the
conversation (Schegloff, 1982). Gaze also serves as a crucial cue in aiding turn-taking,
providing clarity on which speaker holds the turn (Kendon, 1967) or facilitating
turn exchanges, whether through simple single-floor turns (Goodwin, 1981;
Goodwin et al., 1980; Sacks et al., 1974) or with overlapping speech (Schegloff,
2000). Chapter 7 will describe the exploration of an adaptive robot gaze behavior
presented in Paper D.

3.2 Conversation Practice for L2 Learning


Attaining spoken proficiency in a second language is widely recognized as a complex
undertaking within the overall learning process. The complexity
partly derives from the multifaceted challenges embedded in the components re-
quired for achieving fluent L2 speaking. These components include mastering
new phonetic ranges, acquiring an extensive vocabulary, and understanding the
appropriate use of the language. Particularly within the context of conversa-
tion practice, communication problems can stem from both comprehension and
production aspects (Sacks, 1992). Problems in production are often defined us-
ing two paradigms: the Feeling of Knowing (FOK) paradigm (Hart, 1965) and
Feeling of Another’s Knowing (Brennan and Williams, 1995). In these scenar-
ios, the speaker struggles to recall or produce an idea, even if its concept is well
understood. Conversely, comprehension troubles occur when a learner fails to
understand the information conveyed by their conversational partner (Cumbal
et al., 2020a). Research indicates that these problems can be addressed in a
manner similar to how L1 speakers handle them (Zeng, 2019). However, these
findings are typically associated with classroom conditions, where the form of
repair may be linked to the type of activity, such as language-centered versus
content-centered (Kasper, 1985). The proficiency level of the learner is also a
critical factor. Yasui (2011) observed conversations among Japanese learners in
a US university, finding that advanced learners tend to prefer self-repairs, while
beginners prefer to be corrected (i.e. other-repairs), likely due to their limitations
in language competence.

3.3 Interactive Conversational Skills for L2 Learning


Through the process of practicing conversations in a second language, and as
argued in the motivation of this work, the primary goal for adult learners is to
achieve communicative or interactive competence. Communicative Competence,
as proposed by Hymes et al. (1972), expands upon Chomsky’s Linguistic Com-
petence (Chomsky, 1965) by addressing the crucial aspect of appropriacy. This
involves not only recognizing well-formed sentences but also understanding their
appropriate use in specific contexts (Spolsky, 1989). Hymes argues that assessing
a learner’s speaking proficiency should go beyond evaluating grammatical and
phonetic correctness to include the appropriateness of responses in social inter-
action (Hymes et al., 1972).
In the scope of L2 learning, this concept is further elaborated as Interactive
Competence (IC), i.e. an individual’s ability to effectively engage in conversa-
tional interactions. Interestingly, despite its significance in Conversation Analysis
(CA) research, the L2 learning community initially hesitated to adopt this per-
spective when teaching conversational skills (Barraja-Rohan, 2011). However, the
exploration of teaching interactional competence gained momentum with Kram-
sch’s pioneering work (Kramsch, 1986). Oksaar (1983, 1990) expanded on this
concept, delving into the complexities of interactional competence and emphasiz-
ing the influence of culture in multilingual interactions. Additionally, Hall (1995)
focused on the repetitive and goal-directed elements of conversational practices
that contribute to social cohesion, highlighting the significance of pragmatic
competence within communicative competence. More recently, Kasper (2006),
who originally supported the explicit teaching of pragmatic functions, has since proposed
examining CA for Second Language Acquisition (SLA). This shift reflects
the evolving understanding of the role of conversational competence in language
acquisition.
In the context of classrooms, the interactive aspect has gained a general pref-
erence when teaching and evaluating L2 spoken practice. These practices often
involve activities like oral storytelling or role-playing, situated within a target use
domain (Van Moere, 2012), in this way emphasizing the “real-life” nature of
these tasks (Harding, 2014). However, practicing, and especially evaluating, communicative
skills requires some sort of interaction, which prompts the question of
how one can control the expected irregularity of the interactions (Harding, 2014).

3.4 Corrective Feedback in L2 Learning


When studying the extent of corrective feedback in L2 learning, there is a general
understanding that this area is highly complex (Ellis, 1994). There is no con-
crete agreement as to whether different feedback strategies are superior to others,
given the wide variety of approaches teachers may take to address student errors
(Chaudron, 1977; Lyster and Ranta, 1997; Seedhouse, 1997). Indeed, as summarized
by Ferreira et al. (2007), there are several factors affecting the effectiveness
of these strategies, including the “specific aspects of language being corrected,
conditions relating to the provision of teacher correction, and characteristics of
the students” (e.g., considering the difference among proficiency levels). Further-
more, educators employ various strategies within corrective feedback, including
the repetition of errors, reformulation of all or part of the student’s answer (re-
cast), explicit correction, or providing the correct answer when uncertain (Ferreira
and Atkinson, 2008). When teachers aim to solicit a response from students, they
may question the correctness of the student’s utterance, request clarification
when an utterance is ill-formed (soliciting a repetition or reformulation), or
directly elicit a correction by pausing to allow the student to complete the
reformulation (Ferreira and Atkinson, 2008). These findings underscore the in-
tricate nature of corrective feedback in L2 learning and highlight the importance
of considering various factors when implementing feedback strategies in language
education.
Implementing appropriate feedback from the robot poses a challenge due to
these complex requirements. In the context of this thesis work, this challenge
is amplified as the robot leads the practice conversations. Therefore, we suggest
refraining from providing specific feedback in this setting. Instead, a multiparty
setting may present a potential solution, allowing peers to
fulfill the role of providing some (communicative) feedback.

3.5 Conclusions
The theoretical frameworks that underlie the learning of communicative skills
in a second language create a compelling need for the development of interac-
tions that allow participants to engage in practice conversations. However, this
requirement encounters challenges when translated into interactive technologies.
As demonstrated in the preceding sections, the use of practice conversations in a
second language involves navigating various dimensions, encompassing compre-
hension and production challenges faced by language learners.
An additional aspect that arises is that of socio-cultural factors within the
domain of L2 learning. Therefore, it becomes crucial to incorporate this contex-
tual dimension into the development of the thesis work. These topics are further
discussed in the next chapter.
Chapter 4

Background: Culture and L2 Learning

In the preceding chapters, this thesis has described the role of language as a
path for exploring the cultural dimensions inherent in L2 learning. As Baldwin
et al. (2006) highlight, the term “culture” embodies a complex and multifaceted
concept, encompassing a variety of collective beliefs, values, customs, practices,
behaviors, and social institutions defining a specific group of people. Culture
should, furthermore, be recognized as a dynamic concept, evolving over time in
response to internal and external factors. Building on this foundation, the current
chapter explores critical elements that clarify how cultural backgrounds impact
the dynamics shaping spoken interactions and the subsequent development of
social robots within this context.

4.1 Interculturality in Conversations


Early Conversation Analysis (CA) primarily examined conversations among adults
of the same culture and language, typically English (Firth, 1996). This ap-
proach assumed that analysis should be conducted by someone with co-membership
in the linguistic-cultural community of the conversation participants (Firth, 1996).
As research evolved, it expanded to incorporate an intercultural perspective, ac-
knowledging the complexities of a globalized world. However, the exploration of
“intercultural” interactions extends beyond the mere presence of speakers from
diverse lingua-cultural backgrounds. Ethnomethodological approaches emphasize
demonstrating how local social structure and context significantly shape conver-
sation dynamics, including practical attributions of knowledge, competencies, and
sensitivity to lingua-cultural differences (Arano, 2019).
When the focus of CA aligns with the objectives of second language learn-
ing, the dynamics of these interactions become more transparent. Strategies to
address troubles in talk have been studied in L1–L2 interactions, revealing dif-
ferences in the management of troubles. For example, L2 speakers may utilize
additional semiotic resources such as gestures to handle challenges (Eskildsen
and Wagner, 2015). Kushida (2011) demonstrated that Japanese speakers may
provide candidate interpretations not only to confirm understanding but also to
assist in speaking.
The rights and obligations associated with repair actions, as well as the deliv-
ery methods, are subject to attunements of cultural practices, particularly evident
in L2 interactions where social identities of expert-novice may be assigned through
other-initiated repairs. Hence, spoken language should be viewed not merely as
a tool for communication but as part of a complex interplay of diverse
interactional resources, encompassing both verbal and bodily actions, within the
repertoire of human learners (Goodwin, 2018). This perspective also applies to
the potential utilization of these resources by social robots.

4.2 Cultural Aspects in Social Agents


In the scope of social agents, culture manifests itself through perspectives, atti-
tudes, and assumptions related to the appearance, sound, and behavior of virtual
agents or physical robots. The effect of cultural backgrounds on technology ac-
ceptance is crucial, particularly for social agents embodying varying social norms.
Cultural adaptation in virtual agents, as noted by O’Neill-Brown (1997), consid-
ers motivation, interaction style, and the use of verbal and non-verbal expressions
in shaping user perceptions.
The field of Intelligent Virtual Agents (IVA) has witnessed substantial re-
search on cultural connotations, facilitated by the ease of conveying different
multimodal characteristics with virtual agents. Here, it is common to explore the
effects of the Similarity Principle, proposed by Byrne et al. (1967), which suggests
that individuals are more likely to develop a liking for interaction partners they
perceive as similar in characteristics, beliefs, values, or attitudes. Additionally,
the Computers As Social Actors (CASA) paradigm, introduced by Nass et al.
(1994), proposes that people treat computers as social entities, influencing their
perspectives and perceptions in interactions. Consequently, cultural similarities
are generally perceived to enhance interactions with agents.
Research has explored whether participants adapt to a virtual agent’s cultural
behavior, both verbally and non-verbally, with studies showing a higher level of
adaptation in non-verbal behaviors (Lugrin et al., 2018). In contrast, research
has also examined how virtual agents can adapt to participants’ background. For
example, drawing on work by Gill (1994), virtual agents were designed to match
the speaking style of African American children, leading to clear improvement
in the children’s performance when interacting with a culturally matched virtual
peer (Finkelstein et al., 2013).
Speech, as a primary mode of communication, further plays a significant role
in how individuals perceive cultural backgrounds. This phenomenon extends to
interactions with computerized systems, where accents can serve as strong cues
in how users perceive computers as social actors (Dahlbäck et al., 2007; Nass and
Brave, 2005; Reeves and Nass, 1996). Accents in speech, furthermore, contribute
to the categorization of a speaker’s background, affective state and identity, with
accent often prevailing over physical appearance (Krenn et al., 2017; McGinn and
Torre, 2019). For example, in a study involving virtual agents, Khooshabeh et al.
(2017) found that accents in American English were perceived as foreign by indi-
viduals who did not share the agent’s simulated mixed background but increased
perceived shared social identity among those who shared a mixed background.
Indeed, the design of culturally-diverse social agents requires expertise from
various disciplines, including cross-cultural psychology and computer science (De-
gens et al., 2017). While stereotypes based on features like voice and appearance
influence people’s perceptions of social agents, often aligning with their own social
beliefs, it is essential to acknowledge that stereotypes can carry negative connota-
tions. At times, researchers may inadvertently introduce their own biases into the
design process, making it challenging to recognize this phenomenon (see Yin et al.,
2010). Hence, the design of culturally rich agents should be subject to strict
evaluation and comprehensive analysis.

4.3 Terminology in Cultural Studies


In discussing cultural aspects, terms like “nationality”, “ethnicity” and “race”1
are often used interchangeably. This interchangeability is notably prevalent in
technical research, where these terms are often considered synonymous, particu-
larly in studies addressing bias or stereotypes in vision-based technology. How-
ever, recent research proposes enhanced guidelines for addressing bias in image
datasets, advocating for the more precise label of “skin-type” to measure diversity
effectively (Buolamwini and Gebru, 2018). This ongoing discourse also extends
to speech-based technology, where accents or dialects are utilized interchange-
ably as dimensions associated with ethnicity or nationality. The work presented
in Paper G explores these concepts in more depth, first defining the terms that
denote an individual’s cultural background. As previously noted, investigating
the cultural dimensions of social agents, like social robots, requires appropriate
precautions in approaching these subjects, including careful consideration of the
terminology.

1 Many international bodies, including the UN General Assembly and the European Union,
dismiss the concept of distinct human races. Here it is only used to exemplify erroneous terms in
various academic works.

4.4 Conclusions
The exploration of culture in the context of second language learning and social
robotics reveals the intricate interplay of linguistic, social, and cultural dimen-
sions. Language, as a proxy for culture, extends beyond communication, involving
practical attributions and social sensitivities. Social robots, influenced by cultural
perspectives and user attitudes, must navigate the complexities of diverse cultural
backgrounds. This exploration emphasizes the need for adaptive approaches, rec-
ognizing the dynamic nature of culture and its multifaceted influence on human-
robot interactions. As technological advancements continue, understanding and
addressing cultural nuances become imperative for the successful integration of
social robots in diverse societal contexts.
Chapter 5

Technical Framework

The contributions presented in this thesis work have all required substantial de-
velopment of technical frameworks that allowed for the generation of multi-party
interactions with a robot. This chapter presents the main components used to
develop these frameworks.

5.1 Robot Furhat


In the course of this thesis, the only robotic platform employed in all conducted
experiments was the Furhat robot. Developed by Furhat Robotics1, this robot is
designed to emulate a human-like torso, with its primary goal being the facilita-
tion of social interactions. Its distinctive feature lies in the back-projected face
mask, positioned over a three-degrees-of-freedom neck, that allows for versatility
in both appearance and expressive features, such as facial expressions and head
nods. Additionally, the company provides a virtual version of the robot; both
versions are displayed in Figure 5.1.
The platform incorporates various models of computer vision and natural lan-
guage processing to facilitate seamless spoken interactions with humans. These
models utilize sound input, either through a built-in microphone or a connectable
directional microphone array, as well as visual input through a wide-lens camera.
Noteworthy functionalities of these models include automatic lip-synching, face
detection and recognition, and natural gazing (including natural saccades). Addi-
tionally, the platform includes modules supporting speech recognition and speech
synthesis from popular commercial services. All these components are carefully
designed to foster engagement with users in a manner that feels entirely natural,
rendering it applicable across diverse domains like customer service, education,
and entertainment.

1 https://furhatrobotics.com/

Figure 5.1: Photos taken during experiments and demos with the robot Furhat
(left) and virtual version (right).

Development with the Furhat robot is facilitated through their proprietary
Software Development Kit (SDK)2. This comprehensive SDK not only provides a
virtual version of the robot for testing, but also serves as a robust tool for interac-
tion design. Programming skills, i.e., creating and customizing applications for
the robot, is done with the support of the Kotlin Skill API, using a
state-machine-like conversational framework. However, throughout this thesis, the primary
mode of programming the robot involved using the Furhat Remote API3. Func-
tioning as a Python-based bridge, this API enables developers to connect with
the robot, access (some) internal processing information, and send commands to
both its physical and virtual versions. Throughout the course of this thesis
work, this API was used to develop all of the robot’s behaviors within a separate
Python framework. Subsequently, these implementations were transitioned to
the ROS framework, as detailed in Section 5.3. The Furhat Remote API played
a pivotal role in facilitating a broader range of flexibility in implementing these
behaviors.
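
As a rough illustration of this style of remote control, the sketch below wraps two robot commands behind a small helper. It assumes the third-party furhat-remote-api Python package and its FurhatRemoteAPI client with say and gesture methods; the helper names (connect, greet), the host name, and the chosen gesture are hypothetical and are not taken from the thesis code.

```python
def connect(host="localhost"):
    """Open a connection to a physical or virtual Furhat.

    Assumption: the furhat-remote-api package is installed and the Remote API
    is enabled on the robot; "localhost" is only a placeholder host.
    """
    from furhat_remote_api import FurhatRemoteAPI  # third-party, imported lazily
    return FurhatRemoteAPI(host)


def greet(robot, text="Hello, nice to meet you!", gesture="Smile"):
    """Play a gesture and then speak, using any client exposing say/gesture."""
    robot.gesture(name=gesture)
    robot.say(text=text)
    return text
```

Because greet only relies on the say and gesture methods, the same behavior code could be exercised against the virtual robot, the physical robot, or a stub object during testing.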

5.2 Initial Dialogue System


Paper A and B employed a Finite-state Machine (FSM) to control the dialogue
system integrated with the Furhat robot. Within this FSM, each state includes
potential responses from the robot, along with a list of possible state transitions.
Both the content and the state transitions were manually designed
by the authors of the original exploratory study (Engwall et al., 2022). Figure 5.2
illustrates a condensed representation of the state tree derived from this FSM.
The implementation of this dialogue system required a human to select one
response in the current state. The underlying software would then manage the
updating of the next state responses and the computation of each state transition.
2 https://furhat.io/
3 https://docs.furhat.io/remote-api/

Figure 5.2: A reduced FSM state tree from a social conversation used in the initial
dialogue system employed in Paper A and B.

The human in charge of controlling the robot, known as the wizard4, used an interface
that received the list of possible robot responses at each state, as shown in the
lower-left segment of Figure 5.3. Additionally, the wizard always had the option
to employ a predefined set of responses, including “Yes”, “No”, “Maybe”, “OK”,
and “I don’t know”, as depicted at the top left of Figure 5.3. The wizard also had
the ability to repeat the previous utterance at any moment. Finally, the wizard
had control over initiating or terminating the video recording of the interaction,
as shown with the buttons at the bottom-right of Figure 5.3. All the buttons
within the interface could also be operated using keyboard keys.

4 The term originates from the metaphor of the Wizard of Oz, where a human operator sim-
ulates advanced functionalities of a computer-based system, much like the character in the
story The Wonderful Wizard of Oz.
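
The wizard-driven FSM described in this section can be sketched roughly as follows. This is a minimal illustration and not the system used in Paper A and B: the class names, the example utterances, and the choice that default responses leave the current state unchanged are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Static responses that the wizard can use in every state.
DEFAULT_RESPONSES = ("Yes", "No", "Maybe", "OK", "I don't know")


@dataclass
class State:
    name: str
    # Candidate robot utterance -> name of the state it leads to.
    responses: Dict[str, str] = field(default_factory=dict)


class WizardDialogue:
    """Hypothetical wizard-controlled dialogue over a finite-state machine."""

    def __init__(self, states: List[State], start: str):
        self.states = {state.name: state for state in states}
        self.current = start
        self.last_utterance: Optional[str] = None

    def options(self) -> List[str]:
        """Dynamic responses shown to the wizard for the current state."""
        return list(self.states[self.current].responses)

    def select(self, utterance: str) -> str:
        """Apply the wizard's choice and update the current state."""
        transitions = self.states[self.current].responses
        if utterance in transitions:
            self.current = transitions[utterance]
        elif utterance not in DEFAULT_RESPONSES:
            raise ValueError(f"{utterance!r} is not available in {self.current!r}")
        # Assumption: default responses are spoken without changing state.
        self.last_utterance = utterance
        return utterance

    def repeat(self) -> Optional[str]:
        """Return the previous robot utterance so it can be spoken again."""
        return self.last_utterance
```

In such a design, the interface would call options() after every selection to refresh the list of dynamic responses, while the default responses and the repeat action remain available throughout.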

5.3 Taboo Game System


The initial implementation of the Taboo game, as detailed in Paper D, was
constructed within the gRPC (gRPC Remote Procedure Calls) framework. Sub-
sequently, this implementation was adapted to the Robot Operating System
(ROS)5 framework, as presented in Paper E. The choice to shift to ROS was
primarily driven by the goal of improving reproducibility and encouraging collab-
oration within the HRI community. Consequently, the system discussed here is
accessible via a public GitHub repository6.

5 https://www.ros.org/
6 https://github.com/ronaldcumbal

Figure 5.3: Wizard control interface. Top left shows the list of (static) default
responses. Bottom left has the list of (dynamic) robot responses updated on every
state of the dialogue tree. The corresponding short-cut keys for these responses are
shown at the right side of each option (middle of the image) in grey circles. Bottom
right presents the buttons to control the recording of the interaction.

Robot Operating System (ROS)


As an open-source middleware framework, ROS excels in promoting the devel-
opment, integration, and operation of robotic systems. It offers an extensive set
of tools, libraries, and conventions, empowering users to construct and manage
intricate robotic applications. Within the ROS architecture, several components
contribute to its decentralized communication infrastructure, with key elements
including nodes, topics, and messages, described next:

Nodes are executable processes responsible for computation that communicate with
each other and other components through streaming topics. Nodes can be
used, for example, to control specific actuators, like motors, or perform com-
putations, like planning and localization (Open Robotics, 2018b).

Topics function as communication buses over which nodes exchange messages.
Nodes can publish and subscribe to topics, effectively decoupling the produc-
tion of information from its consumption (Open Robotics, 2019).

Messages are the units through which communication is facilitated: simple data
structures with typed fields such as integers, floating points, or booleans (Open
Robotics, 2018a).

ROS Master plays a crucial role by providing naming and registration services
to nodes, as well as tracking publishers and subscribers to topics.
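
The decoupling between publishers and subscribers described above can be sketched without ROS itself. The toy in-process bus below is not the ROS API (in ROS 1, the Python client library is rospy) and it omits the ROS Master entirely; the class, topic, and message names are hypothetical and only mirror the concepts of nodes, topics, and typed messages.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, DefaultDict, List


@dataclass
class TranscriptMsg:
    """A message: a simple data structure with typed fields."""
    speaker: str
    text: str
    confidence: float


class ToyBus:
    """Routes messages from publishers to subscribers by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        """Register a callback, as a subscribing node would."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver a message to every subscriber; return how many received it."""
        callbacks = list(self._subscribers[topic])
        for callback in callbacks:
            callback(message)
        return len(callbacks)
```

The publisher never refers to its consumers directly, which is the property that lets components such as speech recognition, dialogue management, and robot control be developed and replaced independently.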

The overall framework used in this doctoral work is illustrated in Figure 5.4.

Figure 5.4: General structure used for experimental setup within the ROS frame-
work. Nodes are shown with dashed-line boxes, while arrows represent Topics.
Dotted boxes denote simple configuration or logging files.

5.4 General Interactive Components


This thesis work also incorporates a diverse range of devices that play
an important role in the various stages of the studies presented. Here the most
recurrent ones are described.

Speech Recognition Following the work presented in Paper C, the main Speech
Recognition service used throughout these experiments was Microsoft’s Azure
Speech-to-Text7. Typically, this system was employed with its default set-
tings, modifying the language code when required.
Voice Activity A Voice Activity Detector (VAD) is a model trained or designed
to identify periods of speech and silence in an audio signal. In Paper D
and Paper E, a version of the py-webrtcvad8 VAD
was employed. The settings for this model were not changed.
Microphones Paper A and B used USB headsets to capture participants’ voice
signals during interactions with the robot. However, this approach introduced
undesirable levels of environmental noise and cross-talking. In Paper D a
different strategy was adopted by employing two USB headsets to handle the
VAD and ASR modules with separate inputs. Paper E instead used head-
mounted Shure Model WH20 professional microphones, achieving improved
speaker diarization. Paper H used a microphone array and professional
microphones to enhance both speaker diarization and speech recognition.
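
To make the idea of voice activity detection concrete, the sketch below splits 16-bit mono PCM audio into fixed-length frames and groups consecutive speech frames into segments. The energy threshold used here is a simple stand-in and not the WebRTC detector used in Paper D and Paper E; all function names and the threshold value are assumptions. With py-webrtcvad, the predicate could instead wrap webrtcvad.Vad().is_speech(frame, sample_rate), which expects frames of 10, 20, or 30 ms.

```python
import struct
from typing import Callable, Iterator, List, Tuple


def frames(pcm: bytes, sample_rate: int, frame_ms: int = 30) -> Iterator[bytes]:
    """Yield consecutive fixed-length frames of 16-bit mono PCM audio."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def energy_is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Stand-in detector: a frame counts as speech if its RMS exceeds a threshold."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms > threshold


def speech_segments(
    pcm: bytes,
    sample_rate: int,
    is_speech: Callable[[bytes], bool] = energy_is_speech,
    frame_ms: int = 30,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of contiguous speech regions."""
    step = frame_ms / 1000
    segments: List[Tuple[float, float]] = []
    start = None
    count = 0
    for index, frame in enumerate(frames(pcm, sample_rate, frame_ms)):
        count = index + 1
        if is_speech(frame):
            if start is None:
                start = index * step
        elif start is not None:
            segments.append((start, index * step))
            start = None
    if start is not None:  # audio ended while speech was still active
        segments.append((start, count * step))
    return segments
```

The gaps between the returned segments correspond to silences, which is the kind of information a dialogue system can use to decide when a participant has finished speaking.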

Transparency in Spoken Conversations with Embodied Agents


The presentation of robots to users often defines the intended goals for the in-
teraction. However, these goals are frequently implicit, reflecting a “hidden”
intention that robot designers seek to assess. Shaping group interactions through
subtle cues is a notable example. In Paper H, there is an early effort to enhance
communication regarding the capabilities and limitations of social robots during
spoken interactions.
This work introduces a dialogue system designed to express the level of cer-
tainty in its decision-making process. This includes providing confidence levels re-
lated to speech recognition, utterance selection, and gesture generation, as shown
in Figure 5.5. The aim is to make the robot’s internal processes more transparent
to users, promoting a better understanding of its functionalities and instilling
confidence in human-robot interactions.

7 https://azure.microsoft.com/en-us/products/ai-services/speech-to-text
8 https://github.com/wiseman/py-webrtcvad

Figure 5.5: Scheme of the system architecture, showing the components that gen-
erate confidence scores for their respective outputs.
Chapter 6

Understanding Conversations with L2 Learners

This chapter describes the initial attempts to explore robot-led practice con-
versations with learners of a second language. The information presented here is
extracted from Paper A, Paper B and Paper C, supplemented with additional
insights into the conducted studies.
In particular, this chapter argues that while it is feasible to carefully develop
(constrained) practice conversations between L2 learners and autonomous social
robots, engaging with L2 learners introduces greater complexities
that demand thorough evaluation. These challenges, which arise especially at lower
proficiency levels, require specific assistance, and current technical constraints may
hinder this objective.

6.1 Uncertainty, Confusion or Doubts


At the start of this doctoral work, two pivotal exploratory studies were under-
taken to assess the role of a robot in social conversations within a language café
setting. In both studies, a Wizard, i.e. the human controller, directed the dia-
logue system, selecting responses from a set of options generated by a Finite-state
machine, as described in Section 5.2. The first study, by Engwall et al. (2021),
set out to evaluate four different robot interaction styles to identify those that were
better perceived by the language learners themselves, as well as to understand
the complexities of the interaction. These personalities included an interviewer
that posed sequential questions to one participant at a time, a narrator that
spoke about itself, discussing robots and Sweden, or asked questions related to
Sweden, a facilitator that tried to make participants talk to each other, and an
interlocutor that tried to achieve equal participation among itself and the par-
ticipants. The general results indicated a notable preference for the interviewer
personality, but individual participant preferences were influenced by various fac-
tors. A subsequent analysis revealed increased participant activity when the
robot assumed the facilitator personality, which encouraged engagement (Engwall
and Lopes, 2022). Proficiency level and participant familiarity also played crucial
roles in preferences and activity levels.
Among the results that were reported, it became apparent that participants
experienced confusion during these interactions. In these instances, learners often
sought clarification from the robot or assistance from their conversation partners.
Although the Wizard could repeat utterances or provide default short responses
like “Yes”, “No” or “I don’t know”, as shown in Figure 5.3, these options proved
insufficient in resolving learners’ doubts. From this perspective, one may think
that improvements in the dialogue system might serve as an easy solution for
this predicament. However, as presented in the background sections, given the way
learners behave in a practice conversation and the technical limitations of
speech recognition and appropriate (personalized) feedback, more complex solu-
tions are required for a robot to autonomously resolve these events. Furthermore,
as expected and discussed by Engwall and Lopes (2022), many learners did not
identify the robot as a source of clarification, instead using their partners to
provide assistance.
From the data collected in these original studies, it was clear that most con-
fusion events were associated with language knowledge (not comprehending the
previous conversation turn), conversation misunderstanding or miscommunica-
tion (where common ground was not established), or mishearing (due to the
robot’s synthesized speech). In an effort to explore this phenomenon, the first
step was to understand in depth why learners found themselves in a position of
doubt about the content or state of the conversation. The data collected from
this proposed analysis would then be used to categorize the reactions that lan-
guage learners had when uncertain. Therefore, the first study of this doctoral
work focused on exploring uncertainty in L2 conversations, in particular focusing
on listening uncertainty, defined as the events where a student fails to understand
the information spoken by the conversational partner, as opposed to the instances
when the student struggles to communicate an idea (Cumbal et al., 2020a,b).

Figure 6.1: Dyad social conversation between L2 learners and the robot Furhat.

In this section only a summary of the experimental design is described;
the details are specified in Paper A. The experiment employed one-on-one con-
versations with the Furhat robot, as shown in Figure 6.1. As previously indicated,
occurrences of uncertainty were already observed in the preliminary studies, where
default settings were used for speech rate synthesis and the vocabulary complexity
was kept at a “normal” level. In the experiment, these parameters were manip-
ulated to elicit moments of uncertainty from the participants. This was achieved
by increasing the rate of speech of the synthesizer and introducing highly com-
plex vocabulary (at a university proficiency level). The change in prosody was
expected to produce mishearing, while the change in vocabulary was assumed to
lead to misunderstanding and miscommunications.
The dialogue system used in this study was an updated version of the system
described in Section 5.2, specifically modified to interact with a single participant,
as we wanted to avoid the possibility that learners employed their conversational
partners to resolve moments of uncertainty. The dialogue content was designed
to engage the participants in a social conversation, including discussions of per-
sonal preferences, experiences in Sweden, and the process of learning a second
language. During the interaction, the system automatically introduced the mod-
ified utterances (with higher complexity and faster speech rate synthesis) a few
turns after the conversation started and again shortly after its midpoint. The
wizard was instructed to select the modified options when these appeared on
screen. The complete interaction was recorded with face-front webcams.
Although Paper A does not address this aspect, the experimental design
included a self-report measure of uncertainty. Participants were instructed to
assess video clips capturing instances where the system prompted moments of
uncertainty, after concluding the conversation, as illustrated in Figure 6.2. The
objective was to use these self-evaluations to validate that participants had indeed
experienced confusion during their interactions with the robot. Unfortunately, a
large portion of this data was corrupted during the execution of the studies.

Figure 6.2: Video replay with self-reporting questions focused on uncertainty. The
faces were intentionally blurred solely for publication purposes and were not ob-
scured in the actual system.

Figure 6.3: Variation of uncertainty across four different types of learners’ reactions.

This occurred because the video clips presented to the participants displayed
incorrect moments of the conversation, failing to show the instances where the
robot intentionally tried to elicit moments of uncertainty. Hence, the self-reported
values of confusion were attached to random moments of the conversation.
Paper A presents findings from a visual analysis conducted on the recordings
of robot-led conversations. This section, instead, focuses on the
results pertaining to the overall outcomes of our intended manipulation. Among
all the events manipulated to induce uncertainty in the learners, it was observed
that two-thirds of the more challenging robot utterances, characterized by higher
speed or complexity, resulted in confused reactions. Interestingly, many learners
were able to comprehend the modified output of the robot by grounding the
interaction in the recent dialogue context. Out of the thirty-six instances where
uncertainty was successfully triggered, thirty-two led to a clarification request.
However, in four instances, the learners did not provide a response, indicating a
more complex scenario. This non-response aspect hints at potential challenges or
complexities in the learners’ interactions with the robot.
Based on these discoveries, our objective was to comprehensively examine the
entire collection of recordings to enhance our understanding of how moments of
uncertainty manifest themselves in conversation practice. The analysis started by
identifying instances of uncertainty that were effectively generated through mod-
ifications in the robot’s spoken output and progressed through the entire range
of interactions. This analysis led to a clear realization that the concept of uncer-
tainty should not be simplified as a binary phenomenon. Therefore, (un)certainty
was reinterpreted as a continuum ranging from absolute confidence to complete
unresponsiveness, as shown in Figure 6.3. In the initial analysis, and with a few
refinements thereafter, four specific reactions were identified that corresponded
appropriately to the range of confidence displayed in learners’ responses to the
robot’s input during the conversation. These reactions included direct responses,
thoughtful responses, clarification requests, and instances where no response was
provided. A summary of the annotation scheme outlining these reactions is pre-
sented next (reflecting the illustration of Figure 6.3):

• Direct Response: Participant responds quickly and confidently, shows no
thinking process or responds with “I don’t know” quickly.

• Thoughtful Response: Participant reasons, wonders, meditates about the
response. Participant may start a response, but delays the answer while reasoning
about its content. Responds with “I don’t know” after some hesitation.

• Clarification Response: Participant requests the robot to slow down, repeat
or explain what it said. This request is directed to the robot.

• No Response: Participant does not reply to the robot and the robot
continues the conversation.

At this point, it was also interesting to evaluate whether these same inter-
pretations could be applied to multi-party interactions. Using the data collected
in the initial exploratory experiments, we extended the annotation scheme for
learners’ reactions to uncertainty, incorporating the following:

• Clarification-Peer Response: Participant asks their partner (not the robot)
for clarification about the robot’s utterances.

A total of 42 conversations led by robots were examined by two non-expert
annotators, following the scheme presented above. These conversations included
20 dyad and 22 triad interactions. Each annotated segment started and concluded
at the beginning of consecutive robot utterances, as depicted in Figure 6.4. Addi-
tionally, within each segment, the end of the robot’s utterance and the initiation
of the participant’s response were manually annotated. To prevent misinterpre-
tations of participant responses, fillers (or hesitation marks) were not considered
as the starting point of a participant’s response. Finally, the agreement between
annotators resulted in a kappa coefficient between 0.67 and 0.73 (computed from
different subsets of dyad and triad interactions).
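
For readers unfamiliar with this agreement measure, the sketch below shows how a kappa coefficient can be computed for two annotators. It assumes Cohen's kappa, the common choice for two raters with categorical labels; the thesis does not state which variant was used, and the function name and example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty set of segments.")
    n = len(labels_a)
    # Observed agreement: fraction of segments given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported range of 0.67 to 0.73 lies well above chance level.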

Figure 6.4: A labeled audio recording used to train an (un)certainty detection
model. The portion of data Before Participant Speech predicts uncertainty levels
prior to a learner’s response, whereas the Complete Turn data is used to categorize
uncertainty levels throughout the entire interaction.

While the interpretation of (un)certainty presented in this thesis is characterized
as a range, the manifestation of these states in a learner’s behavior
is articulated through distinct dialogue acts. With this concept in mind, our
next goal involved training a model to automatically detect the various ranges
of (un)certainty exhibited by a learner. In particular, given that heightened
(un)certainty could manifest itself in learners either taking too long to respond
or completely shutting down, our objective was to develop models capable of
promptly identifying signs of (un)certainty. Therefore, we used two chunks of the
annotated segments as input data: only the frames before the participant’s speech
started (BPS) and the complete turn (CT), as depicted in Figure 6.4. In this
process, each annotated segment was processed to extract Facial Action Units,
gaze direction angles, head pose coordinates, and head rotation angles from every
image frame. Time derivatives (∆) of these features were also computed.
As for speech features, we extracted RMS, 13 MFCCs, Mel-Bank Spectrogram
(13 components), and voice activity. From the MFCCs, we also computed the
time derivatives (∆ and ∆∆). Finally, temporal features included the manually
annotated time duration of the silence gap and the length of the complete turn.
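The two input chunks and the time derivatives described above can be sketched as follows. This is a simplified illustration, not the pipeline used in Paper B: the function names are hypothetical, the derivative is approximated by a first-order frame difference (the thesis does not specify the estimator), and the feature values are invented.

```python
def split_segment(frames, fps, response_onset_s):
    """Return the (BPS, CT) chunks of one annotated segment.

    frames: per-frame feature values, starting at the robot utterance onset.
    response_onset_s: annotated start of the participant's response, in seconds.
    """
    onset_index = round(response_onset_s * fps)
    before_participant_speech = frames[:onset_index]
    complete_turn = frames
    return before_participant_speech, complete_turn


def delta(values):
    """First-order frame difference, padded with 0.0 to keep the length."""
    if not values:
        return []
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical intensity track of one Facial Action Unit at 10 frames per second.
au_track = [0.0, 0.0, 1.0, 3.0, 6.0, 6.0, 5.0, 3.0, 1.0, 0.0]
bps, ct = split_segment(au_track, fps=10, response_onset_s=0.5)
d1 = delta(ct)   # corresponds to ∆
d2 = delta(d1)   # corresponds to ∆∆
```

Restricting the input to the BPS chunk reflects the stated aim of detecting (un)certainty promptly, before the learner starts to respond.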
The complete set of results of these experiments is reported in Paper B; here,
this section emphasizes the most important finding. Figure 6.5 illustrates the results
of the best-trained models (with the Random Forests algorithm). Notably,
segments labeled as Direct Responses and No Responses exhibit high levels of
prediction accuracy, whereas those falling in between tend to have lower values.
These findings affirm that visual, speech, and time features contribute to the ac-
curate classification of a student’s level of (full) certainty or uncertainty. These
results are in line with literature that defines uncertainty as a binary concept,
where attempts to classify confusion often adopt a framework of either no confusion
or complete confusion (see Alyüz et al., 2016; Bosch et al., 2014, 2015; Lallé
et al., 2016). However, characterizing a learner’s progress as entirely uncertain
or certain would be unfair. In particular, in line with the theory of the
Zone of Proximal Development (ZPD) (Vygotsky and Cole, 1978),
effective learning occurs when learners receive guidance and support to tackle
tasks slightly beyond their current independent capability. Consequently, providing
assistance only when a learner is excessively confused or entirely certain
might not be optimal. It is argued then that for a robot to lead conversation
practice effectively, it should be aware of all of these complexities and capable of
fluidly reacting to various degrees of (un)certainty. If this requirement cannot be
fully guaranteed, then different forms of support should be examined.

Figure 6.5: Normalized Confusion Matrix results for (un)certainty detection (Random
Forest models). BPS: Data before participant’s speech starts, CT: Data of the
complete turn.
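For readers unfamiliar with the normalized confusion matrix used to report the classification results, a minimal sketch is given below. Each row corresponds to a true label and is divided by its total, so that the diagonal holds the per-class recall. The predictions are invented for illustration and do not reproduce the results of Paper B.

```python
def normalized_confusion_matrix(y_true, y_pred, labels):
    """Row-normalized confusion matrix: rows are true labels, columns predictions."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        counts[index[true_label]][index[predicted_label]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows without any samples are left as zeros.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


labels = ["Direct", "Thoughtful", "No Response"]
# Hypothetical ground truth and model output for six segments.
y_true = ["Direct", "Direct", "Thoughtful", "Thoughtful", "No Response", "No Response"]
y_pred = ["Direct", "Direct", "Thoughtful", "Direct", "No Response", "No Response"]
matrix = normalized_confusion_matrix(y_true, y_pred, labels)
```

In this toy case the extreme classes are recovered perfectly while the intermediate class is partly confused with a neighbouring one, which mirrors the qualitative pattern discussed above.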

6.2 Speech Recognition with L2 Speakers


In the course of the previous study’s development, and the analysis of initial
multi-party exploratory studies, we attempted to transcribe conversations using
commercially available ASR, also known as Speech-to-Text (STT), services. It
became evident that the transcriptions produced were of lower than acceptable
quality. As such, it was necessary to run a careful analysis of these services.
Despite constant advancements in the field of Speech Recognition, particularly
evident in benchmarks comparing new research models, the utilization of cloud-
based services remains the primary method for integrating speech recognition
capabilities into robotic platforms (Marge et al., 2022).
To carefully evaluate commercially available ASR services, two distinct datasets
were used. The first dataset, named Ville, comprised read sentences produced
as part of pronunciation training within a virtual teacher program designed for
L2 learners of Swedish (Wik et al., 2009). The second dataset, named CORALL,
consisted of the manually transcribed conversations collected during the initial
exploratory studies, as presented at the beginning of this chapter. Given that this
dataset predominantly featured L2 speakers, a decision was made to incorporate
transcriptions from recordings made in a pilot study where a human speaker led
the conversation instead of the robot Furhat. These recordings corresponded to
an L1 Swedish speaker. Furthermore, during the initial search for ASR systems,
it was noted that only a limited number of them offered support for various languages.
Among these, only Google Cloud¹ and Microsoft Azure² could facilitate
Swedish speech transcriptions. Additionally, a comparison with non-commercial
state-of-the-art models in research, such as Wav2vec2³, an “off-the-shelf” model
from the Huggingface⁴ platform, was also incorporated.

¹ https://cloud.google.com/speech-to-text
² https://azure.microsoft.com/en-us/products/ai-services/speech-to-text
³ KBLab/wav2vec2-large-xlsr-53-swedish
⁴ https://huggingface.co

Dataset                 Speech   Google   Microsoft   Huggingface
Ville                   L1       0.162    0.111       0.522
(Read sentences)        L2       0.325    0.410       0.593
CORALL                  L1       0.412    0.356       0.641
(Social conversation)   L2       0.421    0.507       0.663

Table 6.1: Word Error Rate (WER) for three ASR services tested with two
datasets containing recordings of read speech and of social conversation.

The main findings of this study are summarized here, with a more detailed
presentation available in Paper C. It is evident from these results, shown in Table 6.1,
that samples corresponding to L1 Swedish speakers generally exhibit better
transcription performance, with a lower Word Error Rate (WER), than those
from second language speakers. However, this difference becomes less obvious
when dealing with utterances in spontaneous speech. Analyzing the results from
the CORALL dataset, the only statistically significant result is noted in the tran-
scriptions generated by Microsoft ASR (L1: 0.36 vs. L2: 0.51, p < 0.05), while
other ASRs perform equally badly (Google L1: 0.41 vs. L2: 0.42 and Huggingface
L1: 0.64 vs. L2: 0.66). These results highlight that WER increases, nearly
doubling, with L2 speakers for read sentences, but that for the spontaneous speech
dataset, the performance of all ASRs deteriorates. The observation that Microsoft
Azure ASR performs better for L1 speakers in conversations compared to Google
and Huggingface may be attributed to its system development, specifically tai-
lored for conversations. Moreover, upon analyzing word errors, it was discovered
that among the most frequently misrecognized utterances for L2 speakers there
were specific words that signal important requests for assistance from the user,
e.g. “understand” and “repeat”.
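Since WER is the central metric of this comparison, a short sketch of its standard definition may be useful: the word-level edit distance (substitutions, deletions and insertions) between the reference and the ASR hypothesis, divided by the number of reference words. The implementation below is a generic one and not the evaluation script of Paper C; the Swedish example sentences are invented and only illustrate how a dropped request word such as “upprepa” (repeat) contributes to the error.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("The reference transcription must not be empty.")
    # previous[j]: edit distance between the processed reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: the ASR drops the word "upprepa" from a learner's request.
wer = word_error_rate("kan du upprepa det", "kan du det")
```

Note that WER weighs all words equally, so a low overall error rate can still hide the loss of exactly those words that signal a request for assistance.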
In recent years, there has been a substantial advancement in state-of-the-art
speech recognition, notably demonstrated by OpenAI’s Whisper model (Rad-
ford et al., 2023). This model has obtained recognition for its substantial im-
provements across various benchmarks and its performance in multiple languages.
Google (Zhang et al., 2023) and Meta (Pratap et al., 2023) have also contributed
large models, emphasizing their capability to transcribe diverse
languages as well. Despite these achievements, a notable concern persists regard-
ing the ease with which these models can generalize to data beyond their training
distribution. This raises a dual challenge: languages not fully represented in the
training data may exhibit disparate recognition performance, and the phonetic
variability of these languages could further impact the model’s efficacy. With this
idea in mind, the performance of speech recognition for second language learners,
particularly with less-resourced languages, is likely to drop below an adequate
level for effective conversation practice.
As a final remark, though a direct one-to-one comparison may not be entirely
applicable, it is intriguing to note the performance metrics for the Swedish lan-
guage using the Whisper model. In Common Voice 9 (Ardila et al., 2019) it
obtains a WER of 10.6% and in FLEURS (Conneau et al., 2023) it stands at
8.5%. Both of these datasets contain only read speech data. The best result we
obtained in our experiments, also with the read speech dataset, was not far from
these values at 11.1% WER.

6.3 Pathways to Explore


This chapter made the case that the intricacies linked to second language speak-
ers and the persisting limitations in speech technology present compelling reasons
to reconsider the support provided by social robots in practice conversations. As
previously commented, there are various studies suggesting that a reduced per-
formance in ASR may not necessarily impede the development of dialogue with
spoken dialogue systems (D’Mello et al., 2010; Litman et al., 2006). As part
of our team’s research, Engwall et al. (2022) evaluated the adequacy of robot
utterance selection by comparing manual selection based on ASR transcriptions
with autonomous methods. The latter option included methods using predefined
sequences of robot utterances, a language model selecting utterances based on
learner input, and a custom statistical method trained on the wizard’s choices in
prior conversations. This analysis was done using data from Paper A, Paper
B, and Engwall and Lopes (2022). The results revealed that the custom statistical
method performed as well as manual selection, supporting the notion that dia-
logues can succeed despite reduced ASR accuracy. Furthermore, human wizards,
even with high ASR word error rates, selected acceptable robot utterances in the
majority of cases (96%).
Nonetheless, as second language learners engage in the process of acquiring
a new language, they often do require assistance to address evolving challenges
during conversations. As found in Paper C, some important words that
learners can use to request assistance, or clarification, were still misrecognized
frequently. This shows that the level of support that a robot could provide may
still fall short of optimal, considering that learners often appreciate clarification
on misunderstood phrases. Furthermore, even though these were not explored in
this thesis work, learners can also request corrections for pronunciation, grammar,
vocabulary usage and constructive guidance on cultural nuances in communica-
tion. The ability of social robots to provide nuanced and contextually appropriate
feedback in these aspects remains an area that requires further research.
However, this thesis maintains that social robots possess the potential
to enhance the development of spoken interaction practice between L2
learners and robots. The emphasis is then not primarily on the potential ways
a robot might contribute to cognitive performance improvements, but rather on
how a social robot can enhance the overall conditions for effective spoken language
practice.
Chapter 7

Enhancing Speaking Practice

While it has been previously contended that achieving unconstrained conversational
practice led by (autonomous) robots requires further research, this chapter
argues that there are distinct interactive tasks where robots excel in supporting
second language practice within unconstrained settings. The development of this
chapter draws upon insights from studies reported in Paper D and Paper E,
supplemented with additional information not covered in those publications.

7.1 Leveraging Group Dynamics


The role of robots in group settings is an evolving area of research, with consistent
findings suggesting that appropriately designed robots can indeed influence group
dynamics (Sebo et al., 2020). However, the question of whether these observed
behaviors can be successfully translated to settings focused on L2 learners remains
unanswered. Admittedly, the connection between a robot influencing group behavior
and the specific goal of L2 practice may not be immediately apparent. However,
the concept of willingness to communicate, i.e. measuring a person’s inclination
to use a language in various situations, is frequently employed in second language
learning. Considering this notion, this thesis proposes a shift in the role of a
social robot in group interactions, focusing on how it could enhance L2 learners’
willingness to communicate during spoken interactions.
Previous research demonstrated the ability of social robots to positively im-
pact socio-emotional states in learners (Randall, 2019). For example, Hong et al.
(2016) found that introducing a robot into the curriculum positively influenced
the motivation of a classroom of children. Saerbeck et al. (2010a) and Shimada
et al. (2012) demonstrated that explicit use of supportive verbal expressions from
a robot tutor enhanced learning performance and promoted group collaboration
among children. However, the specific aspect of whether a social robot can motivate
learners, in particular young adults or adults, to increase their speaking
practice has not been thoroughly explored.


Our focus, hence, centers on how a social robot could efficiently and naturally
motivate learners to increase the amount of speaking in L2 practice. An important
consideration in motivating students to engage in speaking tasks is whether they
could feel rushed or surprised when called upon unexpectedly, particularly during
discussions, a practice usually referred to as “cold-calling” or “random-calling”.
While these methods have been shown to boost participation among typically quiet
students (Dallimore et al., 2013), some studies highlight concerns about their
impact on learners and their potential to cause anxiety or discomfort (Cooper et al.,
2018; Ishino, 2022). Therefore, a key characteristic of our approach involved using
nonverbal cues, powerful in interaction but subtle enough to avoid inducing anxiety
in L2 speakers. Through this process, different alternatives were evaluated,
including backchannels to indicate when the robot is listening (Skantze et al., 2015)
and gaze, used either to persuade participants to consider the robot’s suggestions
(Chidambaram et al., 2012) or mixed with other cues to manage turn-taking
(Skantze et al., 2015). Changes in speech, including intonation (Chidambaram
et al., 2012; Kory Westlund et al., 2017), were also examined, as studies on
persuasive vocal tone demonstrated its impact on compliance (Wainer et al., 2010).
Gestures like head nods and facial expressions were also considered (Saerbeck
et al., 2010b).
Due to the strength of their effect in group interactions, and to limit the
analysis to only one cue per study, gaze shifting and backchanneling were selected
to shape interactive dynamics of participants in an L2 group practice activity.
Studies have demonstrated that a robot’s gaze can shape the roles of participants
in a conversation (Mutlu et al., 2012) and different backchannels can contribute
to balanced participation in turn-taking behavior (Skantze, 2017).

7.2 Different Pairings


One practice that this thesis advocates for is the use of real-life scenarios to com-
prehensively evaluate the role of social robots in society. Within the context of
conversation practice with L2 learners, this approach is reflected in how the inter-
actions are conceptualized. Unlike conventional language learning settings, where
interactions are typically confined to the classroom, limiting learners’ exposure
to individuals of diverse proficiency levels and impeding a true representation
of “real-world” dynamics, this thesis emphasizes the importance of mixing L2
learners across different proficiency levels, and even including interactions with
L1 speakers to better simulate real-life scenarios.
This idea has captured interest within both cognitive and socio-cultural perspectives
of L2 learning. Notably, prominent theories highlight the importance
of learner interactions in facilitating this process (Storch and Aldosari, 2013).
Specifically, when learners collaborate, they engage in “languaging”¹, allowing
them to collectively tackle language challenges, combine their linguistic expertise,
and consequently deepen or co-construct their understanding of language (Storch
and Aldosari, 2013). Research examining the advantages of pairing L2 learners
with mixed proficiency, however, yields somewhat inconsistent results. For instance,
Kowal and Swain (1994) observed that the most dominant participant benefited
more from the interaction, while Leeser (2004) found that the more proficient
participant faced greater disadvantages. Furthermore, Kim and McDonough (2008)
noted the influence of proficiency levels on the relationships formed within pairs.
The general consensus seems to suggest that pairing students with mixed L2
proficiencies may benefit both learners if they collaborate effectively (Storch and
Aldosari, 2013), as opposed to situations where one participant dominates the
interaction (Watanabe and Swain, 2007). This approach, hence, is a valuable
alternative for practice conversations with L2 learners and a social robot.

¹ As noted by Swain et al. (2009), during complex activities, “languaging” can take the form
of speaking aloud, whispering to oneself, or explaining tasks to someone else, among other forms.

Figure 7.1: Photographs captured during the studies in Paper E and Paper D
depicting pairs of participants engaging in the game Taboo with the robot Furhat.
Both L1 and L2 speakers collaborate to describe a word presented on a table or
on the screen, as illustrated in the bottom right corner. The robot Furhat employs
non-verbal behaviors to balance or encourage participation.
Furthermore, although this initial consideration did not explicitly address the
additional phenomena that are manifested when individuals from diverse back-
grounds interact, the chosen setting offered a unique vantage point to explore
these cultural dynamics. In particular, the collaboration between L1 and L2
speakers not only served as a means to evaluate the effectiveness of social robots
in facilitating language learning but also provided a rich context for understand-
ing the broader socio-cultural implications. This interaction sheds light on the
intricacies of cross-cultural communication, highlighting the potential challenges
and opportunities that arise when people with different linguistic and cultural
backgrounds come together. These topics are further explored in Chapter 8.

7.3 Balancing and Encouraging Participation


The implementation of these objectives is detailed in the findings of both Paper D
and Paper E. In these studies, a variation of the Taboo game, known as
Med Andra Ord (Swedish for “With Other Words”), was employed as a natural
and popular speaking game frequently used for second language practice. Taboo
involves players describing a specific word to their teammates without using cer-
tain “taboo” words associated with it. In our setup, the social robot assumed the
role of the player guessing the words, while the participants, L1 and L2 speak-
ers, described the target words, as illustrated in Figure 7.1. This arrangement
inherently placed L2 speakers in a situation where they had to communicate in a
second language to advance the game, which not only granted greater autonomy
to the robot but also provided participants with increased interactive freedom
within a loosely controlled setting. Moreover, careful considerations were made
to progressively heighten the difficulty of the target words played during the game.
In Paper D, gaze was utilized to establish a balance in speaking participation,
whereas in Paper E, backchannels were employed to extend the speaking time of
L2 speakers. In both instances, we employed the speech ratio between participants
as an indicator of group dynamics and to regulate the generation of non-verbal
behaviors. A description of this approach is depicted in Figure 7.2 and the design of the
gazing behavior was structured as follows:

Gazing Behavior: During instances when the participant with the majority
of speaking time was active, the robot consistently distributed its gaze equally
between both participants. On the other hand, when the participant with the
lower speaking time assumed the speaking role, the robot adjusted its gaze
proportionally based on the participants’ relative speaking duration. This
adjustment resulted in allocating more gaze time to the participant with the
lower speaking time, a behavior consistently maintained even during moments
of silence. To accommodate the tendency of humans to not perfectly align
their head angle with their gaze, especially at subtle gaze angles, the robot
incorporated subtle head rotations towards the participant who was the focal
point of its gaze. These subtle head movements were integrated to accentuate
gaze patterns. Additionally, the robot engaged in gaze aversion approximately
25% of the time. The gaze target for aversion was deliberately kept constant
to prevent the impression of the robot randomly looking around the room.

Figure 7.2: Proposed adaptive robot behavior to balance L1-L2 interaction.
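The gaze allocation rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in Paper D: the function name is hypothetical, and the exact proportional mapping is not given in the text, so the sketch assumes that the share of gaze directed at the less active participant equals the speech ratio of the more active one. Gaze aversion is modelled as a fixed 25% share.

```python
def gaze_shares(speech_time, active_speaker=None, aversion=0.25):
    """Share of gaze time per target for a two-participant interaction.

    speech_time: accumulated speaking time (seconds) for exactly two participants.
    active_speaker: participant currently speaking, or None during silence.
    """
    (first, time_first), (second, time_second) = speech_time.items()
    total = time_first + time_second
    on_participants = 1.0 - aversion
    quiet, dominant = (first, second) if time_first <= time_second else (second, first)
    # Equal split while the dominant participant speaks (or when there is no imbalance).
    if total == 0 or time_first == time_second or active_speaker == dominant:
        return {first: on_participants / 2, second: on_participants / 2, "averted": aversion}
    # Otherwise (quiet participant speaking, or silence) favor the quiet participant.
    # Assumption: the quiet participant's gaze share mirrors the dominant speech ratio.
    dominant_ratio = speech_time[dominant] / total
    return {
        quiet: on_participants * dominant_ratio,
        dominant: on_participants * (1.0 - dominant_ratio),
        "averted": aversion,
    }


# Hypothetical state: the L1 speaker has talked three times as long as the L2 speaker.
times = {"L1": 30.0, "L2": 10.0}
while_l1_speaks = gaze_shares(times, active_speaker="L1")
while_l2_speaks = gaze_shares(times, active_speaker="L2")
during_silence = gaze_shares(times)
```

Keeping the favored allocation during silence follows the description above, where the robot continues to attend more to the less active participant even when nobody is speaking.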

The backchanneling behavior followed a similar structure, depicted in Figure 7.3 and
outlined as follows:

Backchanneling Behavior: The generation of backchannels was facilitated
by a simple model that identified suitable moments for the robot to prompt
a backchannel. This system relied solely on speech activation and did not
consider prosodic features. The model was governed by two parameters: (1)
a minimum amount of active speech detected, set at 1.5 seconds, and (2) a
minimum gap between potential backchannels, set at 2.0 seconds. Given this
distribution of backchannel opportunities, the robot prompted a backchan-
nel with an inverse ratio to the speaking time between participants. Con-
sequently, the participant who spoke the least received a higher number of
backchannels. These backchannels were additionally complemented with a
randomly selected head nod during the interaction.
Importantly, all identified opportunities for backchannels were constrained to
occur within a speaker’s ongoing utterance. Finally, the robot’s gaze was
programmed to track the current speaking participant.
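The backchannel timing model described above can be sketched as follows. The two thresholds (1.5 seconds of active speech and 2.0 seconds between opportunities) come from the text; everything else is an assumption made for illustration, including the function names, the 100 ms voice-activity frame length, and the reading of “inverse ratio” as a backchannel probability equal to the other participant's share of the speaking time.

```python
def backchannel_opportunities(vad_frames, frame_ms=100,
                              min_speech_ms=1500, min_gap_ms=2000):
    """Return the times (ms) at which a backchannel could be prompted.

    vad_frames: sequence of booleans, True when speech is detected in a frame.
    """
    opportunities = []
    run_ms = 0        # length of the current uninterrupted speech run
    last_ms = None    # time of the most recent opportunity
    for index, is_speech in enumerate(vad_frames):
        run_ms = run_ms + frame_ms if is_speech else 0
        now_ms = (index + 1) * frame_ms
        enough_speech = run_ms >= min_speech_ms
        enough_gap = last_ms is None or now_ms - last_ms >= min_gap_ms
        # Opportunities only arise inside an ongoing utterance (run_ms > 0).
        if enough_speech and enough_gap:
            opportunities.append(now_ms)
            last_ms = now_ms
    return opportunities


def backchannel_probability(own_speech_s, other_speech_s):
    """Assumed inverse ratio: the less a participant has spoken, the higher the value."""
    total = own_speech_s + other_speech_s
    if total == 0:
        return 0.5
    return other_speech_s / total


# Five seconds of continuous speech, sampled in 100 ms frames.
continuous = backchannel_opportunities([True] * 50)
# One second of speech, a half-second pause, then three seconds of speech.
interrupted = backchannel_opportunities([True] * 10 + [False] * 5 + [True] * 30)
p_l2 = backchannel_probability(own_speech_s=10.0, other_speech_s=30.0)
```

At each opportunity, a backchannel would then be produced with the probability of the current speaker, so that the participant who has spoken least receives the most backchannels.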

Figure 7.3: Proposed backchannel generation to encourage L2 speakers.



The detailed results of these studies can be found in Paper D and Paper E.
This section provides an overview of the most important results. As illustrated
in Figure 7.4a, the imbalance in participation between an L1 and L2 speaker
was notably diminished through the implementation of an adaptive gaze behav-
ior. Remarkably, these positive outcomes persisted even as the game’s difficulty
increased. We further observed that part of the reason for a more balanced inter-
action was the simultaneous decrease in participation from L1 speakers, coupled
with an increase in speaking time from L2 speakers. As a result, there was an
interest in exploring whether additional actions from the robot could exclusively
boost the participation of L2 speakers. Turning to the results from employing an
adaptive backchannel strategy, depicted in Figure 7.4b, it became evident that
the amount of speaking time for L2 speakers did, indeed, significantly increase.
Additionally, there was a slight decrease in speaking time for L1 speakers, which
aligns with the context of the game. The game’s setup imposes limits on total
speaking times due to a maximum game time per word and a semi-fixed number
of game words.
Importantly, these results present a highly encouraging outlook on the po-
tential role of social robots, portraying them as a promising force in cultivating
positive (pro-social) environments for human interactions. Especially within the
area of second language learning, the findings suggest that despite the complex-
ities inherent in L2 practice and potential technical limitations that may not
currently offer a robust solution for open practice conversations, there is clear
feasibility in a robot’s ability to orchestrate multi-party interactions to enhance
the practice of a second language.

(a) Results from an adaptive gazing behavior on balancing spoken participation.
(b) Results from an adaptive backchannel generation on the amount of speech
between L1 and L2 speakers.
Figure 7.4: Results from an adaptive backchanneling robot strategy to encourage
L2 speakers.

Furthermore, while exploring the cultural aspects of second language learning,
the interaction between L1 and L2 speakers not only showcased the effectiveness
of social robots in language practice but also revealed a nuanced perspective on
socio-cultural dynamics intricately intertwined with the evolution of social robots.
This intricate interplay accentuates the challenges and opportunities that emerge
in the development of human-robot interactions and underscores their potential
impact on societal development. These considerations are consequently evaluated
more deeply in Chapter 8.
Chapter 8

Cultural Perspectives

In this chapter it is argued that cultural aspects have not been adequately emphasized
within social robot research, including certain sections of RALL. While
the preceding discussions have primarily addressed the challenges and nuances
associated with implementing robot-led practice conversations and explored the
distinct characteristics that make robots optimal for supporting language prac-
tice in unconstrained settings, there remains a notable gap in the consideration
of cultural perspectives within this discourse. The content of this discussion is
based on Paper F and Paper G.

8.1 Cultural Effects on Social Robots


The discussion initiated in Paper E was expanded through additional analyses
derived from the collected data. The original results indicated that adapting the
robot’s backchanneling strategy could influence the participants’ speaking time,
specifically encouraging less active speakers to participate more. However, a more
in-depth examination revealed that different socio-cultural groups responded
differently to the robot’s backchannel strategy. While the complete
analysis includes factors like gender, age, first language, extroversion, and fa-
miliarity with robots, here, we focus on the L1 and L2 aspects of the pair of
participants.
The results showed that among L2 speakers, males, younger individuals, ex-
troverts, and those familiar with robots were more encouraged by the additional
generated backchannels. In contrast, females, older individuals, introverts, and
those less familiar with robots were less encouraged to speak more. Further anal-
yses supported the hypothesis of a positive relationship between age (grouped by
age below and above 34 years) and speaking time for L2 speakers in the control
condition in which the robot did not use an adaptive backchanneling strategy,
indicating that older individuals tended to speak more. These findings suggest
that socio-cultural factors play a crucial role in shaping individuals’ responses to

55
56 CHAPTER 8. CULTURAL PERSPECTIVES

Figure 8.1: Proposed study to examine preconceptions of nationality-encoded robot


interactions.

the robot’s backchannels, although limited subject numbers per category result
in few significant differences
The demonstrated results highlight that the effectiveness of participation-
adjustment, i.e., tailoring robot backchannels based on the participants’ speaking
contributions, varies across different socio-cultural groups. Consequently, the
formulation, timing, and frequency of backchannels may need to be tailored for
specific socio-cultural groups to achieve the same intended function across diverse
participants. This underscores the importance of adapting robot interactions
based on the dynamic cultural nuances inherent in human interactions.

8.2 Cultural Stereotypes and Social Robots


In the context of HRI, it is logical for robots to adapt to cultural factors; however,
it is equally crucial to examine which cultural elements elicit specific reactions
from individuals interacting with robots possessing human-like features. Paper
G addresses this inquiry by focusing on people’s perceptions of cultural elements
in human-like robots. Specifically, the study explores whether nationality-based
preconceptions related to appearance and accents influence individuals’ percep-
tions of both virtual and physical social robots. It is crucial to note that in this
study the classification of nationality does not endorse specific robot designs that
imply a cultural identity; instead, it evaluates existing presentations available
in commercial and research platforms.
To achieve this goal we designed a study with multiple phases, as shown in
Figure 8.1. The study commenced by evaluating people’s immediate perceptions
of virtual robots embodying characteristics representative of a specific nationality.
An online survey, assessing different accents of English and nationality-influenced
faces for a virtual robot, revealed that accents, in particular, led to preconcep-
tions regarding perceived competence and likability. The results for the accented
synthesized voices are shown in Figure 8.2. These findings align with prior research
in social science, reflecting negative and positive stereotypes associated
with accents in human-human interaction.

Figure 8.2: Mean scores for the results from the online survey measuring ease
of Understanding, Naturalness, perceived Competence, Likeability and perceived
English Proficiency in different English accented voices. The ∧ marker indicates voices
selected for the following physical robot study.
Subsequently, another group of participants engaged with a robot embodying
four nationality representations derived from the online survey. It was ensured
that these robot identities were comparable across various perceptual dimensions,
differing only in likability and perceived competence. The results revealed that
the preconceptions based on national stereotypes observed in the online survey
were either diminished or overshadowed by factors related to general interaction
quality. An extension of the study, replacing the physical robot with a virtual
one in the same online scenario, produced similar results. This indicates that
preconceptions become less significant in actual interactions, emphasizing that
differences in robot ratings between the online survey and the interaction are not
influenced by the interaction medium.
The study suggests that attitudes towards stereotypical national represen-
tations in HRI have a weak effect, at least for the user group included in this
study, primarily composed of educated young students in an international set-
ting. Future work is required to explore additional human-robot interaction set-
tings, verifying the validity of the findings in diverse HRI scenarios. Additionally,
it is essential to investigate whether the results persist when improvements in
text-to-speech technology eliminate technology-induced pronunciation mistakes.
A noteworthy and fundamentally positive conclusion is that the study indi-
cates that prejudice regarding different nationality stereotypes does not appear to
be a strong factor in human-robot interaction, at least for the examined subject
group. The participants, young university students in an international setting,
possess traits, such as young age, a higher educational level, exposure to
different cultures, and multilingual competence, that have been associated with
cultural open-mindedness and higher acceptance of accented speech (Boduch-Grabka
and Lev-Ari, 2021; Dekker et al., 2021).
Overall, these findings suggest the necessity of considering cultural nuances
in designing social robots and highlight the potential impact of socio-cultural
factors on human-robot interactions. Further research is warranted to explore
these dynamics in diverse settings and populations, ensuring that social robots
are designed and deployed in a culturally sensitive manner.
Chapter 9

Paper Contributions

This chapter outlines the key contributions in the appended papers of the thesis.
Additionally, it describes the authors’ role in each paper.

9.1 Paper A
Uncertainty in Robot Assisted Second Language Conversation Prac-
tice - Ronald Cumbal, José Lopes and Olov Engwall

Scientific Contributions: Through the evaluation of conversations led
by a social robot, this study demonstrates that most L2 learners employ repair
mechanisms to resolve instances of uncertainty, while a notable portion opts to
remain silent when in doubt. An analysis of Facial Action Units and gaze directions
reveals substantial differences between uncertain events and general events
throughout the conversation. The experiment involves dyadic practice conversations
wherein the social robot’s output is manipulated to induce uncertainty
through increased lexical complexity or prosody modifications.
Author Contributions: Ronald Cumbal proposed the study idea and led
the design of the experiment. Ronald Cumbal took charge of conducting the
study, performing data analysis and leading the writing of the published paper.
José Lopes and Olov Engwall contributed valuable insights and suggestions in
this process. The original dialogue system was developed by José Lopes and Olov
Engwall, with Ronald Cumbal handling the necessary modifications for the new
study. Per Fallgren provided assistance in the role of Wizard to control the robot
Furhat in the experiments.

9.2 Paper B
Detection of Listener Uncertainty in Robot-Led Second Language Con-
versation Practice - Ronald Cumbal, José Lopes and Olov Engwall

Scientific Contributions: This study demonstrates the substantial challenge
associated with automatically classifying four levels of (un)certainty (i.e.,
no response, clarification response, thoughtful response and direct response) using
audio-visual features. Specifically, results show that intermediate levels of
(un)certainty are frequently confused with one another. The findings highlight
that visual features, particularly Facial Action Units, contribute the majority of
information for the classification model. This study uses the data collected from
dyadic and triadic robot-led practice conversations from Paper A and the work
by Engwall and Lopes (2022), respectively.
Author Contributions: Ronald Cumbal developed the guidelines for an-
notating the conversations, and the data was subsequently annotated by Marine
Bastidas and Ronald Cumbal. Ronald Cumbal processed the data and imple-
mented code to evaluate the detection models with the supervision of José Lopes.
Ronald Cumbal conducted the data analysis and led the writing of the published
paper, with valuable support provided by José Lopes and Olov Engwall through-
out the process.

9.3 Paper C
“You don’t understand me!”: Comparing ASR results for L1 and L2
speakers of Swedish - Ronald Cumbal, Birger Moell, José Lopes and Olov En-
gwall

Scientific Contributions: The evaluation of off-the-shelf Automatic Speech
Recognition (ASR) systems reveals a significant increase (almost double) in Word
Error Rates (WER) for L2 speakers of Swedish compared to their L1 counterparts
(using a read sentences dataset). In a dataset containing spontaneous speech, the
performance of two ASRs (Google and Wav2vec2 from Huggingface) deteriorates
considerably, reaching similar levels for both L1 and L2 speakers. The results for
spontaneous speech with Microsoft Azure ASR are relatively acceptable, but still
show a level that falls short of the recommended performance of 30% WER.
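For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in Paper C; the function name and the example sentences are illustrative assumptions (given in English for readability).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (invented) example: one substituted word out of five.
wer = word_error_rate("can you repeat the question", "can you read the question")
print(f"WER = {wer:.0%}")
```

In this invented example, the single misrecognized word is precisely one of the words a learner would use to request assistance, which is why a moderate overall WER can still be disruptive in practice conversations.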
Author Contributions: The read speech dataset originated from the work
by Wik et al. (2009). Manual transcription of the conversations, used as gold
standards for the evaluation and gathered from both Paper A and the studies
by Engwall and Lopes (2022), was conducted by Gustav Melander and Robin
Wänlund as part of their BSc thesis. Ronald Cumbal processed the audio record-
ings and transcribed them using the Google and Microsoft Azure ASR systems.
Birger Moell was responsible for training the Wav2vec2 model, hosted on Hug-
gingface, and generating transcriptions using this model. The complete evaluation
was supervised by José Lopes and Olov Engwall. Ronald Cumbal led the writing
of the published paper, with valuable inputs provided by José Lopes and Olov
Engwall.

9.4 Paper D
Robot Gaze Can Mediate Participation Imbalance in Groups with Different
Skill Levels - Sarah Gillet, Ronald Cumbal¹, André Pereira, José Lopes,
Olov Engwall and Iolanda Leite

Scientific Contributions: This study demonstrates the ability of a social
robot to effectively balance the spoken participation of players engaged in the
speech-based game Taboo. The pairs of players consist of individuals fluent in
Swedish and those in the process of learning the language, resulting in an inherent
disparity in spoken participation. As a result, the robot is placed in a distinc-
tive scenario where participants exhibit notably different levels of proficiency in
the game’s required skills. The robot uses an adaptive gaze behavior aimed at
encouraging the spoken participation of the less active player. Further analysis
reveals that the effect of the balancing behavior is primarily associated with the
participants’ traits.
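The general idea of directing gaze towards the less active player can be illustrated with a simple rule. The sketch below is a hypothetical simplification and not the policy implemented in Paper D: the function name, the use of accumulated speaking time as the activity measure, and the tolerance threshold are all assumptions made for illustration.

```python
from typing import Dict, Optional


def choose_gaze_target(speaking_time: Dict[str, float],
                       tolerance: float = 0.1) -> Optional[str]:
    """Return the player the robot should look at, or None if participation is balanced.

    speaking_time maps a player label to accumulated speaking time in seconds.
    tolerance is a hypothetical threshold on the relative imbalance.
    """
    total = sum(speaking_time.values())
    if total == 0:
        return None  # nobody has spoken yet, so there is no imbalance to correct
    least_active = min(speaking_time, key=speaking_time.get)
    most_active = max(speaking_time, key=speaking_time.get)
    imbalance = (speaking_time[most_active] - speaking_time[least_active]) / total
    # Only favour the quieter player when the difference is noticeable.
    return least_active if imbalance > tolerance else None


print(choose_gaze_target({"L1 speaker": 40.0, "L2 learner": 10.0}))
```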
Author Contributions: Sarah Gillet and Ronald Cumbal were responsible
for the conceptualization of the study, methodological design, code implementa-
tion, software development, participant recruitment and study execution. André
Pereira, José Lopes, Olov Engwall, and Iolanda Leite contributed by providing
invaluable assistance and supervision across all phases of the research process.
Sarah Gillet assumed the primary role in writing the paper, with Ronald Cumbal
collaborating in the process, and all authors contributing with important feedback
and revisions.

9.5 Paper E
Shaping Unbalanced Multi-Party Interactions through Adaptive Robot
Backchannels - Ronald Cumbal, Daniel Alexander Kazzi, Vincent Winberg and
Olov Engwall

Scientific Contributions: This study demonstrates the effectiveness of a
robot’s adaptive generation of backchannels to encourage more speaking contribution
in participants playing the speaking game Taboo. The participant pairs
consisted of L1 speakers and L2 learners of Swedish, establishing a possible im-
balance in participation. The results show a significant increase in speaking
participation for the least active speaker, i.e. the L2 learners. While previous
research has indicated the general impact of backchannels on extending speaking
time, this study is the first to reveal that such strategies can effectively stim-
ulate second language learners to become more actively involved in unbalanced
speaking interactions.

¹ Shared first authorship (Paper D).


Author Contributions: Ronald Cumbal designed the methodology of the
study, following the work presented in Paper D. Ronald Cumbal was responsible
for code development and setup implementations, with collaboration from Daniel
Alexander Kazzi and Vincent Winberg as part of their BSc thesis, specifically in
the execution of tests and fine-tuning models. Experiments were primarily car-
ried out by Daniel Alexander Kazzi and Vincent Winberg, with Ronald Cumbal
providing some assistance. Both Ronald Cumbal and Olov Engwall contributed
to the data analysis. Olov Engwall provided guidance and supervision through-
out the entirety of the process. Ronald Cumbal took the lead in composing the
paper, with substantial input from Olov Engwall during the process.

9.6 Paper F
Socio-cultural perception of robot backchannels - Olov Engwall, Ronald
Cumbal and Ali Reza Majlesi

Scientific Contributions: This study highlights the difference in responses
of various socio-cultural groups to a robot’s backchannel strategy described in
Paper E. The findings suggest that among L2 speakers, individuals who are
male, younger, more extroverted, and possess greater familiarity with robots tend
to be more motivated by the additional backchannel attention from the robot.
In contrast, female, older, and more introverted L2 speakers may not experience
the same encouragement to speak more, and in some cases, may even speak less.
This analysis is an extension of the data collected from Paper E, looking beyond
the conventional L1 and L2 participant pairings to include pairings among L1
individuals.
Author Contributions: This paper is based on the experiment reported
in Paper E, for which Ronald Cumbal was the main investigator, as described
above. Olov Engwall led the processing and in-depth analysis of the data gath-
ered from Paper E. Ali Reza Majlesi took the lead in ethnomethodology and
multimodal conversation analysis. Ronald Cumbal was responsible for the pro-
cessing of transcripts and data. Olov Engwall assumed responsibility for writing
the published paper, with supporting input from both Ronald Cumbal and Ali
Reza Majlesi.

9.7 Paper G
Stereotypical Nationality Representations in HRI: Perspectives from
International Young Adults - Ronald Cumbal, Agnes Axelsson, Shivam Mehta
and Olov Engwall

Scientific Contributions: The findings in this study suggest that pre-existing
(negative) perceptions regarding nationality-encoded robots become less
significant after actual interactions, regardless of the interaction medium employed
(whether virtual or physical). These observations indicate that attitudes
toward stereotypical national representations in HRI have a weak effect, at least
within the user group studied (primarily educated young students in an inter-
national setting). The research uses nationality-encoded representations in the
appearance and accents of social robots that are commercially accessible.
Author Contributions: Ronald Cumbal led the conceptualization, analysis,
and execution of the experiments. Agnes Axelsson provided the primary
software guiding robot interactions and offered valuable insights during the
conceptualization phase. Shivam Mehta contributed significantly to the development
and deployment of accented voices. Olov Engwall provided support in
conceptualization and supervision throughout the research process. Additional data
collection efforts were undertaken by Willhelm Ahlqvist and Anton Wennmark
for the physical robot interaction, and by Hugo E. Norberg, Karim Nettelbladt,
and Philip Nilsson for the virtual robot interaction, as part of their respective BSc
theses. Ronald Cumbal assumed the principal responsibility for paper writing,
with substantial collaboration from Agnes Axelsson and contributions from Olov
Engwall across various sections of the manuscript.

9.8 Paper H
Speaking Transparently: Social Robots in Educational Settings - Ronald
Cumbal and Olov Engwall

Scientific Contributions: This paper presents a system that leverages
confidence levels from its various components to denote uncertainty within a
robot’s decision-making process, predominantly through expressions used in con-
versations with pairs of participants. Additionally, the paper outlines the next
steps in conducting a formal experiment to evaluate whether enhanced trans-
parency in a robot’s dialogue influences participants’ perception and trust during
interactions with the robot.
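As an illustration of the principle, component confidences can be combined and mapped onto spoken hedging expressions. The sketch below is a hypothetical example and not the system presented in Paper H: the component names, the choice of the minimum as the combination rule, the thresholds and the phrases are all assumptions.

```python
from typing import Dict


def verbalize(message: str, confidences: Dict[str, float]) -> str:
    """Prefix a robot utterance with a hedge that reflects its weakest component.

    message is a clause starting in lowercase, e.g. "the word is 'bridge'".
    confidences maps hypothetical component names (e.g. "asr", "nlu") to values in [0, 1].
    """
    if not message or not confidences:
        raise ValueError("message and confidences must be non-empty")
    overall = min(confidences.values())  # assumption: the least certain component dominates
    if overall >= 0.8:                   # hypothetical threshold for a plain statement
        return message[0].upper() + message[1:]
    if overall >= 0.5:                   # hypothetical threshold for a mild hedge
        return f"I think {message}"
    return f"I am not sure, but {message}"


print(verbalize("the word is 'bridge'", {"asr": 0.62, "nlu": 0.90}))
```

Taking the minimum is only one possible design choice; a weighted combination would let a reliable component compensate for a less reliable one, at the cost of a less conservative expression of uncertainty.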
Author Contributions: Ronald Cumbal led the conceptualization and im-
plementation of the system, along with the development of the proposed method-
ology for the study. Additionally, Ronald Cumbal took responsibility for writing
the paper, with input and manuscript revisions provided by Olov Engwall.
Chapter 10

Discussion and Conclusions

10.1 Research Questions and Findings


At the beginning of this thesis, a pivotal research question was posed to guide
the development of this work:

How can social robots effectively support conversations involving
learners practicing a second language?

In order to answer this question, three different elements were explored and
evaluated throughout the course of this thesis. The following discussion outlines
these elements.

Understanding Conversations with L2 Learners


Initially, this work evaluated the role of robots in a social conversation and ex-
plored the dynamics that appeared when interacting with L2 learners. A com-
prehensive evaluation of how learners navigate instances of uncertainty revealed
a preference among many learners to seek clarification when in doubt. However,
a notable portion of learners remained unresponsive in moments of confusion,
emphasizing the need for social robots to be able to differentiate between varying
degrees of uncertainty in a learner’s interaction.
It is important to note that confusion or uncertainty should not be viewed solely
as negative states in the learning process. Indeed, confusion, sometimes entangled
with states of frustration, has been associated with both positive (D’Mello et al.,
2014; Lehman et al., 2013) and negative (Rodrigo et al., 2009; Schneider et al.,
2016) effects on learning performance. The disparity in these findings could be
attributed to factors such as the duration of confusing events (e.g. persistent
confusion seems to lead to no or negative effects (Lee et al., 2011a; Liu et al.,
2013; Rodrigo et al., 2010)) or the presence of support and metacognitive skills
in students (Di Leo et al., 2019; Liu et al., 2013). Certainly, context also plays
an important role in the effect of confusion in a learning process (Richey et al.,
2021) and further research is required to understand the effects of confusion in
L2 practice conversations.
Considering that learners may not always request clarification due to limita-
tions in their speaking skills, it is critical for a robot to appropriately address
such instances. This involves distinguishing whether a learner is formulating a
response, where resolution may not be necessary, or if the learner has lost track
of the conversation, requiring formal resolution. Notably, inferring someone’s
(un)certainty state is a very complex and subjective task, requiring further re-
search. Nonetheless, it is important to recognize that in line with the Zone of
Proximal Development principle, collaborating with an educator is essential for
a student’s progress in overcoming these challenges.
Moreover, our assessment of Automatic Speech Recognition performance with
L1 and L2 speakers highlighted that speech recognition for L2 learners remains
less than optimal compared to L1 speakers. The study specifically revealed even
poorer performance in the context of a social conversation, including the
misrecognition of important words that signal requests for assistance, for example
“understand” and “repeat”. Additional findings from our team’s research, led by
Engwall et al. (2022), supported existing work suggesting that the automatic
development of dialogues can still succeed despite reduced ASR accuracy. However,
further investigation is necessary to understand how valuable practice conversations
are if they prolong speaking interactions but fail to recognize certain elements
in the dialogue, including inaccuracies in utterance formulations. Considering the
challenges involved in identifying issues in speaking interactions and the further
complexity in providing nuanced and personalized feedback, the research path
towards introducing autonomous robots leading conversations with L2 speakers
faces many exciting challenges yet to be addressed.
These findings should not be interpreted as constraints for roboticists or
researchers in HRI. Rather, they are an encouragement both to address the current
technical limitations that restrict the range of activities a robot can undertake
and to expand the ways in which a robot can facilitate L2 conversation practice.
The latter idea served as the inspiration for the next element explored in this
thesis.

Enhancing Speaking Practice


The second part of this thesis argued that the impact of robots on group dynamics
could play a crucial role in enhancing speaking practice for L2 learners. It was
emphasized that by introducing either highly proficient speakers or L1 speakers,
we could replicate a “real-world” scenario and simultaneously examine challenging
situations for less proficient L2 speakers. In this context, our studies revealed
that a robot’s effort to balance speaking contributions with an adaptive gaze
behavior in a language game was successful. Furthermore, the adaptive use of
encouraging backchannel techniques proved to significantly increase the speaking
time of L2 learners. It is noteworthy that learners might find it stressful to express
themselves in a different language, a phenomenon demonstrated by Wilcock and
Yamamoto (2015) in the case of Japanese students speaking in English.
In this context, we draw attention to the intricacies inherent in the interaction
between L1 and L2 speakers. Our team’s research employed Ethnomethodology
and Conversation Analysis (EMCA) to assess errors in turn-taking between L1
and L2 learners when the robot made an introduction mistake (detailed in Paper
IV, Majlesi et al. (2023)). However, we did not delve deeper into the interactions
between L1 and L2 speakers, despite observing high levels of collaboration
during the experiments. Understandably, such dynamics should also be consid-
ered in the context of the reactions that a robot is capable of processing. A
simple characteristic of these interactions is the possibility that participants co-
operate in different languages (e.g., code-switching (Nurhamidah et al., 2018)),
which may not align with the targeted second language intended for interaction
with the robot. Certainly, recent advances in processing speech may enable the
recognition of different languages in a conversation. However, doubts may arise
regarding the accuracy with which this is accomplished. While it can be argued
that human educators may not possess this characteristic and still provide proper
guidance, social robots should ideally be equipped to handle such circumstances
appropriately.
Furthermore, this work also underscores phenomena related to the cultural
aspects that underlie such interactions. In particular, we assert that this setting
represents “real-world” interactions, but these interactions should be valued
for more than their ecological validity: both research in social robots and the
studies themselves should serve a higher purpose for the communities that
researchers “use” for their research. In this case, we emphasize that this work
pursues pro-social goals for robots, namely exploring potential ways in which this
technology could enhance the integration of L2 speakers into a community.

Cultural Perspectives
The third point emphasized in this thesis is the imperative for additional research
specifically addressing cultural aspects in human-robot interactions. This work
not only showcased diverse reactions to the same robot behavior based on indi-
viduals’ cultural backgrounds, but also explored the perspectives that a group of
people could hold regarding social robots encoded with different cultural traits.
While the original results indicated that adapting the robot’s backchanneling
strategy could influence participants’ speaking time, a deeper examination re-
vealed distinctive responses among various socio-cultural groups, with a focus on
L1 and L2 aspects. The findings emphasize that among L2 speakers, factors such
as gender, age, extroversion, and familiarity with robots influence how the behavior
of the robot is received when participants are encouraged to speak more through
backchannels. It becomes evident that tailoring robot behaviors, such as
backchannels, based on these socio-cultural factors is crucial, given the varying
responsiveness observed among different participant groups.
Furthermore, the investigation into preconceptions related to appearance and
accents revealed that individuals’ immediate perception of a virtual robot could
lead to negative preconceptions regarding perceived competence and likability.
However, these preconceptions diminished after actual interactions. Importantly,
the positive outcome from these studies suggests that prejudice related to na-
tionality stereotypes does not strongly influence human-robot interaction, par-
ticularly within the examined subject group of young university students in an
international setting.
These studies emphasize the significance of considering cultural nuances in
designing robot interactions. Understanding how cultural factors influence par-
ticipants’ responses and perceptions is crucial for creating effective and inclusive
human-robot interactions in language learning and beyond.

10.2 Additional Factors for Reflection

Educators’ Perspectives on Robots


In Chapter 3, a brief exploration of public attitudes towards robots was intro-
duced. It is essential to now highlight the perspective of educators in the context
of human-robot interaction. While participatory research is gaining prominence
in HRI, the involvement of various stakeholders in the development of social
robots remains relatively limited. Notably, educators’ opinions on the utility of
this technology are often undervalued, and their concerns tend to revolve around
practical considerations that may not be adequately addressed by roboticists.
Teachers express apprehensions related to potential disruptions to teaching
processes, increased workload, and the fear that robots might replace interper-
sonal relationships. Similar concerns have been voiced by workers in healthcare
who have integrated robots into their work environments (Wright, 2023). One
possible explanation for these concerns is that individuals may lack a realistic
understanding of the characteristics and capabilities of currently available social
robots (Reich-Stiebert et al., 2019). Therefore, it becomes crucial to assess, as
researchers, how prepared different stakeholders are to introduce robots into class-
rooms. Addressing these concerns and ensuring effective communication about
the capabilities and limitations of social robots is essential for fostering acceptance
and successful integration in educational settings.

Social Robots for the “real-world”


The initial premise of this thesis proposed that “social robots [could] emerge
as one possible alternative for practicing speaking skills in socially driven con-
versations” for L2 learners, with a particular focus on supporting immigrants in
integrating into new societies. However, two significant challenges arose that war-
rant reevaluation of our experimental approach: firstly, the tendency to sample
L2 learners predominantly from WEIRD (Western, educated, industrialized, rich,
and democratic (Henrich et al., 2010)) countries, and secondly, the need to align
research objectives more closely with the needs of the intended end-users.
First, while initial studies in this thesis involved L2 learners from the SFI
program¹, representing a less-privileged segment of the immigrant population, over
time, participants were mainly recruited from demographics that to a large extent
fit the WEIRD profile. While the findings remain relevant for studying robot
interactions with L2 learners, the exclusion of immigrants from non-WEIRD
societies may weaken the potential support for sectors of society requiring more
assistance. Notably, significant differences exist among immigrant populations.
For instance, less-privileged immigrants will have varying degrees of literacy, and
some may have missed formal education as children, which should be considered
in research focused on robots supporting education (Blommaert, 2010).
Secondly, although efforts were made to collaborate with educators in designing
robot interactions, insufficient attention was given to involving less-privileged
immigrant communities directly. Given the emphasis on the value of conversations
promoted through this work, it would have been crucial to engage these communities in
defining the research problem, gathering data, and preparing technical designs.
For instance, it is essential to recognize that issues like racism, discrimination,
equality, and diversity are inherent aspects of the immigrant experience. In this
context, Doyle (2015) highlights significant work demonstrating how group dis-
cussions among students can be used to combat negative comments in the life
and workplace of immigrants. Consequently, it is not unexpected that these top-
ics might surface in social interactions, although the extent to which they would
emerge in conversations involving robots remains uncertain. Here, HRI research
could benefit from recent pedagogical efforts aimed at training teachers to avoid
delegitimizing students with limited or no formal education (Santos and Shandor,
2012; Simpson and Whiteside, 2015), and explore how these strategies could be
applied in human-robot interaction contexts.
Surely, addressing these topics comprehensively could merit a separate thesis,
but they should be integral considerations in studies aiming to support
immigrant communities. HRI research stands to benefit from a thorough re-
assessment of the roles social robots can play in society, including a critical ex-
amination of potential power imbalances and their implications for fairness in
research outcomes (Winkle et al., 2023).

¹ Swedish for Immigrants (SFI), or Svenskundervisning för invandrare in Swedish, is the national
course in Swedish offered for free to most categories of immigrants.


10.3 Ethical Concerns


Terminology in Technical Studies
Throughout this thesis, the use of terms such as native or non-native speakers has
been minimized, with a preference for employing L1 or L2 speakers when possible.
This decision aligns with previous work that highlighted the negative side effects
associated with the use of such terms (Cheng et al., 2021). It is important to
note that this understanding was derived from suggestions by colleagues from a
different field and does not necessarily represent a common practice within the
field of HRI and social robotics. This tendency seems to be consistent with other
technical fields. In the preparation leading up to the work presented in Paper G,
it was observed that the way culture is addressed often lacks a social perspective.
For instance, terms like nationality and ethnicity are frequently used interchangeably in
technical studies addressing bias or stereotypes in vision-based technology, even
though the scope of bias or stereotypes may extend beyond these conceptions,
involving factors that cannot be easily segregated within this terminology. While
additional research in cultural aspects is warranted in the context of social robots,
it is essential that such research is conducted in a manner that acknowledges the
complexities of this focus, as is done in social science fields.

Data Collection
During the data collection process for this thesis, careful attention was dedicated
to the handling of video and audio recordings, as well as their storage and presentation.
In every study conducted, subjects were provided with an informed consent form
that clearly communicated the purpose of the study, the nature of the data to be
collected, and the intended use of the recorded materials. Participants were also
granted the option to opt out of allowing the use of their data, including decisions
regarding the utilization of their image or voice in academic publications and
presentations. Moreover, participants were informed about the voluntary nature
of their participation and were assured the option to discontinue the experiment
at any point.
In terms of data storage and sharing, robust measures were implemented to
prevent any improper use. All collected data was securely stored within the
university’s official cloud servers. Access to these servers is exclusively granted
through permissions provided by the research team, and there is a designated
time frame for other collaborators to access the data. A sample of the consent
form is shown in Figure 10.1.

10.4 Personal Reflections


Recently, I found myself engaged in a group discussion regarding how HRI re-
searchers should label or introduce robots to avoid perpetuating (negative) stereo-
types associated with gender. One comment suggested that altering terms to
eliminate gender identity might be unnecessary, as our focus is solely on robot
research, implying that such changes would have no significant impact (in soci-
ety). I would like to contest this notion. Expanding upon the sentiment expressed
by Turkle (2007) regarding the development of social robots, where researchers
“are not only building robots, but a robot culture”, I argue that we are indeed
influencing broader societal changes, albeit to varying degrees. Consequently,
throughout my thesis work, I have attempted to reevaluate the role of robots in
education and second language learning, with particular attention to carefully
analyzing how robots can be introduced into these crucial roles. In light of this,
I firmly believe that the way we plan, execute, and present our research holds
considerable importance for society and should not be underestimated.

Figure 10.1: Example of a consent form used for Paper D.


References

Zsuzsanna Ittzes Abrams. The effect of synchronous and asynchronous cmc on oral
performance in german. The Modern Language Journal, 87(2):157–167, 2003.

Alicia Adsera and Mariola Pytlikova. The role of language in shaping interna-
tional migration. CReAM Discussion Paper Series 1206, Centre for Research and
Analysis of Migration (CReAM), Department of Economics, University College
London, Feb 2012. URL https://ideas.repec.org/p/crm/wpaper/1206.html.

Nese Alyüz, Eda Okur, Ece Oktay, Utku Genc, Sinem Aslan, Sinem Emine Mete,
David Stanhill, Bert Arnrich, and Asli Arslan Esme. Towards an emotional
engagement model: Can affective states of a learner be automatically detected
in a 1:1 learning scenario? In UMAP (Extended Proceedings), 2016.

Yusuke Arano. Interculturality as an interactional achievement: Doubting others’
nationality and accounting for the doubt. Journal of International and Intercultural
Communication, 12(2):167–189, 2019.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler,
Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor
Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint
arXiv:1912.06670, 2019.

John Baldwin, Sandra Faulkner, Michael Hecht, and Sheryl Lindsley. Redefin-
ing culture: Perspectives across the disciplines. LEA’s communication series.
Lawrence Erlbaum Associates Publishers, Mahwah, New Jersey, 01 2006.

Anne-Marie Barraja-Rohan. Using conversation analysis in the second language
classroom to teach interactional competence. Language Teaching Research, 15
(4):479–507, 2011.

Muzakki Bashori, Roeland van Hout, Helmer Strik, and Catia Cucchiarini. Web-
based language learning and speaking anxiety. Computer Assisted Language
Learning, 35(5-6):1058–1089, 2022.

Mike Baynham. Agency and contingency in the language learning of refugees and
asylum seekers. Linguistics and education, 17(1):24–39, 2006.


Ken Beatty. Teaching & researching: Computer-assisted language learning.
Routledge, 2013.

Tony Belpaeme and Fumihide Tanaka. Social robots as educators. In OECD
Digital Education Outlook 2021: Pushing the Frontiers with Artificial Intelli-
gence, Blockchain and Robots, page 143. OECD Publishing, Paris, 2021.

Tony Belpaeme, James Kennedy, Paul Baxter, Paul Vogt, Emiel EJ Krahmer,
Stefan Kopp, Kirsten Bergmann, Paul Leseman, Aylin C Küntay, Tilbe Göksun,
et al. L2tor-second language tutoring using social robots. In Proceedings of the
ICSR 2015 WONDER Workshop, 2015.

Tony Belpaeme, James Kennedy, Aditi Ramachandran, Brian Scassellati, and Fu-
mihide Tanaka. Social robots for education: A review. Science robotics, 3(21):
eaat5954, 2018a.

Tony Belpaeme, Paul Vogt, Rianne Van den Berghe, Kirsten Bergmann, Tilbe
Göksun, Mirjam De Haas, Junko Kanero, James Kennedy, Aylin C Küntay, Ora
Oudgenoeg-Paz, et al. Guidelines for designing social robots as second language
tutors. International Journal of Social Robotics, 10:325–341, 2018b.

Courtney K Blackwell, Alexis R Lauricella, Ellen Wartella, Michael Robb, and
Roberta Schomburg. Adoption and use of technology in early education: The
interplay of extrinsic barriers and teacher attitudes. Computers & Education, 69:
310–319, 2013.

Jan Blommaert. The sociolinguistics of globalization. Cambridge University Press,
2010.

Katarzyna Boduch-Grabka and Shiri Lev-Ari. Exposing individuals to foreign
accent increases their trust in what nonnative speakers say. Cognitive Sci-
ence, 45(11):e13064, 2021. doi: https://doi.org/10.1111/cogs.13064. URL
https://onlinelibrary.wiley.com/doi/abs/10.1111/cogs.13064.

Nigel Bosch, Yuxuan Chen, and Sidney D’Mello. It’s written on your face: detect-
ing affective states from facial expressions while learning computer programming.
In Intelligent Tutoring Systems: 12th International Conference, ITS 2014, Hon-
olulu, HI, USA, June 5-9, 2014. Proceedings 12, pages 39–44. Springer, 2014.

Nigel Bosch, Sidney D’Mello, Ryan Baker, Jaclyn Ocumpaugh, Valerie Shute,
Matthew Ventura, Lubin Wang, and Weinan Zhao. Automatic detection of
learning-centered affective states in the wild. In Proceedings of the 20th interna-
tional conference on intelligent user interfaces, pages 379–388, 2015.

S.E. Brennan and M. Williams. The feeling of another’s knowing: Prosody and filled
pauses as cues to listeners about the metacognitive states of speakers. Journal of
Memory and Language, 34(3):383–398, 1995. ISSN 0749-596X.
doi: https://doi.org/10.1006/jmla.1995.1017.
URL http://www.sciencedirect.com/science/article/pii/S0749596X85710170.

Joost Broekens, Marcel Heerink, Henk Rosendal, et al. Assistive social robots in
elderly care: a review. Gerontechnology, 8(2):94–103, 2009.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy dispar-
ities in commercial gender classification. In Conference on fairness, accountability
and transparency, pages 77–91. PMLR, 2018.

Donn Byrne, William Griffitt, and Daniel Stefaniak. Attraction and similarity of
personality characteristics. Journal of Personality and Social Psychology, 5(1):
82, 1967.

Keith Cameron. Computer assisted language learning (CALL): media, design, and
applications. Swets & Zeitlinger, 1999.

Craig Chaudron. A descriptive model of discourse in the corrective treatment of
learners’ errors. Language learning, 27(1):29–46, 1977.

Lauretta SP Cheng, Danielle Burgess, Natasha Vernooij, Cecilia Solís-Barroso,
Ashley McDermott, and Savithry Namboodiripad. The problematic concept of native
speaker in psycholinguistics: Replacing vague and harmful terminology with in-
clusive and accurate measures. Frontiers in Psychology, page 3980, 2021.

Vijay Chidambaram, Yueh-Hsuan Chiang, and Bilge Mutlu. Designing persua-
sive robots: how robots might persuade people using vocal and nonverbal cues.
In Proceedings of the seventh annual ACM/IEEE international conference on
Human-Robot Interaction, pages 293–300, 2012.

Noam Chomsky. Aspects of the Theory of Syntax. MIT press, 1965.

Dorothy M Chun. Computer-assisted language learning. In Handbook of research
in second language teaching and learning, pages 663–680. Routledge, 2011.

Herbert H Clark and Edward F Schaefer. Contributing to discourse. Cognitive
science, 13(2):259–294, 1989.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth
Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning
evaluation of universal representations of speech. In 2022 IEEE Spoken Language
Technology Workshop (SLT), pages 798–805. IEEE, 2023.

Katelyn M Cooper, Virginia R Downing, and Sara E Brownell. The influence of
active learning practices on student anxiety in large-enrollment college science
classrooms. International Journal of STEM Education, 5(1):1–18, 2018.

Ronald Cumbal, José Lopes, and Olov Engwall. Detection of listener uncertainty
in robot-led second language conversation practice. In Proceedings of the 2020
International Conference on Multimodal Interaction, ICMI ’20, pages 625–629,
New York, NY, USA, 2020a. Association for Computing Machinery. ISBN
9781450375818. doi: 10.1145/3382507.3418873.
URL https://doi.org/10.1145/3382507.3418873.

Ronald Cumbal, José Lopes, and Olov Engwall. Uncertainty in robot assisted
second language conversation practice. In Companion of the 2020 ACM/IEEE
International Conference on Human-Robot Interaction, HRI ’20, pages 171–173,
New York, NY, USA, 2020b. Association for Computing Machinery. ISBN
9781450370578. doi: 10.1145/3371382.3378306.
URL https://doi.org/10.1145/3371382.3378306.

Nils Dahlbäck, QianYing Wang, Clifford Nass, and Jenny Alwin. Similarity is more
important than expertise: Accent effects in speech interfaces. In Proceedings of
the SIGCHI conference on Human factors in computing systems, pages 1553–
1556, 2007.

Elise J Dallimore, Julie H Hertenstein, and Marjorie B Platt. Impact of
cold-calling on student voluntary participation. Journal of Management Education,
37(3):305–341, 2013.

Nick Degens, Birgit Endrass, Gert Jan Hofstede, Adrie Beulens, and Elisabeth
André. ‘What I see is not what you get’: why culture-specific behaviours for
virtual characters should be user-tested across cultures. AI & society, 32:37–49,
2017.

S. V. Dekker, J. Duarte, and H. Loerts. ‘Who really speaks like that?’ – children’s
implicit and explicit attitudes towards multilingual speakers of Dutch. Interna-
tional Journal of Multilingualism, 18(4):551–569, 2021.
doi: 10.1080/14790718.2021.1908297.
URL https://doi.org/10.1080/14790718.2021.1908297.

Ivana Di Leo, Krista R Muis, Cara A Singh, and Cynthia Psaradellis. Curiosity...
Confusion? Frustration! The role and sequencing of emotions during mathematics
problem solving. Contemporary educational psychology, 58:121–137, 2019.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shus-
ter, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al.
The second conversational intelligence challenge (ConvAI2). In The NeurIPS’18
Competition, pages 187–208. Springer, 2020.

Allen T Dittmann and Lynn G Llewellyn. Relationship between vocalizations and
head nods as listener responses. Journal of personality and social psychology, 9
(1):79, 1968.

Sidney K D’Mello, Art Graesser, and Brandon King. Toward spoken human–
computer tutorial dialogues. Human–Computer Interaction, 25(4):289–323, 2010.

Melissa Donnermann, Philipp Schaper, and Birgit Lugrin. Social robots in ap-
plied settings: A long-term study on adaptive robotic tutors in higher education.
Frontiers in Robotics and AI, 9:831633, 2022.

Sandra Doyle. Getting to grips with the English language. In Adult Language
Education and Migration, pages 162–172. Routledge, 2015.

Starkey Duncan. Some signals and rules for taking speaking turns in conversations.
Journal of personality and social psychology, 23(2):283, 1972.

Starkey Duncan. On the structure of speaker–auditor interaction during speaking
turns. Language in society, 3(2):161–180, 1974.

Starkey Duncan and Donald W Fiske. Face-to-face interaction: Research, methods,
and theory. Routledge, 1977. doi: https://doi.org/10.4324/9781315660998.

Sidney D’Mello, Blair Lehman, Reinhard Pekrun, and Art Graesser. Confusion can
be beneficial for learning. Learning and Instruction, 29:153–170, 2014.

Carole Edelsky. Who’s got the floor? Language in society, 10(3):383–421, 1981.

Rod Ellis. The study of second language acquisition. Oxford University, 1994.

Olov Engwall and José Lopes. Interaction and collaboration in robot-assisted lan-
guage learning for adults. Computer Assisted Language Learning, 35(5-6):1273–
1309, 2022.

Olov Engwall, José Lopes, and Anna Åhlund. Robot interaction styles for con-
versation practice in second language learning. International Journal of Social
Robotics, 13(2):251–276, 2021.

Olov Engwall, José Lopes, and Ronald Cumbal. Is a Wizard-of-Oz required for
robot-led conversation practice in a second language? International Journal of
Social Robotics, 14(4):1067–1085, 2022.

Søren W Eskildsen and Johannes Wagner. Embodied L2 construction learning.
Language Learning, 65(2):268–297, 2015.

Anita Ferreira and John Atkinson. Designing a feedback component of an intelligent
tutoring system for foreign language. In International Conference on Innovative
Techniques and Applications of Artificial Intelligence, pages 277–290. Springer,
2008.

Anita Ferreira, Johanna D Moore, and Chris Mellish. A study of feedback strate-
gies in foreign language classrooms and tutorials with implications for intelligent
computer-assisted language learning systems. Int. J. Artif. Intell. Educ., 17(4):
389–422, 2007.

Samantha Finkelstein, Evelyn Yarzebinski, Callie Vaughn, Amy Ogan, and Justine
Cassell. The effects of culturally congruent educational technologies on student
achievement. In Artificial Intelligence in Education: 16th International Confer-
ence, AIED 2013, Memphis, TN, USA, July 9-13, 2013. Proceedings 16, pages
493–502. Springer, 2013.

Alan Firth. The discursive accomplishment of normality: On ‘lingua franca’ English
and conversation analysis. Journal of pragmatics, 26(2):237–259, 1996.

Bilal Genc and Erdogan Bada. Culture in language learning and teaching. The
reading matrix, 5(1), 2005.

Bart Geurts. Communication as commitment sharing: speech acts, implicatures,
common ground. Theoretical linguistics, 45(1-2):1–30, 2019.

Mary M Gill. Accent and stereotypes: Their effect on perceptions of teachers and
lecture comprehension. Journal of Applied Communication Research, 1994. URL
https://doi.org/10.1080/00909889409365409.

Arthur M Glenberg. Embodiment for education. In Handbook of cognitive science,
pages 355–372. Elsevier, 2008.

Ewa M Golonka, Anita R Bowles, Victor M Frank, Dorna L Richardson, and
Suzanne Freynik. Technologies for foreign language learning: A review of tech-
nology types and their effectiveness. Computer assisted language learning, 27(1):
70–105, 2014.

Charles Goodwin. Conversational organization. Interaction between speakers and
hearers, 1981.

Charles Goodwin. Between and within: Alternative sequential treatments of con-
tinuers and assessments. Human studies, 9(2):205–217, 1986.

Charles Goodwin. Co-operative action. Cambridge University Press, 2018.

Charles Goodwin et al. Restarts, pauses, and the achievement of a state of mutual
gaze at turn-beginning. Sociological inquiry, 50(3-4):272–302, 1980.

Goren Gordon, Cynthia Breazeal, and Susan Engel. Can children catch curiosity
from a social robot? In Proceedings of the tenth annual ACM/IEEE international
conference on human-robot interaction, pages 91–98, 2015.

Goren Gordon, Samuel Spaulding, Jacqueline Kory Westlund, Jin Joo Lee, Luke
Plummer, Marayna Martinez, Madhurima Das, and Cynthia Breazeal. Affective
personalization of a social robot tutor for children’s second language skills. In
Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.

Paul Gruba. Computer assisted language learning (CALL). The handbook of applied
linguistics, pages 623–648, 2004.

Joan Kelly Hall. “Aw, man, where you goin’?”: Classroom interaction and the
development of L2 interactional competence. Issues in Applied linguistics, 6(2),
1995.

Luke Harding. Communicative language testing: Current issues and future research.
Language assessment quarterly, 11(2):186–197, 2014.

J. T. Hart. Memory and the feeling-of-knowing experience. Journal of Educational
Psychology, 56:208–216, 1965.

Heidi Hautopp and Thorkild Hanghøj. Game based language learning for bilingual
adults. In Proceedings of the 8th European Conference on Game-Based Learning.
Reading: Academic Conferences and Publishing International, pages 191–198,
2014.

Joseph Henrich, Steven J Heine, and Ara Norenzayan. Most people are not WEIRD.
Nature, 466(7302):29–29, 2010.

Anna Henschel, Guy Laban, and Emily S Cross. What makes a robot social?
A review of social robots from science fiction to a home or hospital near you.
Current Robotics Reports, 2:9–19, 2021.

Graeme Hirst, Susan McRoy, Peter Heeman, Philip Edmonds, and Diane Hor-
ton. Repairing conversational misunderstandings and non-understandings. Speech
Communication, 15(3):213–229, 1994. ISSN 0167-6393.
doi: https://doi.org/10.1016/0167-6393(94)90073-6.
URL http://www.sciencedirect.com/science/article/pii/0167639394900736.
Special issue on Spoken dialogue.

Anna Hjalmarsson. The additive effect of turn-taking cues in human and synthetic
voice. Speech Communication, 53(1):23–35, 2011.

Anna Hjalmarsson, Preben Wik, and Jenny Brusk. Dealing with DEAL: a dialogue
system for conversation training. In Proceedings of the 8th SIGdial Workshop on
Discourse and Dialogue, pages 132–135, 2007.

Stephen A Hockema and Linda B Smith. Learning your language, outside-in and
inside-out. Linguistics, 47, 03 2009. doi: 10.1515/LING.2009.016.

Zeng-Wei Hong, Yueh-Min Huang, Marie Hsu, and Wei-Wei Shen. Authoring
robot-assisted instructional materials for improving learning performance and
motivation in EFL classrooms. J. Educ. Technol. Soc., 19:337–349, 01 2016. URL
https://api.semanticscholar.org/CorpusID:17667686.

Dell Hymes et al. On communicative competence. Sociolinguistics, pages 269–293,
1972.

Mika Ishino. Teachers’ embodied mitigation against allocating turns to unwilling
students. Classroom Discourse, 13(4):343–364, 2022.

W Lewis Johnson and Andre Valente. Tactical language and culture training sys-
tems: Using AI to teach foreign languages and cultures. AI magazine, 30(2):
72–72, 2009.

Gabriele Kasper. Repair in foreign language teaching. Studies in Second Language
Acquisition, 7(2):200–215, 1985.

Gabriele Kasper. Beyond repair: Conversation analysis as an approach to SLA. AILA
review, 19(1):83–99, 2006.

Tatsuya Kawahara, Takashi Yamaguchi, Miki Uesato, Koichiro Yoshino, and Kat-
suya Takanashi. Synchrony in prosodic and linguistic features between backchan-
nels and preceding utterances in attentive listening. In 2015 Asia-Pacific Signal
and Information Processing Association Annual Summit and Conference (AP-
SIPA), pages 392–395, 2015. doi: 10.1109/APSIPA.2015.7415301.

Adam Kendon. Some functions of gaze-direction in social interaction. Acta
psychologica, 26:22–63, 1967.

Kobin H Kendrick. The intersection of turn-taking and repair: the timing of
other-initiations of repair in conversation. Frontiers in psychology, 6:250,
2015.

AlBara Khalifa, Tsuneo Kato, and Seiichi Yamamoto. Measuring effect of repetitive
queries and implicit learning with joining-in-type robot assisted language learning
system. In SLaTE, pages 13–17, 2017.

AlBara Khalifa, Tsuneo Kato, and Seiichi Yamamoto. Learning effect of implicit
learning in joining-in-type robot-assisted language learning system. International
Journal of Emerging Technologies in Learning, 14(2), 2019.

Chandra Khatri, Behnam Hedayatnia, Anu Venkatesh, Jeff Nunn, Yi Pan, Qing Liu,
Han Song, Anna Gottardi, Sanjeev Kwatra, Sanju Pancholi, et al. Advancing the
state of the art in open domain dialog systems through the Alexa Prize. arXiv
preprint arXiv:1812.10757, 2018.

Peter Khooshabeh, Morteza Dehghani, Angela Nazarian, and Jonathan Gratch.
The cultural influence model: When accented natural language spoken by virtual
characters matters. AI & society, 32:9–16, 2017.

YouJin Kim and Kim McDonough. The effect of interlocutor proficiency on the
collaborative dialogue between Korean as a second language learners. Language
teaching research, 12(2):211–234, 2008.

Tetyana Kloubert and Chad Hoggan. Migrants and the labor market: The role and
tasks of adult education. Adult Learning, 32(1):29–39, 2021.

Dominique Knutsen and Ludovic Le Bigot. Managing dialogue: How information
availability affects collaborative reference production. Journal of Memory and
Language, 67(3):326–341, 2012.

Hanae Koiso, Yasuo Horiuchi, Syun Tutiya, Akira Ichikawa, and Yasuharu Den.
An analysis of turn-taking and backchannels based on prosodic and syntactic
features in Japanese Map Task dialogs. Language and speech, 41(3-4):295–321,
1998.

Jacqueline M Kory Westlund, Sooyeon Jeong, Hae W Park, Samuel Ronfard, Arad-
hana Adhikari, Paul L Harris, David DeSteno, and Cynthia L Breazeal. Flat vs.
expressive storytelling: Young children’s learning and retention of a social robot’s
narrative. Frontiers in human neuroscience, 11:295, 2017.

Maria Kowal and Merrill Swain. Using collaborative language production tasks to
promote students’ language awareness. Language awareness, 3(2):73–93, 1994.

Claire Kramsch. From language proficiency to interactional competence. The
modern language journal, 70(4):366–372, 1986.

Brigitte Krenn, Stephanie Schreitter, and Friedrich Neubarth. Speak to me and I
tell you who you are! A language-attitude study in a cultural-heritage application.
AI & society, 32:65–77, 2017.

Ming-Mu Kuo and Cheng-Chieh Lai. Linguistics across cultures: The impact of
culture on second language learning. Online Submission, 1(1), 2006.

Shuya Kushida. Confirming understanding and acknowledging assistance: Managing
trouble responsibility in response to understanding check in Japanese
talk-in-interaction. Journal of Pragmatics, 43(11):2716–2739, 2011.

Sébastien Lallé, Cristina Conati, and Giuseppe Carenini. Predicting confusion in
information visualization from eye tracking and interaction data. In IJCAI, pages
2529–2535, 2016.

Diane Marie C Lee, Ma Mercedes T Rodrigo, Ryan SJ d Baker, Jessica O Sugay,
and Andrei Coronel. Exploring the relationship between novice programmer
confusion and achievement. In Affective Computing and Intelligent Interaction:
4th International Conference, ACII 2011, Memphis, TN, USA, October 9–12,
2011, Proceedings, Part I 4, pages 175–184. Springer, 2011a.

Sungjin Lee, Hyungjong Noh, Jonghoon Lee, Kyusong Lee, Gary Geunbae Lee,
Seongdae Sagong, and Munsang Kim. On the effectiveness of robot-assisted
language learning. ReCALL, 23(1):25–58, 2011b.

Michael J Leeser. Learner proficiency and focus on form during collaborative dia-
logue. Language teaching research, 8(1):55–81, 2004.

Blair Lehman, Sidney D’Mello, Amber Strain, Caitlin Mills, Melissa Gross, Allyson
Dobbins, Patricia Wallace, Keith Millis, and Art Graesser. Inducing and tracking
confusion with contradictions during complex learning. International Journal of
Artificial Intelligence in Education, 22(1-2):85–105, 2013.

David Kellogg Lewis. Convention: A Philosophical Study. Harvard University
Press, Cambridge, MA, USA, 1969.

Rose Yanhong Li and Mike Kaye. Understanding overseas students’ concerns and
problems. Journal of Higher Education Policy and Management, 20(1):41–50,
1998.

Mei Hui Lim and Vahid Aryadoust. A scientometric review of research trends in
computer-assisted language learning (1977 – 2020). Computer Assisted Language
Learning, 35(9):2675–2700, 2022. doi: 10.1080/09588221.2021.1892768. URL
https://doi.org/10.1080/09588221.2021.1892768.

Diane Litman, Helmer Strik, and Gad S Lim. Speech technologies and the assess-
ment of second language speaking: Approaches, challenges, and opportunities.
Language Assessment Quarterly, 15(3):294–309, 2018.

Diane J Litman, Carolyn P Rosé, Kate Forbes-Riley, Kurt VanLehn, Dumisizwe
Bhembe, and Scott Silliman. Spoken versus typed human and computer dialogue
tutoring. International Journal of Artificial Intelligence in Education, 16(2):145–
170, 2006.

Zhongxiu Liu, Visit Pataranutaporn, Jaclyn Ocumpaugh, and Ryan Baker. Se-
quences of frustration and confusion, and learning. In Educational data mining
2013, 2013.

Birgit Lugrin, Benjamin Eckstein, Kirsten Bergmann, and Corinna Heindl. Adapted
foreigner-directed communication towards virtual agents. In Proceedings of the
18th International Conference on Intelligent Virtual Agents, pages 59–64, 2018.

Roy Lyster and Leila Ranta. Corrective feedback and learner uptake: Negotiation
of form in communicative classrooms. Studies in second language acquisition, 19
(1):37–66, 1997.

Ali Reza Majlesi, Ronald Cumbal, Olov Engwall, Sarah Gillet, Silvia Kunitz, Gus-
tav Lymer, Catrin Norrby, and Sylvaine Tuncer. Managing turn-taking in human-
robot interactions: The case of projections and overlaps, and the anticipation of
turn design by human participants. Social Interaction. Video-based Studies of
Human Sociality, 6(1), 2023.

Matthew Marge, Carol Espy-Wilson, Nigel G Ward, Abeer Alwan, Yoav Artzi,
Mohit Bansal, Gil Blankenship, Joyce Chai, Hal Daumé III, Debadeepta Dey,
et al. Spoken language interaction with robots: Recommendations for future
research. Computer Speech & Language, 71:101255, 2022.

Marie McAuliffe and Binod Khadria. World migration report 2020. 2019.

Conor McGinn and Ilaria Torre. Can you tell the robot by the voice? an ex-
ploratory study on the role of voice in the perception of robots. In 2019 14th
ACM/IEEE international Conference on human-robot interaction (HRI), pages
211–221. IEEE, 2019.

Hazel Morton and Mervyn A Jack. Scenario-based spoken interaction with virtual
agents. Computer Assisted Language Learning, 18(3):171–191, 2005.

Bilge Mutlu, Takayuki Kanda, Jodi Forlizzi, Jessica Hodgins, and Hiroshi Ishiguro.
Conversational gaze mechanisms for humanlike robots. ACM Transactions on
Interactive Intelligent Systems (TiiS), 1(2):1–33, 2012.

Stanislava Naneva, Marina Sarda Gou, Thomas L Webb, and Tony J Prescott.
A systematic review of attitudes, anxiety, acceptance, and trust towards social
robots. International Journal of Social Robotics, 12(6):1179–1201, 2020.

Clifford Nass, Jonathan Steuer, and Ellen R Tauber. Computers are social actors. In
Proceedings of the SIGCHI conference on Human factors in computing systems,
pages 72–78, 1994.

Clifford Ivar Nass and Scott Brave. Wired for speech: How voice activates and
advances the human-computer relationship. MIT press Cambridge, 2005.

Stephen Neale. Paul Grice and the philosophy of language. Linguistics and philos-
ophy, pages 509–559, 1992.

N Nurhamidah, Endang Fauziati, and Slamet Supriyadi. Code-switching in EFL
classroom: Is it good or bad? Journal of English Education, 3(2):78–88, 2018.

E Oksaar. Language contacts within the scope of culture contacts: Behavioral and
structural models. Philippine journal of linguistics, 14(15):246–252, 1983.

Els Oksaar. Language contact and culture contact: Towards an integrative approach
in second language acquisition research. Current Trends in European Second
Language Acquisition Research. Multilingual Matters, Clevedon, pages 10–20,
1990.

Open Robotics. Messages, 2018a. http://wiki.ros.org/Messages, Last accessed on
2024-01-16.

Open Robotics. Nodes, 2018b. http://wiki.ros.org/Nodes, Last accessed on
2024-01-16.

Open Robotics. Topics, 2019. http://wiki.ros.org/Topics, Last accessed on
2024-01-16.

Patricia O’Neill-Brown. Setting the stage for the culturally adaptive agent. In
Proceedings of the 1997 AAAI fall symposium on socially intelligent agents, pages
93–97. AAAI Press Menlo Park, CA, 1997.

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani
Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al.
Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516,
2023.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey,
and Ilya Sutskever. Robust speech recognition via large-scale weak supervision.
In International Conference on Machine Learning, pages 28492–28518. PMLR,
2023.

Natasha Randall. A survey of robot-assisted language learning (RALL). ACM Trans-
actions on Human-Robot Interaction (THRI), 9(1):1–36, 2019.

Byron Reeves and Clifford Nass. The media equation: How people treat computers,
television, and new media like real people. Cambridge, UK, 10(10), 1996.

Natalia Reich-Stiebert, Friederike Eyssel, and Charlotte Hohnemann. Involve the
user! Changing attitudes toward robots by user participation in a robot proto-
typing process. Computers in Human Behavior, 91:290–296, 2019.

J Elizabeth Richey, Jiayi Zhang, Rohini Das, Juan Miguel Andres-Bray, Richard
Scruggs, Michael Mogessie, Ryan S Baker, and Bruce M McLaren. Gaming and
confrustion explain learning advantages for a math digital learning game. In
International conference on artificial intelligence in education, pages 342–355.
Springer, 2021.

Celia Roberts and Melanie Cooke. Authenticity in the adult ESOL classroom and
beyond. TESOL Quarterly, 43(4):620–642, 2009.

Ma Mercedes T Rodrigo, Ryan S Baker, Matthew C Jadud, Anna Christine M
Amarra, Thomas Dy, Maria Beatriz V Espejo-Lahoz, Sheryl Ann L Lim,
Sheila AMS Pascua, Jessica O Sugay, and Emily S Tabanao. Affective and be-
havioral predictors of novice programmer achievement. In Proceedings of the 14th
annual ACM SIGCSE conference on Innovation and technology in computer sci-
ence education, pages 156–160, 2009.

Ma Mercedes T Rodrigo, Ryan SJd Baker, and Julieta Q Nabos. The relationships
between sequences of affective states and learner achievement. In Proceedings
of the 18th international conference on computers in education, pages 56–60.
Universiti Putra Malaysia Malaysia, 2010.

Astrid M Rosenthal-von der Pütten, Carolin Straßmann, and Nicole C Krämer.
Robots or agents–neither helps you more or less during second language acqui-
sition: Experimental study on the effects of embodiment and type of speech
output on evaluation and alignment. In Intelligent Virtual Agents: 16th Inter-
national Conference, IVA 2016, Los Angeles, CA, USA, September 20–23, 2016,
Proceedings 16, pages 256–268. Springer, 2016.

Harvey Sacks. Lectures on conversation: Volume I. Malden, Massachusetts:
Blackwell, 1992.

Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. A simplest systematics for
the organization of turn-taking for conversation. Language, 50(4):696–735, 1974.
ISSN 00978507, 15350665. URL http://www.jstor.org/stable/412243.

Martin Saerbeck, Tom Schut, Christoph Bartneck, and Maddy Janse. Expres-
sive robots in education - varying the degree of social supportive behavior of a
robotic tutor. In 28th ACM Conference on Human Factors in Computing Sys-
tems (CHI2010), pages 1613–1622, Atlanta, 2010a. ACM. doi: 10.1145/1753326.
1753567.

Martin Saerbeck, Tom Schut, Christoph Bartneck, and Maddy D Janse. Expressive
robots in education: varying the degree of social supportive behavior of a robotic
tutor. In Proceedings of the SIGCHI conference on human factors in computing
systems, pages 1613–1622, 2010b.

Maricel G Santos and April Shandor. The role of classroom talk in the creation
of “safe spaces” in adult ESL classrooms. In LESLLA Symposium Proceedings,
volume 7, pages 110–134, 2012.

Emanuel A. Schegloff. Discourse as an interactional achievement: some uses of
‘uh huh’ and other things that come between sentences. In Analyzing Dis-
course: Text and Talk, pages 71–93. Georgetown University Press, Washington,
D.C., 1982. URL
https://repository.library.georgetown.edu/bitstream/handle/10822/555474/GURT_1981.pdf.

Emanuel A Schegloff. Overlapping talk and the organization of turn-taking for
conversation. Language in society, 29(1):1–63, 2000.

Stephen R Schiffer. Meaning. 1972.

Barbara Schneider, Joseph Krajcik, Jari Lavonen, Katariina Salmela-Aro, Michael
Broda, Justina Spicer, Justin Bruner, Julia Moeller, Janna Linnansaari, Kalle
Juuti, et al. Investigating optimal learning moments in US and Finnish science
classes. Journal of Research in Science Teaching, 53(3):400–421, 2016.

Thorsten Schodde, Kirsten Bergmann, and Stefan Kopp. Adaptive robot language
tutoring based on Bayesian knowledge tracing and predictive decision-making. In
Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot
Interaction, pages 128–136, 2017.

Sarah Sebo, Brett Stoll, Brian Scassellati, and Malte F Jung. Robots in groups
and teams: a literature review. Proceedings of the ACM on Human-Computer
Interaction, 4(CSCW2):1–36, 2020.

Paul Seedhouse. The case of the missing “no”: The relationship between pedagogy
and interaction. Language learning, 47(3):547–583, 1997.

Paul Seedhouse. Conversation analysis and language learning. Language teaching,
38(4):165–187, 2005.

Margret Selting. On the interplay of syntax and prosody in the constitution of turn-
constructional units and turns in conversation. Pragmatics. Quarterly Publication
of the International Pragmatics Association (IPrA), 6(3):371–388, 1996.

Sofia Serholt, Wolmet Barendregt, Asimina Vasalou, Patrícia Alves-Oliveira, Aidan
Jones, Sofia Petisca, and Ana Paiva. The case of classroom robots: teachers’
deliberations on the ethical tensions. AI & Society, 32:613–631, 2017.

Michihiro Shimada, Takayuki Kanda, and Satoshi Koizumi. How can a social
robot facilitate children’s collaboration? In Shuzhi Sam Ge, Oussama Khatib,
John-John Cabibihan, Reid Simmons, and Mary-Anne Williams, editors, Social
Robotics, pages 98–107, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
ISBN 978-3-642-34103-8.

James Simpson and Anne Whiteside. Adult language education and migration:
Challenging agendas in policy and practice. Taylor & Francis, 2015.

Gabriel Skantze. Exploring human error recovery strategies: Implications for
spoken dialogue systems. Speech Communication, 45(3):325–341, 2005. ISSN
0167-6393. doi: https://doi.org/10.1016/j.specom.2004.11.005.
URL http://www.sciencedirect.com/science/article/pii/S0167639304001256.
Special Issue on Error Handling in Spoken Dialogue Systems.

Gabriel Skantze. Error Handling in Spoken Dialogue Systems : Managing Uncer-
tainty, Grounding and Miscommunication. PhD thesis, KTH, Speech, Music and
Hearing, TMH, 2007. QC 20100812.

Gabriel Skantze. Predicting and regulating participation equality in human-robot
conversations: Effects of age and gender. In Proceedings of the 2017 ACM/IEEE
International Conference on Human-robot Interaction, pages 196–204, 2017.

Gabriel Skantze, Anna Hjalmarsson, and Catharine Oertel. Turn-taking, feedback
and joint attention in situated human–robot interaction. Speech Communication,
65:50–66, 2014.

Gabriel Skantze, Martin Johansson, and Jonas Beskow. Exploring turn-taking cues
in multi-party human-robot discussions about objects. In Proceedings of the 2015
ACM on international conference on multimodal interaction, pages 67–74, 2015.
Matthijs Smakman and Elly A Konijn. Robot tutors: Welcome or ethically ques-
tionable? In Robotics in Education: Current Research and Innovations 10, pages
376–386. Springer, 2020.
Matthijs Smakman, Paul Vogt, and Elly A Konijn. Moral considerations on social
robots in education: A multi-stakeholder perspective. Computers & Education,
174:104317, 2021.
Bernard Spolsky. Communicative competence, language proficiency, and beyond.
Applied Linguistics, 10(2):138–156, 1989.
Robert Stalnaker. Common ground. Linguistics and philosophy, 25(5/6):701–721,
2002.
Sarah Steber and Sonja Rossi. The challenge of learning a new language in adult-
hood: Evidence from a multi-methodological neuroscientific approach. PLOS
ONE, 16:1–23, 02 2021. doi: 10.1371/journal.pone.0246421. URL https:
//doi.org/10.1371/journal.pone.0246421.
Neomy Storch and Ali Aldosari. Pairing learners in pair work activity. Language
teaching research, 17(1):31–48, 2013.
Merrill Swain, Sharon Lapkin, Ibtissem Knouzi, Wataru Suzuki, and Lindsay
Brooks. Languaging: University students learn the grammatical concept of voice
in french. The Modern Language Journal, 93(1):5–29, 2009.
Sherry Turkle. Authenticity in the age of digital companions. Interaction studies,
8(3):501–517, 2007.
Rianne Van den Berghe, Josje Verhagen, Ora Oudgenoeg-Paz, Sanne Van der Ven,
and Paul Leseman. Social robots for language learning: A review. Review of
Educational Research, 89(2):259–295, 2019.
Alistair Van Moere. A psycholinguistic approach to oral language assessment. Lan-
guage Testing, 29(3):325–344, 2012.
Paul Vogt, Rianne van den Berghe, Mirjam De Haas, Laura Hoffman, Junko
Kanero, Ezgi Mamus, Jean-Marc Montanier, Cansu Oranç, Ora Oudgenoeg-Paz,
Daniel Hernández García, et al. Second language tutoring using social robots: a
large-scale study. In 2019 14th ACM/IEEE International Conference on
Human-Robot Interaction (HRI), pages 497–505. IEEE, 2019.
Lev Semenovich Vygotsky and Michael Cole. Mind in society: Development of
higher psychological processes. Harvard university press, 1978.
Joshua Wainer, Kerstin Dautenhahn, Ben Robins, and Farshid Amirabdollahian.
Collaborating with kaspar: Using an autonomous humanoid robot to foster co-
operative dyadic play among children with autism. In 2010 10th IEEE-RAS
International Conference on Humanoid Robots, pages 631–638. IEEE, 2010.
Yi Hsuan Wang, Shelley S-C Young, and Jyh-Shing Roger Jang. Using tangible
companions for enhancing learning english conversation. Journal of Educational
Technology & Society, 16(2):296–309, 2013.

Yuko Watanabe and Merrill Swain. Effects of proficiency differences and patterns
of pair interaction on second language learning: Collaborative dialogue between
adult esl learners. Language teaching research, 11(2):121–142, 2007.

J Kory Westlund, Leah Dickens, Sooyeon Jeong, Paul Harris, David DeSteno, and
Cynthia Breazeal. A comparison of children learning new words from robots,
tablets, & people. In Proceedings of the 1st international conference on social
robots in therapy and education, 2015.

Jacqueline Kory Westlund, Goren Gordon, Samuel Spaulding, Jin Joo Lee, Luke
Plummer, Marayna Martinez, Madhurima Das, and Cynthia Breazeal. Lessons
from teachers on performing hri studies with young children in schools. In 2016
11th ACM/IEEE International Conference on Human-Robot Interaction (HRI),
pages 383–390. IEEE, 2016.

Preben Wik and Anna Hjalmarsson. Embodied conversational agents in computer
assisted language learning. Speech communication, 51(10):1024–1037, 2009.

Preben Wik, Rebecca Hincks, and Julia Bell Hirschberg. Responses to ville: A
virtual language teacher for swedish. 2009.

Graham Wilcock and Seichi Yamamoto. Towards computer-assisted language learning
with robots, wikipedia and coginfocom. In 2015 6th IEEE International Conference
on Cognitive Infocommunications (CogInfoCom), pages 115–119. IEEE, 2015.

Katie Winkle, Donald McMillan, Maria Arnelid, Katherine Harrison, Madeline
Balaam, Ericka Johnson, and Iolanda Leite. Feminist human-robot interaction:
Disentangling power, principles and practice for better, more ethical hri. In
Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot
Interaction, pages 72–82, 2023.

James P. Wolf. The effects of backchannels on fluency in l2 oral task production.
System, 36(2):279–294, 2008. ISSN 0346-251X. doi:
https://doi.org/10.1016/j.system.2007.11.007. URL
https://www.sciencedirect.com/science/article/pii/S0346251X08000249.

James Wright. Robots Won’t Save Japan: An Ethnography of Eldercare Automation.
Cornell University Press, 2023.

Eiko Yasui. Repair and language proficiency: Differences of advanced and beginning
language learners in an english-japanese conversation group. Texas Papers in
Foreign Language Education, 15(1), 2011.
Langxuan Yin, Timothy Bickmore, and Dharma E Cortés. The impact of linguistic
and cultural congruity on persuasion by conversational agents. In Intelligent Vir-
tual Agents: 10th International Conference, IVA 2010, Philadelphia, PA, USA,
September 20-22, 2010. Proceedings 10, pages 343–349. Springer, 2010.
Victor H Yngve. On getting a word in edgewise. In Papers from the sixth re-
gional meeting Chicago Linguistic Society, April 16-18, 1970, Chicago Linguistic
Society, Chicago, pages 567–578, 1970.
Simin Zeng. Second language learners’ strong preference for self-initiated self-repair:
Implications for theory and pedagogy. Journal of Language Teaching and Re-
search, 10(3):541–548, 2019.
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai
Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al. Google usm:
Scaling automatic speech recognition beyond 100 languages. arXiv preprint
arXiv:2303.01037, 2023.
Zheng Zhang, Ryuichi Takanobu, Qi Zhu, MinLie Huang, and XiaoYan Zhu. Re-
cent advances and challenges in task-oriented dialog systems. Science China
Technological Sciences, pages 1–17, 2020.
Part II

Included Papers
