A vision and speech enabled, customizable virtual assistant for smart environments

Giancarlo Iannizzotto, Dept. for Cognitive Sciences, Psychology, Education and Cultural Studies (COSPECS), University of Messina, Messina, Italy
Lucia Lo Bello, Department of Electrical, Electronic and Computer Engineering (DIEEI), University of Catania, Catania, Italy
Andrea Nucita, Dept. for Cognitive Sciences, Psychology, Education and Cultural Studies (COSPECS), University of Messina, Messina, Italy
Giorgio Mario Grasso, Dept. for Cognitive Sciences, Psychology, Education and Cultural Studies (COSPECS), University of Messina, Messina, Italy

Abstract—Recent developments in smart assistants and smart home automation are attracting the interest and curiosity of consumers and researchers alike. Speech-enabled virtual assistants (often named smart speakers) offer a wide variety of network-oriented services and, in some cases, can connect to smart environments, thus enhancing them with new and effective user interfaces. However, such devices also reveal new needs and some weaknesses. In particular, they are faceless and blind assistants, unable to show a face, and therefore an emotion, and unable to 'see' the user. As a consequence, the interaction is impaired and, in some cases, ineffective. Moreover, most of those devices heavily rely on cloud-based services, thus transmitting potentially sensitive data to remote servers. To overcome such issues, in this paper we combine some of the most advanced techniques in computer vision, deep learning, speech generation and recognition, and artificial intelligence into a virtual assistant architecture for smart home automation systems. The proposed assistant is effective and resource-efficient, interactive and customizable, and the realized prototype runs on a low-cost, small-sized Raspberry Pi 3 device. For testing purposes, the system was integrated with an open source home automation environment and ran for several days, while people were encouraged to interact with it, and proved to be accurate, reliable and appealing.

Index Terms—Smart home, virtual assistant, computer vision, deep learning.

I. INTRODUCTION

In the last decade the concept of smart assistant has become widely known and gained large popularity. Commercial devices such as Amazon Alexa, Google Home and Mycroft are able to interact with the user by means of speech recognition and speech synthesis, offer several network-based services and can interface with smart home automation systems, enhancing them with an advanced user interface. The diffusion of such speech-enabled smart assistants is constantly spreading, mainly thanks to the availability of a large number of network services and of an increasing number of additional skills, or capabilities, that can be easily added to the smart assistants. However, their potential is still limited by their inability to acquire real-time visual information from video data, either about the user or the environment. This also poses some critical security issues, because most speech-enabled smart assistants do not support effective authentication mechanisms while being able to trigger security-critical actions. Face recognition or other identification mechanisms should be required before such devices accept voice commands [1].

Current smart assistants can talk and listen to their users, but cannot "see" them. Moreover, in most cases they do not feature any kind of visual emotional feedback. They are blind and faceless to the user, thus their interaction is often impaired and incomplete and, therefore, less effective and efficient [2]. To overcome the problems listed above, this paper introduces an architecture for building vision-enabled smart assistants, provided with expressive and animated graphical characters and with speech recognition and synthesis. The proposed architecture is specifically devised for, but not limited to, interfacing with smart home and home automation platforms. The resulting smart assistant aims at engaging the user in a very involving and effective interaction, exploiting multimodal and nonverbal communication.

In the next sections, this paper reports a brief description of the related work (Sect. II), a description of the proposed architecture and the developed prototype (Sect. III) and the preliminary experimental results (Sect. IV). Final remarks and future plans conclude the paper (Sect. V).

II. RELATED WORK

Initial research on the effects of embodied virtual agents on human-computer interaction and on the 'persona effect', i.e., the positive effect of the presence of a lifelike character in an interaction environment, dates back more than 20 years [3]. Since then, the original findings have been confirmed several times and in several different applications, while the relevant literature grew enormously, covering a large number of technologies, applications and approaches [4] [5] [6].
Probably one of the most advanced, and recent, embodied virtual agents is SARA, described in [7], which features very accurate and complex abilities for affective and expressive human-computer interaction. However, SARA is principally aimed at affective interaction, analyzing the facial expressions of the user as well as her voice intonation and the spoken text. Moreover, SARA is a large, complex system that could hardly be squeezed into cheap, small-sized and resource-constrained hardware. Several other architectures and implementations were proposed in the literature [8] [9]; however, to our knowledge, so far no embodied, vision- and speech-enabled virtual agents have been presented and released to the public that are able to recognize users by their faces and to run on inexpensive, small-sized consumer devices.

The lack of face identification abilities in most virtual agent software is probably due to the shortage, in the past, of suitable lightweight, yet effective and accurate, face identification approaches. In most cases, until a few years ago, face recognition was inaccurate or required powerful computational resources for the recognition process or for the off-line enrollment of the user pictures [10]. The recent application of deep neural networks is producing a disruptive change in this trend, allowing the development of effective, accurate and lightweight techniques for face recognition [11], which are currently also exploited for user identification in smartphones. It is about time to integrate those technologies into a virtual assistant.

III. SYSTEM ARCHITECTURE

The proposed architecture is fully modular. It is composed of a set of services, a graphical frontend and a coordinator that leverages the services to offer the user a multimodal and involving interaction with the connected home automation and smart assistant systems. In general, each service corresponds to a class of service modules, all offering the same service and exposing the same interface, but characterized by different performance, computational intensity, memory footprints, degree of portability (some modules may rely on proprietary services) and by their potential dependence on external (cloud) services. As a consequence, according to the specific requirements of the smart environment and user needs, a customized smart assistant can be built by combining a suitable set of modules. The modules communicate through sockets and RESTful connections, so, if needed, different modules can be allocated on different processing nodes. In principle, each module might be allocated on a separate node; however, most developed modules have limited requirements in terms of computational power and memory resources, so in most cases a single node is sufficient to run all the involved modules.

An example of the described approach is the Text To Speech (TTS) service. In the proposed architecture, the associated class currently contains three different modules, called the Flite2 module, the MaryTTS module and the GoogleTTS module. The MaryTTS module produces very natural utterances and supports multiple languages and voices. However, it has quite a large footprint (up to 1 GB of RAM), so it is not suitable when other memory-intensive modules are involved. The Flite2 module produces natural, clear and understandable utterances and is lightweight in terms of memory footprint and computational needs, but it currently supports just a few languages and offers few voices (mostly male and US English or Indic speaking). The GoogleTTS module relies on the well-known Google Cloud Speech API [12]. Despite depending on cloud services and requiring a paid subscription, the Google Cloud Speech API offers the benefits of supporting multiple languages and featuring both male and female, very natural, voices.
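As a concrete illustration of the "class of interchangeable modules" concept, the following Python sketch shows what a minimal module of the TTS class could look like. The /say endpoint, the port number and the use of the flite command-line tool are illustrative assumptions, not the actual interface of the Flite2 module, which is not documented here.

```python
# Minimal sketch of a TTS service module exposing a REST interface.
# Assumptions: Flask is available and the 'flite' command-line binary is
# installed; the endpoint name, payload format and port are hypothetical.
import subprocess
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/say", methods=["POST"])
def say():
    text = request.get_json(force=True).get("text", "")
    # Synthesize the utterance to a WAV file with the Flite CLI. A module
    # wrapping MaryTTS or a cloud TTS could expose the same endpoint and
    # simply replace this call.
    subprocess.run(["flite", "-t", text, "-o", "/tmp/utterance.wav"], check=True)
    return jsonify({"status": "ok", "audio": "/tmp/utterance.wav"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5002)
```

Since every module in a class exposes the same interface, the coordinator (or the Graphical Frontend) can switch between the Flite2, MaryTTS and GoogleTTS modules without any change on the caller side.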
A. The Red Virtual Assistant

In order to demonstrate the proposed architecture, in this paper a specific virtual assistant is introduced, which exploits 6 service modules, a graphical frontend and a coordination module. The virtual assistant is named 'Red', after the name given to the frontend interactive character (a red fox). The structure of Red is shown in Fig. 1.

Fig. 1. Schematic representation of the modular structure of Red.

B. The Graphical Frontend and the Speech recognition module

The Graphical Frontend is based on an HTML5 document containing the Javascript code needed to communicate through a Websocket connection with the Coordination module and through a RESTful API with the Speech synthesis and the Speech recognition modules. The character chosen for the interface, the Red fox, was animated in order to create ad hoc, expressive and meaningful videos that are combined and synchronized with the speech at run time. This results in an involving and realistic interaction with the user (see Fig. 2).

Fig. 2. The Red interactive character.

The HTML5 document is delivered by a locally installed HTTPS server to a Chromium web browser, which was specifically chosen as it is open source and fully supports the Google Speech To Text (GSTT) service. The GSTT service was selected as it provides an effective and free solution for reliable and multilingual speech recognition. As a consequence, the Speech recognition module was realized by means of an external service. An alternative module based on the Mozilla open source implementation of DeepSpeech [13], which can run locally, is currently being investigated.

The Speech recognition module is normally idle, neither recording nor sensing audio input, in order to preserve the user's privacy. It is explicitly activated by the Graphical Frontend in predetermined phases of the interaction (e.g., after posing a question to the user) and when the Coordination module signals that the Wake Word has been detected by the Wake Word module.
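The Websocket link between the Coordination module and the Graphical Frontend described above can be pictured with the following Python sketch of the coordinator side. The event names and message fields are illustrative only; the actual protocol used by Red is not reported here.

```python
# Sketch of the Coordination-module end of the Websocket connection to the
# Graphical Frontend, using the 'websockets' package. Event names such as
# "face_detected" and "speak" are hypothetical.
import asyncio
import json
import websockets

async def frontend_handler(websocket):
    # Push example events that the browser-side character controller could
    # translate into animations and synthesized speech.
    await websocket.send(json.dumps({"event": "face_detected", "user": "Stranger"}))
    await websocket.send(json.dumps({"event": "speak",
                                     "text": "Hi, how can I help you?"}))
    # Keep listening for anything the frontend reports back (e.g., recognized text).
    async for message in websocket:
        print("frontend:", message)

async def main():
    async with websockets.serve(frontend_handler, "localhost", 8765):
        await asyncio.Future()  # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```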
C. The Speech synthesis module

The Speech synthesis service provides the speech ability to the Graphical Frontend. Two Speech synthesis modules were taken into consideration for Red: MaryTTS, based on the MaryTTS software [14], and the Flite2 module, a lightweight RESTful TTS server integrating the popular Flite open source TTS software [15] in its version 2.0. Due to the main memory limitations of the adopted hardware, featuring only 2GB of RAM, the lighter Flite2 module was chosen. Although during the tests Red interacted only in English, an Italian voice developed for an earlier version of Flite [16] is currently being ported to this module. A version of the Flite2 module was also released as an open source node.js package (npm package named ianniTTS, https://www.npmjs.com/package/iannitts).

D. The Wake Word detection module

The Wake Word detection service allows the user to trigger the Speech recognition service and give commands or ask questions to Red. A Wake Word module continuously scans the audio input and, when a predetermined utterance is detected (similarly to "Hey, Alexa" for Amazon Alexa, or "Hey, Google" for Google Assistant), it alerts the Coordination module. There are currently two modules in the Wake Word class, namely, the Snowboy module, integrating the SnowBoy deep neural network-based software [17], and the PocketSphinx module, based on the homonymous open source software developed by Carnegie Mellon University [18]. The first module relies on a proprietary library containing the deep neural network that is the foundation of the Snowboy software, and thus is not fully portable. Conversely, the second module is open source and fully portable, but the procedure for encoding the wake word is cumbersome. Both modules were tested with Red, producing very similar performance using "Hey, Red!" as the Wake Word.
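For the PocketSphinx-based module, keyword spotting can be sketched as follows in Python, assuming the pocketsphinx bindings and their LiveSpeech helper; the keyphrase, the threshold value and the way the Coordination module is notified are illustrative choices, not the ones used in Red.

```python
# Sketch of a Wake Word module built on the PocketSphinx Python bindings.
# The keyphrase, the kws_threshold value and the notification mechanism
# are illustrative assumptions of this sketch.
from pocketsphinx import LiveSpeech

speech = LiveSpeech(lm=False, keyphrase="hey red", kws_threshold=1e-20)
for _phrase in speech:
    # Here the real module would alert the Coordination module
    # (e.g., over a socket or a REST call); the sketch only logs the event.
    print("Wake word detected: activating the Speech recognition module")
```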
E. The Face detection module

The Face detection service allows the virtual assistant to detect the presence of a user in front of the device, thus contributing to enable the kind of interactivity that is totally missing in the most common virtual assistants. A Face detection module continuously scans the video input from a connected webcam and, whenever it detects a human face, it alerts the Coordination module. When a face is detected, Red turns its attention to the user facing the camera and greets her. If the identity of the user is available (as the user has been recognized by the Face recognition module), the user is called by name and every change in the identity of the user in front of the camera is signaled by a corresponding utterance ("Hi <name of the user>, how can I help you?").

The Face detection module chosen for Red is the Haar module, based on the fast and lightweight face detector by Viola and Jones [19] and optimized in the most recent version of the OpenCV library [20], but other modules were added to the Face Detection class to be tested with Red.

F. The Face recognition module

The Face recognition service allows the virtual assistant to recognize the user in front of the camera by matching her face against a set of photographs that reside in a local database. The number of photographs is not fixed and new ones can be added at any time, as the service protocol has a specific command for adding a new photograph to the collection, together with the associated identity. Although the service is not intended as a security tool, it has to be very accurate in order to keep the necessary reliability and the trust of the user. Moreover, a correct identification of the user is needed for maintaining the interaction context (the portion of the past interaction needed to correctly deal with the current dialogue). As a consequence, only the most accurate and recent approaches to face recognition were taken into account for the implementation of the Face recognition modules. The technique adopted for the module used for Red is based on an image metric realized through a ResNet-34 deep neural network [21], as modified and trained by Davis King [22] for the Dlib library [23]. This technique approaches an accuracy of 99.38% on the standard "Labeled Faces in the Wild" benchmark [24].
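The combination of the Haar-based Face detection module and the Dlib-based Face recognition module can be summarized by the following Python sketch. The model file names follow the standard OpenCV and Dlib distributions, while the in-memory database and the 0.6 distance threshold (the value suggested in [22] for this model) are illustrative simplifications of the actual service.

```python
# Sketch of the face detection (OpenCV Haar cascade) and face recognition
# (Dlib ResNet descriptor) pipeline. The database format is a simplification;
# the real module keeps photographs and identities in a local database.
import cv2
import dlib
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
shape_predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1(
    "dlib_face_recognition_resnet_model_v1.dat")

known_faces = {}  # identity -> 128-d descriptor, e.g. {"Alice": np.load("alice.npy")}

def identify(frame_bgr, threshold=0.6):
    """Detect the first face in a BGR frame and match it against known_faces."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # nobody in front of the camera
    x, y, w, h = faces[0]
    shape = shape_predictor(rgb, dlib.rectangle(x, y, x + w, y + h))
    descriptor = np.array(face_encoder.compute_face_descriptor(rgb, shape))
    # Euclidean distance in the embedding space learned by the ResNet-34 metric.
    best_name, best_dist = None, float("inf")
    for name, known in known_faces.items():
        dist = np.linalg.norm(known - descriptor)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else "Stranger"
```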
G. The Smart assistant module

A smart assistant is a software agent that can perform tasks or services for human users. Modern smart assistants rely on natural language processing (NLP) to parse the input from the user and extract commands and requests and, in some cases, exploit artificial intelligence to elaborate articulated and meaningful answers when needed. The user input is usually textual (e.g., in chatbots) or vocal (as in the case of modern smart speakers). In some cases, it can also include images, which can be submitted to the agent by taking pictures with a smartphone.

The Smart assistant module in Red acts as an interface to a smart assistant platform, thus integrating its services into the Virtual Assistant. Currently the class only contains the Mycroft module, which integrates the Mycroft open source smart assistant platform [25] into Red. Mycroft provides basic services and a minimal conversational ability, but it is modular and new services can be added by easily installing new "skills". Thanks to the Mycroft module, Red can answer questions about the weather or the time worldwide, retrieve information from a web search engine or from Wikipedia, directly turn on a lamp or interface to a home automation platform, leave a message for another user, set an alarm, and more.
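How such a module could forward a user request to a locally running Mycroft instance is sketched below in Python, assuming Mycroft's default websocket message bus and the websocket-client package; this is not the actual integration code of Red, and error handling is omitted.

```python
# Sketch of a Smart assistant module call that forwards an utterance to a
# local Mycroft instance over its websocket message bus and waits for the
# spoken reply. The bus endpoint and message types follow Mycroft's public
# message bus conventions; treat these details as assumptions of this sketch.
import json
import websocket  # pip install websocket-client

def ask_mycroft(text, host="localhost", port=8181):
    ws = websocket.create_connection(f"ws://{host}:{port}/core")
    ws.send(json.dumps({
        "type": "recognizer_loop:utterance",
        "data": {"utterances": [text]},
        "context": {},
    }))
    # Return the first 'speak' message, which carries Mycroft's textual reply.
    while True:
        message = json.loads(ws.recv())
        if message.get("type") == "speak":
            ws.close()
            return message["data"]["utterance"]

# Example: print(ask_mycroft("what time is it in New York"))
```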
IV. EXPERIMENTAL RESULTS

The development and, consequently, the experimentation of the Red virtual assistant are still ongoing work. In order to produce a preliminary evaluation of both Red performance and user experience, three identical prototypes, all running the same software, were deployed in a students lab. Each prototype ran on a Raspberry Pi 3 Model B board, equipped with a USB webcam with a microphone, a small speaker and a 3.5" color LCD display (see Fig. 3). In the lab, each prototype was positioned on a different desk, facing the user's chair, slightly to the side.

Fig. 3. A picture of one of the three Red prototypes.

The smart assistant architecture has a modular and customizable graphical interface, based on HTML5 documents and animations. Specifically for the Red prototype, 5 animations were produced, each one representing a different facial expression:
• Vacant expression, shown when no user is detected in front of the camera. The agent shows curiosity for the environment, looks around, searches for someone.
• Surprised expression, shown when a user, different from the last one seen before, is recognized in front of the camera. The user is also greeted with her name. If the user is unknown, i.e., Red does not have a picture of her in its database, the user is named "Stranger".
• "No no" expression, used to reject undesirable interactions (such as those containing contemptible language).
• Silent expression, adopted when some user is detected in front of the camera. It shows attention to the user.
• Chatting expression, used while Red is speaking. This expression is realistic and synchronized with the speech.

The users were given 30 minutes to get accustomed to the presence of the virtual agent on their desk, before the real tests began. After a few minutes, depending on their age and curiosity, the users stopped being distracted by the device, which had gradually become a part of the environment. However, as soon as a chance to ask for its services appeared, the users never hesitated to bring it into play.

The actual test session was composed of a set of 8 tasks, involving different services and interactions. A total of 15 subjects (9 women and 6 men) were involved in the experiments. The subjects were 10 students, 18-24 years old, not engaged in technology-oriented studies, three young researchers, 25-28 years old, and two professors, 40 and 50 years old. Each user was asked to perform the whole set of tasks once, so each task was performed 15 times over the whole experiment.

For each task, measures were acquired regarding the number of fully successful attempts, the number of partial failures (attempts that required repeating the interaction at least once before the task was successfully completed) and the number of full failures (attempts that produced a wrong result). The measures are reported in Table I.

TABLE I
Success rate per task, in percentage (Session I). 15 trials were performed for each task.

Task | Successful | Partial Failure | Failure
Face detection | 100% | - | -
Face recognition | 100% | - | -
Ask for the time in New York | 100% | - | -
Ask for the weather in Sydney | 100% | - | -
Ask about Arthur C. Clarke | 93.33% | 6.67% | -
Switch on the light on the desk | 100% | - | -
Ask who was sitting at the desk 10 minutes before | 100% | - | -
Set an alarm in 10 minutes | 93.33% | 6.67% | -
Given the low number of participants in the tests, the very high score in face recognition was easily expected; a more realistic figure for the face recognition approach is provided in [22]. Nevertheless, this is indeed a typical scenario for home automation, where the number of potential users is, in most cases, that of the members of a family. The partial failures for the cases "Ask about Arthur C. Clarke" and "Set an alarm in 10 minutes" (in both cases, 1 partial failure out of the whole set of trials) might have been partly caused by the low quality of the audio input. The adopted hardware platform does not provide high quality audio input or noise canceling. Most commercial devices, such as the Amazon Alexa, feature noise-cancelling microphone arrays, but such hardware is expensive and was not considered necessary for our setup, as the user is supposed to be sitting in front of the device during the interaction.

As the virtual assistant is speech-enabled, it was decided to make it directly pose the evaluation questions to the users. Red asked a number of questions and automatically recorded the answers of each user, in textual form, in an Excel file. Although making the subject of the evaluation ask questions about its own performance might produce biased results, this was considered part of the experiment, aimed at further revealing the degree of involvement of the users with the "subject" of the evaluation.

The users showed strong interest in this approach and answered the questions with attention and accuracy. Table II reports the results of the interviews.

The users were Italian speakers, while the interaction with Red was entirely in English. As a consequence, the small fraction of partial failures, also pointed out in the responses to the user experience questionnaires, might also have been caused by imperfect pronunciation.

In order to attain high realism, and therefore a better "persona effect", the assistant's face is always slightly moving. This was in some cases considered "a bit creepy" and somewhat distracting. Although occurring very rarely, this might be a case of trespassing into the "uncanny valley" [26], despite the fact that the chosen character is not human.

TABLE II
User experience questionnaire (Session I). Answers reported in percentage.

Question | Yes | More Yes than No | More No than Yes | No
Did you enjoy the overall experience? | 100% | - | - | -
Did you clearly understand my spoken messages? | 93.33% | 6.67% | - | -
Did I promptly catch your commands? | 93.33% | - | 6.67% | -
Would you like to have me on your desk at home? | 100% | - | - | -
Would you enjoy my services on a daily basis? | 100% | - | - | -
Were you able to concentrate on your work while I was sitting on your desk? | 86.67% | 6.67% | 6.67% | -

Overall, the experience was considered very positive and the users declared their interest in continuing the experimentation with the virtual assistant.

In order to highlight the effects of character animation and of user face detection and recognition on the interaction, the experiments were repeated with a disembodied version of Red. The face detection and recognition modules and the graphical interface were deactivated and the device was positioned horizontally, in order to 'hide' it in the environment. The users could still evoke the agent by calling it by name ("Hey, Red!"), ask their questions and get their responses; however, they would not be recognized as different persons and would not get any visual feedback from the agent.

To avoid any interference with the previous testing session, a new group of 12 users was selected, 5 male and 7 female, in the range 19-25 years old. Again, an initial 30-minute interval was granted to the users to get accustomed to the presence of the disembodied virtual agent, which would respond to its name and reply to direct questions. After that interval, the real tests were performed.

TABLE III
Success rate per task, in percentage (Session II). 12 trials were performed for each task.

Task | Successful | Partial Failure | Failure
Ask for the time in New York | 100% | - | -
Ask for the weather in Sydney | 91.67% | 8.33% | -
Ask about Arthur C. Clarke | 100% | - | -
Switch on the light on the desk | 100% | - | -
Set an alarm in 10 minutes | 100% | - | -

The results obtained from this second experimental session are reported in Table III. Unsurprisingly, the accuracy results did not deviate significantly from those in Table I, as nothing changed in the smart agent architecture. Indeed, the observers noted during this session that the users tended to speak more slowly and to articulate their words more than during the first session.

Besides the inability to recognize the user, and thus the inability to maintain a reliable context-aware interaction, no other significant differences were found between the two sessions. However, while comparing the interaction logs from the two experimental sessions, some more interesting details emerged: before the first session, during the initial 30-minute interval, after the first few minutes needed to get accustomed to the presence of the virtual assistant, the users actually started "chatting" with the virtual assistant. A number of attempts were made to establish some kind of informal interaction, with questions such as "how old are you?", "what are you?", "where do you come from?" and "do you like music?". Luckily, a very rudimentary approach to conversational interaction had already been added to the coordination module, so the final experience for the users was not too frustrating. Notably, the described attempts at informal interaction did not appear in the logs of the second experimental session. Since the users were different from those of the first session, such attempts would instead have been expected. A plausible explanation is that the disembodied agent did not tempt the users into considering it as a potential conversation partner.
This hypothesis brings both good and bad news. The good news is that the embodied agent actually succeeded in making the interaction significantly more natural and attractive. The bad news is that such a more natural interaction requires an effective natural language processing interface and a context-aware conversational agent [27] in order to fulfill the increased expectations of the users.

The importance of the information gathered from the session logs was confirmed by the results of the interviews with the users involved in the second experimental session. As reported in Table IV, the users were much less impressed by the disembodied agent than by the embodied agent of the first session. 5 users out of 12 declared that the agent was nothing new with respect to the virtual assistant in their smartphone, and that the latter was faster, even though it could not turn a lamp on. Overall, the net outcome was that the ability to visually interact with the user is definitely a significant plus for a virtual agent.

TABLE IV
User experience questionnaire (Session II). Answers reported in percentage.

Question | Yes | More Yes than No | More No than Yes | No
Did you enjoy the overall experience? | 33.33% | 66.67% | - | -
Did you clearly understand my spoken messages? | 91.66% | 8.33% | - | -
Did I promptly catch your commands? | 91.66% | - | 8.33% | -
Would you like to have me on your desk at home? | 33.33% | 66.67% | - | -
Would you enjoy my services on a daily basis? | 33.33% | 66.67% | - | -
Were you able to concentrate on your work while I was sitting on your desk? | 100% | - | - | -

V. CONCLUSIONS AND FUTURE WORK

In this paper a software architecture for building lightweight, vision- and speech-enabled virtual assistants for smart home and automation applications was presented. A complete prototype application was also developed, featuring a realistic graphical assistant able to show facial expressions and enabled with speech synthesis and recognition, face detection and face recognition for user identification. The assistant was also connected to a smart home assistant platform, thus building a complete "embodied" virtual home assistant that, differently from most common smart speakers, is able to "see" and "be seen" by the user and to engage her in a multimodal interaction.

An explorative experimentation was carried out and reported in the paper. The experimental results are satisfactory both in terms of accuracy and reliability and in terms of user experience. In particular, the users appreciated the experience and the feel of the interface and expressed their willingness to continue working with it. A simple counter-experiment was set up with a disembodied version of the virtual agent, showing that the ability to detect and recognize the user, as well as the graphical interface, largely improves the user experience.

The presented architecture is still under development and several improvements are being added. In particular, the virtual assistant's ability to acquire visual information on the user and its surroundings also paves the way for further important applications, such as fall detection [28] or the detection of anomalous or dangerous behavior [29]. Each functionality will be implemented through a dedicated service module, possibly running on a separate node and connected with the others in real time through an adequate wireless communication protocol [30], able to offer reduced energy consumption and packet loss rate [31].

REFERENCES

[1] X. Lei, G. Tu, A. X. Liu, C. Li, and T. Xie, “The insecurity of home digital voice assistants - amazon alexa as a case study,” CoRR, vol. abs/1712.03327, 2017.
[2] J. Gratch, N. Wang, J. Gerten, E. Fast, and R. Duffy, “Creating rapport with virtual agents,” in Intelligent Virtual Agents (C. Pelachaud, J.-C. Martin, E. André, G. Chollet, K. Karpouzis, and D. Pelé, eds.), (Berlin, Heidelberg), pp. 125–138, Springer Berlin Heidelberg, 2007.
[3] J. C. Lester, S. A. Converse, S. E. Kahler, S. T. Barlow, B. A. Stone, and R. S. Bhogal, “The persona effect: Affective impact of animated pedagogical agents,” in Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI ’97, (New York, NY, USA), pp. 359–366, ACM, 1997.
[4] Embodied Conversational Agents. Cambridge, MA, USA: MIT Press, 2000.
[5] E. André and C. Pelachaud, Interacting with Embodied Conversational Agents, pp. 123–149. Boston, MA: Springer US, 2010.
[6] B. Weiss, I. Wechsung, C. Kühnel, and S. Möller, “Evaluating embodied conversational agents in multimodal interfaces,” Computational Cognitive Science, vol. 1, p. 6, Aug 2015.
[7] Y. Matsuyama, A. Bhardwaj, R. Zhao, O. Romeo, S. Akoju, and J. Cassell, “Socially-aware animated intelligent personal assistant agent,” in Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 224–227, Association for Computational Linguistics, 2016.
[8] M. Schroeder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, and M. Wöllmer, “Building autonomous sensitive artificial listeners,” IEEE Transactions on Affective Computing, vol. 3, pp. 165–183, 2012.
[9] B. Martinez and M. F. Valstar, Advances, Challenges, and Opportunities in Automatic Facial Expression Recognition, pp. 63–100. Cham: Springer International Publishing, 2016.
[10] F. Battaglia, G. Iannizzotto, and L. Lo Bello, “A person authentication system based on rfid tags and a cascade of face recognition algorithms,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, pp. 1676–1690, Aug 2017.
[11] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, June 2015.
[12] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” ArXiv e-prints, Dec. 2017.
[13] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” CoRR, vol. abs/1412.5567, 2014.
[14] M. Schröder and J. Trouvain, “The german text-to-speech synthesis system mary: A tool for research, development and teaching,” International Journal of Speech Technology, vol. 6, pp. 365–377, Oct 2003.
[15] A. W. Black and K. A. Lenzo, “Flite: a small fast run-time synthesis engine,” in 4th ITRW on Speech Synthesis, Perthshire, Scotland, UK, August 29 - September 1, 2001, p. 204, 2001.
[16] P. Cosi, F. Tesser, R. Gretter, C. Avesani, and M. Macon, “Festival speaks
italian!,” in EUROSPEECH 2001 Scandinavia, 7th European Conference
on Speech Communication and Technology, 2nd INTERSPEECH Event,
Aalborg, Denmark, September 3-7, 2001 (P. Dalsgaard, B. Lindberg,
H. Benner, and Z.-H. Tan, eds.), pp. 509–512, ISCA, 2001.
[17] KITT.AI, “Snowboy hotword detection.” https://github.com/kitt-ai/snowboy, 2018.
[18] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar,
and A. I. Rudnicky, “Pocketsphinx: A free, real-time continuous speech
recognition system for hand-held devices,” in 2006 IEEE International
Conference on Acoustics Speech and Signal Processing Proceedings,
vol. 1, pp. I–I, May 2006.
[19] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J.
Comput. Vision, vol. 57, pp. 137–154, May 2004.
[20] Itseez, “Open source computer vision library.” https://github.com/itseez/opencv, 2015.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778, June 2016.
[22] D. E. King, “High quality face recognition with deep metric learning.” http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html, 2017.
[23] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine
Learning Research, vol. 10, pp. 1755–1758, 2009.
[24] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua,
Labeled Faces in the Wild: A Survey, pp. 189–248. Cham: Springer
International Publishing, 2016.
[25] MycroftAI, “Mycroft, an open source artificial intelligence for everyone.” https://github.com/MycroftAI/mycroft-core, 2018.
[26] M. Mori, “Bukimi no tani [The uncanny valley],” Energy, vol. 7, no. 4,
pp. 33–35, 1970.
[27] M. Jain, R. Kota, P. Kumar, and S. N. Patel, “Convey: Exploring the
use of a context view for chatbots,” in Proceedings of the 2018 CHI
Conference on Human Factors in Computing Systems, CHI ’18, (New
York, NY, USA), pp. 468:1–468:6, ACM, 2018.
[28] F. Cardile, G. Iannizzotto, and F. La Rosa, “A vision-based system for
elderly patients monitoring,” in 3rd International Conference on Human
System Interaction, pp. 195–202, May 2010.
[29] G. Iannizzotto and L. Lo Bello, “A multilevel modeling approach for
online learning and classification of complex trajectories for video
surveillance,” International Journal of Pattern Recognition and Artificial
Intelligence, vol. 28, 08/2014 2014.
[30] G. Iannizzotto, F. La Rosa, and L. Lo Bello, “A wireless sensor network for distributed autonomous traffic monitoring,” in 3rd International Conference on Human System Interaction, pp. 612–619, May 2010.
[31] E. Toscano and L. Lo Bello, “A topology management protocol with
bounded delay for wireless sensor networks,” in 2008 IEEE Interna-
tional Conference on Emerging Technologies and Factory Automation,
pp. 942–951, Sept 2008.
