Intelligent Virtual Assistants
1 Introduction 4
2 History 6
2.1 Early decades: 1910 - 1980 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Birth of smart virtual assistants: 1990s—Present . . . . . . . . . . . . . . . . 7
4 Future Advancements 9
4.1 Multi-Modal Intelligent Virtual Assistants: . . . . . . . . . . . . . . . . . . . 9
4.1.1 Structure of General Dialogue System . . . . . . . . . . . . . . . . . . 10
4.1.2 The Proposed IVAs Systems . . . . . . . . . . . . . . . . . . . . . . . 10
4.2 Telepathic Virtual Assistants: . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2.2 Proposed work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.3 Working of the proposed model: . . . . . . . . . . . . . . . . . . . . . 14
4.2.4 Module I- Brain Computer Interface: . . . . . . . . . . . . . . . . . . 15
4.2.5 Module II – Request Processing: . . . . . . . . . . . . . . . . . . . . 17
7 Conclusion 21
List of Tables
1 Types of Alexa cloud-native data . . . . . . . . . . . . . . . . . . . . . . . . 20
List of Figures
1 The Structure of General Dialogue System . . . . . . . . . . . . . . . . . . . 9
2 Structure of next generation virtual assistant . . . . . . . . . . . . . . . . . 11
3 The graph model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 The gesture model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 The ASR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
6 The next generation Virtual Assistant . . . . . . . . . . . . . . . . . . . . . 15
7 Architecture of the proposed system . . . . . . . . . . . . . . . . . . . . . . 16
8 The Brain Computer Interface . . . . . . . . . . . . . . . . . . . . . . . . . 17
9 The Request Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Nomenclature
AI Artificial intelligence
DM Dialogue Manager
EEG Electroencephalogram
Intelligent Virtual Personal Assistant
Prince Nandha
Abstract
Artificial Intelligence (AI) has made great progress and continues to expand its
potential. Natural Language Processing (NLP) is one of AI's applications. Voice
assistants combine AI with cloud computing and can converse with users in natural
language. Voice assistants are simple to use, which is why millions of gadgets in
homes today contain them. The goal of this research is to investigate every facet of
the emerging Intelligent Virtual Assistant (IVA) technology. This article studies the
origins of IVAs and their future progress. We discuss the next generation of IVAs,
multi-modal IVAs, which can be accessed through several types of input. This study
also presents an improvement in the form of a telepathic IVA, which functions by
thought. Security and privacy are two of the most pressing concerns as the technology
advances. This paper investigates the chances of a security breach and offers some
potential solutions to make IVAs more secure for users.
1 Introduction
Human lifestyles have been transformed by the rapid growth of technology. Enlightenment
requires knowledge, and the methods of obtaining knowledge have evolved with time.
The library, formerly regarded as a centre of information, is now used mainly by
scholars, because of the vast amount of information now available through search
engines. Even computers are being phased out in favour of smartphones, and virtual
assistants are taking over the labour of manual search. The use of a virtual assistant
has grown in popularity because it makes tasks easier than performing them manually.
Personal assistants are mostly used for searching, navigating between programmes,
and completing tasks the user has requested. A speech- or text-based bot can be used
as a more generic sort of personal assistant [8].
An AI-based assistant is a kind of software whose job is to fulfil tasks based on
given orders or inquiries. IVAs built on chat-based systems are also known as chatbots;
online chat services are occasionally used solely for entertainment. By means of
artificial speech, some IVAs can communicate with humans in an understandable manner.
Thanks to this functionality, IVAs can carry out tasks such as controlling
home-automation devices and media, answering questions, and performing basic chores
like reading emails and to-do lists through vocal commands. The dialogue system is
based on a similar idea, but with a few modifications [9].
According to a report published in 2017, IVA capabilities and everyday usage are growing
rapidly thanks to new gadgets and effective voice-based interfaces with humans. Tech
giants such as Apple and Google have introduced IVAs in smartphones, Amazon has a large
user base for its smart speakers, and Microsoft has a great base of personal computers
all over the globe. Conversica's intelligent virtual assistants for business have
received over 100 million email and SMS engagements [4].
Nowadays, spoken dialogue systems are in demand because of their ability to communicate
with humans in a more human way and to help the user perform tasks quickly. These
systems are now part of smartphones, televisions, personal computers, and both automated
and conventional vehicles. Microsoft's Cortana, Apple's Siri, Google Assistant, and
Amazon Alexa are a few of the most popular spoken dialogue systems used to make devices
easier to operate through vocal commands [1].
Amazon's powerful deep-learning capabilities, such as automatic speech recognition (ASR)
for converting speech to text and natural language understanding (NLU) for determining
the intent of the text, enable developers to create apps with highly engaging user
interfaces and lifelike conversational interaction [1].
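To make the ASR-to-NLU flow concrete, the following is a minimal, purely illustrative
sketch (not Amazon's actual SDK or API). It assumes the ASR stage has already produced a
text transcript, and it uses simple keyword patterns as a stand-in for a trained NLU
model; the intent names and patterns are invented for illustration.

```python
import re

# Hypothetical intent patterns standing in for a trained NLU model.
INTENT_PATTERNS = {
    "GetWeather": re.compile(r"\b(weather|forecast|temperature)\b", re.I),
    "SetAlarm":   re.compile(r"\b(alarm|wake me)\b", re.I),
    "PlayMusic":  re.compile(r"\b(play|music|song)\b", re.I),
}

def detect_intent(transcript: str) -> str:
    """Map an ASR transcript to an intent name (or 'Fallback')."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(transcript):
            return intent
    return "Fallback"

def handle(transcript: str) -> str:
    """Produce a canned response for the detected intent."""
    responses = {
        "GetWeather": "Here is today's forecast.",
        "SetAlarm":   "Alarm set.",
        "PlayMusic":  "Playing your playlist.",
        "Fallback":   "Sorry, I didn't understand that.",
    }
    return responses[detect_intent(transcript)]

if __name__ == "__main__":
    # Text that an ASR stage might have produced from speech.
    print(handle("what is the weather like tomorrow"))
```

A real assistant would replace the keyword patterns with a statistical intent classifier
and slot extraction, but the overall pipeline (speech to text, text to intent, intent to
response) is the same.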
Virtual assistants communicate with their users through several channels. One is text,
especially online chatbots, SMS, email, and other text-based channels; Conversica's
business IVA is one example. Another is voice, the more advanced way of interacting with
an IVA, championed by the tech giants; examples include Apple Siri, Amazon Alexa, and
Google Assistant. Some of these can also be used with different types of input: Google
Assistant, for instance, can be accessed through image search, text, and vocal
commands [9].
NLP is used by virtual assistants to match user text or voice input to executable
commands. Many of them continually learn using artificial-intelligence approaches such
as machine learning. Some of these assistants, such as Google Assistant and Samsung
Bixby, can also perform image processing to distinguish items in photographs, allowing
users to obtain better results from the photos they take.
A wake word is a vocal command that may be used to activate a virtual assistant.
Examples include "Hey Siri", "OK Google" or "Hey Google", "Alexa", and "Hey Microsoft".
As virtual assistants grow more common, the legal risks they pose are becoming more
apparent [8].
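A minimal sketch of how wake-word gating might look on the device side is shown below.
It is purely illustrative: real assistants run dedicated low-power keyword-spotting
models rather than transcribing everything, and the transcribe_chunk() helper here is a
hypothetical stand-in for a microphone plus ASR stage.

```python
WAKE_WORDS = ("hey siri", "ok google", "hey google", "alexa", "hey microsoft")

def transcribe_chunk() -> str:
    """Hypothetical helper: return the transcript of the latest audio chunk."""
    return input("say something: ")  # stand-in for a microphone + ASR stage

def wake_word_heard(text: str) -> bool:
    """Check whether the utterance begins with a known wake phrase."""
    text = text.lower().strip()
    return any(text.startswith(w) for w in WAKE_WORDS)

def listen_loop(max_turns: int = 3) -> None:
    for _ in range(max_turns):
        text = transcribe_chunk()
        if wake_word_heard(text):
            # Only after the wake word is the utterance sent on for NLU.
            print("wake word detected, processing:", text)
        else:
            print("ignored (no wake word)")

if __name__ == "__main__":
    listen_loop()
```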
The IoT market is growing rapidly and is expected to reach about $2 trillion, roughly a
20% CAGR, and experts claim that within two years 40% of homes will use one or more
IVAs. In the IoT industry, an IVA is a popular service for connecting with people using
voice commands. Smart speakers, smart refrigerators, connected vehicles, and other
gadgets can all contain an IVA.
Famous IVAs like Amazon Alexa and Google Assistant rely on cloud computing for max-
imum performance and effective data management. A huge number of behavioural traces,
including a user’s voice activity history with extensive descriptions, can be saved in remote
cloud servers inside an IVA ecosystem during this process. If those data are stolen or exposed
as a result of a cyberattack, such as a data breach, a malicious individual may be able to not
only capture extensive IVA service usage history, but also divulge other user-related informa-
tion using various data analysis techniques[7]. This document illustrates and categorises the
many sorts of user-generated Alexa data. We examine a multi-month experimental dataset
using Alexa-enabled devices such as the Echo Dot, Dash Wand, Fire HD, and Fire TV. This
study demonstrates how to gain fresh insights into personal information such as likes and
life patterns using a number of data analysis approaches. Furthermore, the findings of this
study have significant ramifications for both IVA vendors and end users in terms of privacy
risks[7].
Chatbots, which perform pre-determined tasks, are a variation on the personal assistant
concept. Multilingual personal assistants are available on the market to aid various
groups of people. Personal assistants that can complete tasks by reading the user's mind
are still a pipe dream; telepathy is regarded as folklore and is not currently in
practice. However, the phenomenon is becoming feasible thanks to the advancement of
technology. Virtual assistants are difficult to use for those with hearing and speech
impairments, and because of the risk of being misunderstood, using a virtual assistant
causes issues in vital and tense situations. We can prevent these unfavourable
situations by employing telepathic personal assistants [2].
Recent news has raised the question of the reliability of IVAs such as Alexa, Google
Home, and Apple Siri. Reports show that IVAs are not always reliable. As an
illustration, consider the following news report: a child once shared her passion for
dollhouses and cookies with her family's new Echo Dot, which prompted Alexa to order a
$160 dollhouse and some cookies without the parents' knowledge. When this news was
reported on TV, other users reported that their Echo Dots also tried to order dollhouses
after hearing the reporter's voice. We will look at some security problems that come
with these technological advancements.
2 History
2.1 Early decades: 1910 - 1980
Radio Rex, invented in the 1920s, was the first toy with a voice-based activation
system: a wooden dog came out of its house when its name was called.
The Automatic Digit Recognition system, dubbed "Audrey" by Bell Labs, was unveiled in
1952. It took up a six-foot-high relay rack, consumed a lot of power, had a great many
cables, and had all the maintenance problems that come with intricate vacuum-tube
electronics. It was able to distinguish between phonemes, the basic units of speech, but
it was limited to strict digit recognition for a few designated speakers. It could thus
be used for voice dialling, but in most circumstances push-button dialling was cheaper
and faster than reciting the successive digits. The IBM Shoebox voice-activated
calculator, introduced to the general public during the 1962 Seattle World's Fair after
its original market introduction in 1961, was another early gadget that could perform
digital speech recognition. This early computer, developed nearly 20 years before the
first IBM Personal Computer was introduced in 1981, could detect 16 spoken phrases and
the numerals 0 to 9 [4].
MIT scientist Joseph Weizenbaum created the first NLP-based chatbot, ELIZA, in the late
1960s. It was created to show that communication between humans and machines was
superficial. ELIZA employed pattern matching and substitution methods in scripted
responses to simulate conversation, giving the impression that the software understood
what was being said [4].
Weizenbaum's assistant allegedly asked him to leave the room so she and ELIZA could have
a proper talk. Weizenbaum was taken aback, noting subsequently, "I had not imagined ...
that extremely brief exposures to a pretty simple computer programme might cause
significant delusional thinking in fairly normal persons." [4]
The ELIZA effect, also known as anthropomorphisation, is a phenomenon that occurs in
human interactions with virtual assistants: the tendency to automatically assume that
computer activities are equivalent to human behaviours.
The next milestone was achieved in the 1970s at Carnegie Mellon University in
Pittsburgh, Pennsylvania, where the US DARPA agency started a five-year research
programme on speech understanding with the goal of achieving a vocabulary of 1,000
words. IBM, Carnegie Mellon University (CMU), and Stanford Research Institute were among
the companies and universities that participated in the programme [4].
The result of the research was called "Harpy". It had a vocabulary of 1,000 words and
was able to understand whole phrases. Harpy analysed speech using a pre-programmed
vocabulary, pronunciation, and grammar to identify word sequences that made sense
together, minimising speech-recognition errors [4].
Another example of such an IVA is "TANGORA", a voice-recognition typewriter launched in
1986 as an advanced form of the "SHOEBOX". It had a vocabulary of about 20,000 words and
predicted the likely outcome of a query based on previous inputs. An interesting fact is
that "Tangora" was named after the fastest typist of the time. IBM used a Markov model
together with digital signal processing techniques; the approach makes it possible to
forecast which phonemes are most likely to follow a given one. Each user had to train
"TANGORA" to identify his or her voice and phrasing [4].
Two IVAs, "COLLOQUIS" and "SmarterChild", were made public in 2001 on messaging
platforms. They could play games, look up facts, and check the weather, although they
were text-based only.
Apple made its debut in the IVA market with Siri as part of the iPhone 4S in 2011; it
was the first modern IVA built into a smartphone. Siri grew out of research funded by
the US Department of Defense's DARPA and came to Apple through the acquisition of Siri
Inc. Siri's aim was to answer texts, make phone calls, report the weather, set alarms,
and perform internet searches; this was later extended to giving suggestions as
well [4].
Amazon debuted Alexa alongside the Echo in November 2014. In April 2017 Amazon launched
a service that allows users to create conversational interfaces for any sort of virtual
assistant or interface [4].
3.2 Voice-based IVA:
A virtual assistant is a programme that recognises voice instructions and does tasks on the
user’s behalf. Virtual assistants can be found on most smartphones and tablets, as well as
desktop computers and standalone devices such as the Amazon Echo and Google Home.
They use a combination of specialised computer chips, microphones, and software to
listen for precise spoken commands from you and respond in the voice you choose.
The programme converts the user's spoken words to text and then feeds that text to
cloud-based systems for interpretation.
4 Future Advancements
4.1 Multi-Modal Intelligent Virtual Assistants:
With time, text-only IVAs could no longer satisfy human needs, so new research has been
carried out to build IVAs that can accept input by speech, touch, pen, gestures, and
body movement. Such IVAs are referred to as multi-modal IVAs because they support a
number of input modes. This technology, which includes a touch screen and a speech
recogniser, is used to handle numerous non-critical automotive functions, such as
weather, navigation queries, and phone calls, in the Ford Model U Concept Vehicle, for
example. With this improvement, newer speech and command-and-control interfaces were
introduced in some car systems, which did well in the car market with this new
technology. The prototype provides a human-language dialogue interface together with an
attractive graphical interface for the user [1].
"Semio is building a cloud-based platform to allow humans to use robots through natural
communication—speech and body language," according to a statement from the University of
Southern California. The authors propose a method for designing the next generation of
virtual personal assistants based on a multi-modal dialogue system; it employs input by
gestures, images, videos, or speech, a vast dialogue and conversational knowledge base,
and a general knowledge base to improve user-computer interaction. Furthermore, the
method can be applied to a variety of tasks, including educational support, medical
support, robots and vehicles, disability systems, home automation, and security access
controls [10].
The method includes some novel features that distinguish this device, such as using the
TV by displaying data on a screen or connecting to one, watching shows with language
translation, holding text-based conversations with others in any language, understanding
body language and movements, and playing games with speech and gesture recognition; it
can also be used to read facial and speech expressions [10].
An ASR Model, Gesture Model, Graph Model, Interaction Model, User Model, Input Model,
Output Model, Inference Engine, Cloud Servers, and Knowledge Base are added to the
original structure of the general dialogue system to turn it into a multi-modal dialogue
system, allowing the next generation of virtual personal assistants to be designed with
high accuracy.
Figure 2: Structure of next generation virtual assistant
A) Knowledge Base
Generally, there are two types of knowledge base: an online knowledge base and a local
knowledge base. Both contain all the data and facts pertaining to each model, for
example facial and gesture data for the gesture model, vocabulary and spoken phrases for
the ASR model, and graphical data for the graph model, along with other useful
information such as the system configuration.
B) Graph Model:
The graph model handles graphical input such as images and videos in real time. It
extracts frames from the videos captured by the camera in the connected input model,
then passes those frames and images to graph-model applications running on cloud servers
for data analysis, and the applications return the results. Refer to Fig. 3.
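As an illustration of the frame-extraction step described above, the following sketch
uses OpenCV to pull frames from a video stream at a fixed interval; the cloud-analysis
call is represented by a hypothetical placeholder function, since the actual service is
not specified in the source.

```python
import cv2  # pip install opencv-python

def send_to_cloud(frame) -> str:
    """Hypothetical placeholder for the cloud-hosted image-analysis service."""
    h, w = frame.shape[:2]
    return f"analysed a {w}x{h} frame"

def extract_and_analyse(video_path: str, every_n_frames: int = 30) -> None:
    cap = cv2.VideoCapture(video_path)  # a camera index such as 0 also works
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:                        # end of stream
            break
        if count % every_n_frames == 0:
            print(send_to_cloud(frame))   # forward selected frames for analysis
        count += 1
    cap.release()

if __name__ == "__main__":
    extract_and_analyse("input_video.mp4")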
C) Gesture model
This model reads the user's body movements, facial expressions, and gestures using the
camera and sensors in the input model. The data is then passed to the gesture model and
to apps on cloud servers for analysis, which deliver the result. Refer to Fig. 4.
D) ASR Model
This model reads speech input in real time using a microphone in the connected input
model, together with the ASR model on cloud servers for voice recognition. It converts
the vocal data into text, which is then passed to applications on a cloud server for
analysis, and the result is received. Refer to Fig. 5.
E) Interaction Model
This is the main model; it establishes the interaction between the system and the other
models. It evaluates the data coming from the Input Model, determines which model should
receive it depending on the task, and then combines the intermediate results to arrive
at the final result.
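As a rough illustration of this routing role, the sketch below dispatches input-model
data to a handler based on its modality. The handler functions are placeholders for the
ASR, gesture, and graph models described above, not part of the cited design.

```python
# Hypothetical per-modality handlers standing in for the models in this section.
def asr_model(data):     return f"ASR handled: {data}"
def gesture_model(data): return f"gesture handled: {data}"
def graph_model(data):   return f"graph handled: {data}"

ROUTES = {"audio": asr_model, "gesture": gesture_model, "image": graph_model}

def interaction_model(modality: str, data):
    """Route input-model data to the model responsible for that modality."""
    handler = ROUTES.get(modality)
    if handler is None:
        return "unsupported input modality"
    return handler(data)

if __name__ == "__main__":
    print(interaction_model("audio", "turn on the lights"))
    print(interaction_model("image", "frame_0421.png"))
```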
F) Inference Engine
Working through a chain of conditions and derivations, the inference engine collaborates
with the Interaction Model to derive the conclusion. It examines all of the facts and
rules, sorts them, and then comes up with a solution.
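To make the "facts and rules" idea concrete, here is a minimal forward-chaining sketch:
a rule fires when all of its conditions are satisfied by known facts, and newly derived
facts can trigger further rules. The facts and rules shown are invented examples, not
taken from the cited system.

```python
# Each rule: (set of required facts, fact that is derived when they all hold).
RULES = [
    ({"user_said_play", "music_service_linked"}, "action_play_music"),
    ({"user_said_lights_on", "smart_bulb_paired"}, "action_turn_on_lights"),
    ({"action_play_music"}, "respond_with_confirmation"),
]

def forward_chain(facts: set) -> set:
    """Repeatedly apply rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in RULES:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

if __name__ == "__main__":
    known = {"user_said_play", "music_service_linked"}
    print(forward_chain(known))
    # -> also contains 'action_play_music' and 'respond_with_confirmation'
```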
G) User Model
This model contains all of the information on the system’s users. Personal information
such as users’ names and ages, hobbies, skills and expertise, objectives and ambitions, pref-
erences and dislikes, and statistics about their behaviour and interactions with the system
can all be included.
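A user model of this kind is essentially a structured profile. A minimal sketch, with
invented field names, might look like the following:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Illustrative user-model record; the field names are assumptions."""
    name: str
    age: int
    hobbies: list = field(default_factory=list)
    preferences: dict = field(default_factory=dict)
    interaction_count: int = 0          # simple behaviour statistic

    def record_interaction(self) -> None:
        self.interaction_count += 1

if __name__ == "__main__":
    user = UserProfile(name="Asha", age=30, hobbies=["music"],
                       preferences={"news_source": "BBC"})
    user.record_interaction()
    print(user)
```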
H) Input Model
This model will coordinate the operation of all input devices used by the system to gather
data from the microphone, camera, and Kinect. In addition, before delivering the data to
the Interaction Model, this model contains intelligence algorithms to arrange the input data.
I) Output Model
This model receives the final decision of the Interaction Model along with an
explanation of the results; its duty is then to choose the appropriate output device,
such as a screen or speaker, to present the data to the user.
Testing the system
The entire system analyses user inputs and builds queries to cloud servers and knowledge
sources in order to complete tasks and obtain data for the response-generating models to
output. In this way the multi-modal intelligent virtual assistant can be developed [1].
Figure 3: The graph model
Virtual assistants make it difficult for persons with hearing and speech impairments,
and because of the risk of being misunderstood, employing a virtual assistant causes
problems in important and tense situations. We may prevent these unfavourable situations
by employing telepathic personal assistants. The BCI is used to decode the user's
thoughts without the need for physical or vocal input. We connect the BCI with the IVA
to increase its effectiveness in receiving input from users, and we employ bone
conduction technology so that persons with auditory difficulties can enjoy the virtual
assistant environment. The emotional classifier helps determine the user's emotional
state so that the virtual assistant can respond appropriately [2].
Figure 5: The ASR Model
2. Bone conduction: Bone conduction technology has been routinely employed for decades.
The gadget turns sound into vibration, which is the eardrum's fundamental function, and
the vibrations reach the cochlea, which is attached to the auditory nerve, which then
communicates the sounds to our brain. As a result, it is used both by those with hearing
difficulties and by those seeking a unique listening experience [2].
3. Sentimental Classifier: A sentiment classifier detects and classifies text as
positive, negative, or neutral in order to ascertain the user's current feeling. It is
commonly used by online retailers and data analysts to categorise items based on
customer feedback [2]; a minimal sketch is given after this list.
4. Voice Assistance: Samsung's Bixby Voice is a popular IVA. Like any other voice
assistant, it carries out activities that an IVA could undertake for a client, using the
customer's selected voice [2].
5. EEG Headset: It aids in the monitoring and reading of brain impulses. It is made up
of sensors that monitor brain activity and communicate the data to computers. One such
EEG headset is the NeuroSky MindWave Mobile [2].
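As referenced in item 3 above, the following is a minimal, purely illustrative sentiment
classifier built on small word lists; a production classifier would instead be trained on
labelled customer feedback.

```python
POSITIVE = {"good", "great", "love", "happy", "excellent", "thanks"}
NEGATIVE = {"bad", "hate", "angry", "terrible", "sad", "awful"}

def classify_sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral via simple word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

if __name__ == "__main__":
    print(classify_sentiment("I love this assistant, it works great"))  # positive
    print(classify_sentiment("the reminders are terrible"))             # negative
```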
Figure 6: The next generation Virtual Assistant
attention, such as focusing on the notion of pressing a particular letter on the
keyboard. As a result, the user's attention is crucial in producing complicated
dialogues. The text generated by translating the signals is then given to the
sentimental classifier, which analyses the user's emotions using the training data set.
This sentiment analysis aids a more effective understanding of the user by elucidating
the user's motivations and feelings in relation to the text. Bone conduction technology
takes responsibility for communicating the output in an unobstructed way, so that the
technology can help people with auditory problems who are unable to use normal IVAs.
Figure 7: Architecture of the proposed system
trained. All of the symbols appear in a fast, random order. Based on the data recorded
by the EEG during the flashing, the computer is able to differentiate the desired symbol
from all the others after each symbol has flashed a sufficient number of times. The EEG
headset is connected to OpenBCI, and the data are processed and sent via a Bluetooth
link to the programme. EEGLAB is the software used to analyse and process the data.
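The selection step described above can be sketched numerically: the EEG epochs recorded
after each symbol flash are averaged per symbol, and the symbol whose averaged response
is strongest around the expected P300 latency is chosen. The data below are synthetic
and the sampling rate and window boundaries are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
SYMBOLS = ["A", "B", "C", "D"]
FS = 250                      # assumed sampling rate (Hz)
EPOCH = FS                    # 1-second epoch after each flash
N_FLASHES = 20                # flashes per symbol

# Synthetic data: the target symbol "C" gets a small added deflection ~300 ms.
def make_epoch(is_target: bool) -> np.ndarray:
    x = rng.normal(0.0, 1.0, EPOCH)
    if is_target:
        x[int(0.25 * FS):int(0.40 * FS)] += 1.5   # P300-like bump
    return x

epochs = {s: np.stack([make_epoch(s == "C") for _ in range(N_FLASHES)])
          for s in SYMBOLS}

def score(symbol_epochs: np.ndarray) -> float:
    """Average across flashes, then score the window around 300 ms."""
    avg = symbol_epochs.mean(axis=0)
    return float(avg[int(0.25 * FS):int(0.40 * FS)].mean())

best = max(SYMBOLS, key=lambda s: score(epochs[s]))
print("selected symbol:", best)   # expected: C
```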
Figure 9: The Request Processing
5.1 Possible Security & Privacy Threats
1. Wiretapping an IVA ecosystem:
Sniffing the traffic between companion apps and the IVA can reveal the ecosystem's
communication mechanisms, even if the apps use encrypted network connections.
According to our analysis, not all network traffic between IVA-enabled devices and
cloud-hosted services is sent through a secure protocol. Many devices don't use
encrypted communications to check network connectivity, making IVA devices detectable in
a home network. Firmware image data could be delivered in unencrypted packets, making
the system vulnerable to man-in-the-middle attacks and malicious image alteration. Even
if firmware images aren't modified, having access to them creates a security risk
because it allows unauthorised people to view an IVA-enabled device's internal
functionality [3].
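As a rough illustration of the kind of passive analysis described above, the sketch
below uses Scapy to watch home-network traffic and flag packets from a known IVA device
that are not destined for a TLS port. The device address is an assumption, and a real
analysis of the cited ecosystem would be considerably more involved.

```python
from scapy.all import sniff, IP, TCP  # pip install scapy; requires root privileges

IVA_DEVICE_IP = "192.168.1.50"   # assumed address of the IVA device on the LAN
TLS_PORTS = {443, 8443}

def inspect(pkt) -> None:
    """Flag traffic from the IVA device that is not going to a TLS port."""
    if pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt[IP].src == IVA_DEVICE_IP:
        if pkt[TCP].dport not in TLS_PORTS:
            print(f"possibly unencrypted: {pkt[IP].src} -> "
                  f"{pkt[IP].dst}:{pkt[TCP].dport}")

if __name__ == "__main__":
    # Capture 100 TCP packets on the default interface and inspect each one.
    sniff(filter="tcp", prn=inspect, count=100)
```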
1. Amazon Alexa Ecosystem
As previously stated, we focused our efforts on Amazon Alexa and its ecosystem. Various
Alexa-enabled devices are required to engage with the Alexa cloud service. The Alexa IVA
may be used for a number of tasks, including managing to-do lists, playing playlists,
setting morning alarms, placing shopping orders, searching for information, and checking
traffic updates. During this process the Alexa cloud creates and retains several forms
of digital traces connected to a user's behaviour [7].
3. Data Description
We gathered data from a participant’s daily life over the course of three months, using
a number of Alexa-connected devices, including Echo Dots, Dash Wand, Fire Tablet,
and Fire TV[7].
Similarly, other findings can be derived in the same manner from the collected data,
such as alarm timings, the user's interests, daily schedule, and driving routines. The
privacy of users is one of the most serious threats they face: if IVA data can reveal
all of this information about a person, privacy becomes the primary concern. One
proposal is to allow the IVA provider to permanently remove the user's history so that
the data cannot be used to predict lifestyle [7].
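One simple way such inferences can be drawn is by aggregating timestamped voice-command
history. The sketch below uses pandas on a few invented sample records to recover a
rough daily activity pattern and a typical wake-up hour; it is only an illustration of
the idea, not the analysis performed in the cited study.

```python
import pandas as pd

# Invented sample of a voice-command history (timestamp, command type).
history = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-09-01 07:02", "2021-09-01 07:05", "2021-09-01 18:30",
        "2021-09-02 07:01", "2021-09-02 18:45", "2021-09-03 07:03",
    ]),
    "command": ["alarm", "weather", "music", "alarm", "music", "alarm"],
})

history["hour"] = history["timestamp"].dt.hour

# When is the user typically active, and what do they ask for then?
activity_by_hour = history.groupby("hour")["command"].count()
usual_alarm_hour = history.loc[history["command"] == "alarm", "hour"].mode()[0]

print(activity_by_hour)
print("typical wake-up hour:", usual_alarm_hour)
```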
7 Conclusion
This paper concludes that, as technology advances, Intelligent Virtual Assistants are
becoming an increasingly important part of everyone's life. Although security is a
worry, we may gradually increase privacy and security. With the help of emerging
technologies such as gesture recognition, image recognition, video recognition, and
speech recognition, IVAs are continuously evolving to smooth human-machine interaction.
These systems can also be utilised for a variety of functions, including education, the
healthcare sector, medical aid, automated vehicles, and home automation. They can also
be a potential solution for customer service, training and education, facilitating
transactions, travel information, counselling, online ticket booking, reservation
bookings, remote banking, information inquiries, and many more services. With the use of
telepathic IVAs we can provide virtual support for disabled people, so they can
communicate with IVAs using their thoughts. As technology advances, cyber-security
methods are also improving to keep these systems safer and safer for the user.
References
[1] V. Këpuska and G. Bohouta, "Next-generation of virtual personal assistants
(Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home)," 2018 IEEE 8th Annual
Computing and Communication Workshop and Conference (CCWC), 2018, pp. 99-103, doi:
10.1109/CCWC.2018.8301638.
[3] H. Chung, M. Iorga, J. Voas and S. Lee, "Alexa, Can I Trust You?," in Computer,
vol. 50, no. 9, pp. 100-104, 2017, doi: 10.1109/MC.2017.3571053.
[5] CX Today, "How do bots and chatbots work?," May 24, 2021. Retrieved October 17,
2021, from https://www.cxtoday.com/contact-centre/how-do-bots-and-chatbots-work/
[6] M. B. Hoy, "Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants,"
Medical Reference Services Quarterly, vol. 37, no. 1, pp. 81-88, 2018, doi:
10.1080/02763869.2018.1404391.
[7] H. Chung, "Intelligent virtual assistant knows your life," arXiv, February 28, 2018.
Retrieved October 11, 2021, from https://arxiv.org/abs/1803.00466
[10] A. Sudhakar Reddy, M. Vyshnavi, and C. Raju Kumar, "Virtual Assistant Using
Artificial Intelligence," JETIR, March 2020. https://www.jetir.org/papers/JETIR2003165.pdf
Acknowledgment
I would like to take this opportunity to convey my gratitude and deep appreciation to
Sir Sunny Bodiwala for offering his continuous encouragement, support, and motivation
during this work. This work might not have come out in time without his encouragement
and valuable suggestions. The scientific research methodology he taught me has been an
important experience.
It gives me pleasure to express my deep sense of gratitude to the Head of the Computer
Science and Engineering Department, Dr. Mukesh A. Zaveri, for providing us an
opportunity to present our work. I also thank all the faculty members of Computer
Science and Engineering and my colleagues for spending their valuable time guiding me
during the work.
Prince Nandha
U19CS045