
2024 International Conference on Trends in Quantum Computing and Emerging Business Technologies

CHRIST (Deemed to be University), Pune Lavasa Campus, India. Mar 22-23, 2024

Artificial Intelligence-Based Chatbot with Voice Assistance

2024 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT) | 979-8-3503-8427-7/24/$31.00 ©2024 IEEE | DOI: 10.1109/TQCEBT59414.2024.10545197

A. Balamurugan, UG Scholar, Rathinam Technical Campus, Coimbatore, Tamilnadu, India. [email protected]
D. Thiruppathi, UG Scholar, Rathinam Technical Campus, Coimbatore, Tamilnadu, India. [email protected]
S. P. Santhoshkumar, Assistant Professor, Rathinam Technical Campus, Coimbatore, Tamilnadu, India. [email protected]
K. Susithra, Assistant Professor, Rathinam Technical Campus, Coimbatore, Tamilnadu, India. [email protected]

Abstract— The use of smartphones, tablets, and other portable devices has increased, and so has the use of voice-assisted chatbots such as Amazon's Alexa, Apple's Siri, and Google's Voice Assistant. Simple as these voice assistants are, they are very helpful in easing users' day-to-day tasks. Now that artificial intelligence (AI) based chatbots such as ChatGPT, Bard, and Microsoft Bing are progressively being used to get assistance with increasingly complicated tasks, there is a need for a chatbot that is both AI-based and able to provide voice assistance. To realize this need, we have developed a web application that integrates both: a large language model built on the GPT-3.5 Generative Pre-trained Transformer, and voice assistance. The application takes input in the form of an audio clip, converts it into text, sends the query to the GPT-3.5 model, receives the response in text form, and converts that back into an audio clip.

Keywords— voice assistants, chatbots, artificial intelligence, deep learning, voice chatbot, react, generative pre-trained transformers, gpt-3.5, openai

I. INTRODUCTION

The ever-increasing use of both voice assistants and artificial intelligence-based chatbots has led to many applications developed in the realms of voice assistance and text-only chatbots. These voice assistants immensely help users perform many simple tasks quickly, especially on portable devices such as smartphones, tablets, and laptops. For instance, Apple's Siri is widely used by users of Apple devices to make calls, set alarms and reminders, and perform simple calculations, hands-free. Google's Voice Assistant lets Android phone users do the same things on their phones. We constantly see improvements in the realm of these voice assistants, both in terms of new applications being built with better features and existing ones being updated with newer features. Voice assistants seem to be mostly used by individuals nowadays for music, search, and smart appliance management [1]. Yet, this innovation has the ability to go much farther as the gateway to a massive network of computing resources and machine learning (ML).

In recent times, the use of artificial intelligence-powered chatbots has also increased immensely. People prefer these chatbots to looking things up on the internet themselves, as chatbots can easily provide a thorough, detailed answer to the questions the users want answered. These chatbots prove to have the ability to simulate human cognitive abilities. A chatbot is a software program that carries out a dialogue between a user and the machine. In recent years, this form of virtual assistant has experienced a meteoric rise in popularity, largely as a result of significant advancements in artificial intelligence and machine learning, as well as foundational fields like neural networks and natural language processing. These chatbots use interactive questions to efficiently converse with any human. The number of cloud-based chatbot services now accessible for the growth and development of the chatbot industry has increased significantly in recent years [2], including IBM Watson, Cleverbot, the ELIZA chatbot, and many more. Over the last several years, the art of interaction between humans and artificial intelligence has significantly advanced as these conversational machines have become more sensitive.

Personal voice assistants (PVAs) are beneficial, and their usage rate is rising quickly. For instance, 81% of individuals in the U.S. possess a smartphone [3], and 21% of people possess at least one internet of things (IoT) based speaker [4]. Users are therefore highly likely to be constantly within range of at least one PVA. Users might not be mindful of them, have no control over how they behave, or be unable to disable them. Given that these gadgets can listen to and comprehend speech, there is also a concern about the privacy and security these new technologies offer. To address this issue, we have used a text-to-speech application programming interface (API) that does not store any data, in addition to the OpenAI GPT-3.5 model API, which provides transparency regarding the collection, storage, and handling of user data.

II. RELATED WORKS

A. Research Overview

Research examining possible usage hazards observed while utilizing voice aids was released in [5] by Sutcliffe et al. The research examined shortcomings, related sources of attack, and possible solutions that users of voice-controlled devices may utilize to safeguard themselves.

In [6], Sharif K. et al. investigated how conversational systems enable interaction between humans and computers by translating what users want into the operation of the gadgets, analyzing user voice, or interpreting user motions. The paper also examined the key developments, pressing problems, and difficulties associated with smart virtual assistants. In the research, a system of classification for IPA categorization was also suggested. The PICOC (population, intervention, comparison, outcome, and context) parameters were employed in the methodology.

Bérubé C. et al. investigated customer sentiment regarding virtual assistant usage in travel-related services that are somewhat utilitarian or hedonistic (air and hotel) in [7, 8]. The results of the investigation imply that societal factors, hedonistic incentive, anthropomorphism, desire for results and effort, and feelings about artificial intelligence gadgets all have an impact on tourists' adoption of VAs. The results of this study also imply that although using voice-controlled devices to give practical assistance is fine, using AI gadgets to deliver hedonistic services may not be the best idea.



An investigation into the use of AI in bots was offered by Anirudh Khanna, Bishwajeet Pandey, Kushagra Vasishta, and Kartik Kalia [9]. It covers the procedures involved in creating intelligent machines, discusses some of the most recent AI practices, and offers an alternative theory for enhancing the currently popular practices, as well as recommendations for distributed adoption. The development of artificially intelligent machines came about as a result. It also demonstrates that AI alone cannot provide a satisfactory outcome; one must also take into account the idea that intelligent machines will inevitably expand the capabilities of intelligent systems in the years to come. Artificial intelligence, or AI, is the term used to describe the development of intelligent machines that are integrated with backend algorithms. Intelligent machines have a wide range of abilities. Leading developments in the field of AI include deep learning, brain simulations, natural language processing (NLP), face recognition, and work on neural networks and cybersecurity, among other things. The creation of bots is one such instance of an AI system. Any such chatbot program may comprehend a variety of human languages thanks to NLP.

The primary focus of Natural Language Processing (NLP), a subfield of artificial intelligence research, is the analysis of human-machine communication using natural language. Such an AI chatbot may also carry out certain notable functions, such as conducting computations or setting reminders and alerts to inform users of their upcoming tasks.

B. Kinds of Voice Assistant Employments

The present appeal of consumer smart speakers like Alexa, HomePod, etc. is responsible for this. According to a 2019 survey, there are smart speakers in 35% of US homes, with that number expected to rise to 75 percent by 2025 [10]. The usage of humanoids is also widespread, since voice assistant usefulness depends on usability metrics like anthropomorphism [11]. In addition, Fig. 3 demonstrates that there has been very little research on voice-controlled devices for automobile interfaces. Car interfaces are voice-activated aids that stand between the motorist and the vehicle. The VA automobile interface enables users to view vehicle data and perform tasks without diverting their attention away from the road. The audio companion software integrated into computers or mobile devices is the fourth form of technological interaction. The research that we have gathered either utilized a version of a commercially available application interface, such as Alexa and Siri, or created new voice-controlled interfaces that are simple for consumers to use because of the widespread use of mobile phones and machine assistants via code and skills. Both, however, take the shape of various application agents.

Fig. 1. Popular Methods of Implementation of Voice Assistants

III. METHODOLOGY

A. Proposed Methodology

The creation of a web application using OpenAI's GPT-3.5 API, which is made accessible to the general public, is one of the technologies utilized to create the voice assistant powered by artificial intelligence. A user's input is collected by the website, which transmits it to the API; the API creates an AI-generated answer and sends it back to the web application, where it is spoken and displayed for the user. The overall flow of the application can be seen in Figure 2.

Similar to widely used voice assistants like Google Voice Assistant, Siri, and Alexa, the web application may accept user input in the form of speech and text. In addition to presenting the answer as text, the web application may also provide an articulate response in the form of voice to the user.

We decided to construct the website using the React framework, and we used the voice synthesis module that comes with React.

Fig. 2. Overall Flow of the Application

B. OpenAI GPT-3.5 API

The GPT-3.5 API, a tool created by OpenAI, gives programmers a means to include sophisticated language skills in their applications. It is a development of the GPT (Generative Pre-trained Transformer) series, which is renowned for its skill in producing text that resembles human speech in response to input cues.

With this API, we have been able to utilize the GPT-3.5 model to carry out a number of language-related activities, including writing emails, producing code, responding to inquiries, developing conversational bots, and more. The model can emulate human language patterns and provide logical and contextually appropriate replies, since it has been trained on a wide variety of online text.

We took advantage of this language model's capability by using the GPT-3.5 API to respond to the queries of this application's users. The API receives a prompt or a text input and responds with a text completion or answer depending on the input. It is comparable to having a flexible writing assistant who can handle various assignments.

It is important to keep in mind, however, that there are certain difficulties with the GPT-3.5 API. A notable shortcoming of the model for our use case is that it often generates information that sounds reasonable but is factually inaccurate or inappropriate for the context. To guarantee that the produced material satisfies their needs, users must carefully phrase the queries they pose to the voice assistant.
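The request/response exchange with the GPT-3.5 API described above can be sketched as follows. The endpoint and payload shape follow OpenAI's public chat-completions interface, but the function names and the system prompt here are illustrative assumptions, not the application's deployed code:

```javascript
// Build the JSON payload for a GPT-3.5 chat-completions request.
// The model name and message format follow OpenAI's public API;
// the system prompt is an illustrative assumption.
function buildChatRequest(userText) {
  return {
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content: "You are a helpful voice assistant. Keep answers short and speakable.",
      },
      { role: "user", content: userText },
    ],
  };
}

// Send the recognized speech (as text) to the API and return the reply text.
// Assumes an API key is available; error handling is omitted for brevity.
async function askGpt(userText, apiKey) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildChatRequest(userText)),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

The returned text can then be both rendered on the page and passed to the speech synthesis module, matching the flow in Figure 2.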

C. React Framework

The React framework is a well-liked JavaScript library used to create user interfaces for web applications. React, a Facebook creation, is now widely used because it is effective at producing interactive and dynamic UI components.

React has a component-based design that enables developers to build independent UI element units. These elements may be compared to building bricks that can be put together to create complex user interfaces. This method improves the modularity, reuse, and maintainability of the code, which makes it simpler to handle complicated projects.

The usage of a virtual DOM (Document Object Model) is one of React's distinguishing characteristics. Thanks to its virtual representation of the real DOM, React can efficiently update and render components by changing only the essential UI elements, which improves speed and responsiveness. This has made it simple for us to construct the web application, since user inquiries may be submitted to the server and shown on the screen without forcing the browser to reload the page.

In addition, the React ecosystem provides tools like React Router and Redux for managing application routing and state management, respectively. These extra tools complement React's features and make it easier to create extensive web apps.

D. Input Speech Recognition

We have used the React Speech Recognition module to convert the voice input from the users of the application to text. The converted text is sent to the OpenAI GPT-3.5 API, and a response is received from the API in text form. The text is then updated on the webpage seamlessly and is synthesized into spoken output.

The application starts listening for input once the user activates it. The application actively converts all the speech the user inputs into text and repeats this process for as long as there is speech input from the user. Once the user has stopped speaking, the application waits a moment to make sure there is no more input, then updates the web application with the speech it has recognized as the chat input from the user. After this, the converted text-form input is sent to the OpenAI GPT-3.5 API to generate a response powered by artificial intelligence.

Fig. 3. Input Flow of the Application

E. Speech Synthesis in React

React's voice synthesis module is a tool that enables web apps to produce spoken language from text. By adding an audio component to the user interface, this feature improves accessibility and user engagement. We have included the Web Speech API, a widely used technology that makes text-to-speech capability possible in web browsers, in the React framework. This API made it possible for the web application to speak the text-form answer from the OpenAI GPT-3.5 API.

We have given users the opportunity to hear material read aloud by using the voice synthesis module. This is useful for designing engaging experiences where users may multitask or take in information without having to read, as well as for those with visual impairments.

The API gives developers control over variables like voice selection, pitch, pace, and loudness, enabling them to precisely tailor the synthesized speech to the context of the application and the intended user experience.

The voice synthesis module should be implemented carefully, taking into account the demands and preferences of the user. Because some users might prefer a silent browsing experience, developers should also offer the option to disable or customize the speech synthesis feature. For this reason, we have also provided users with the option to switch from voice assistant mode to text-only chatbot mode whenever they want.

IV. MODULES

The modules used in the application include, firstly, the GPT-3.5 model, implemented through the OpenAI GPT-3.5 API.

A React-based web application has been built implementing the React Speech Recognition and React Speech Synthesis modules available in the React framework. These act as the intermediaries between the user's input to the application and the output response sent by the OpenAI GPT-3.5 API. These modules are explained in detail in the following subsections.

GPT-3.5 Model

Although GPT-3.5 is based on GPT-3, it adheres to particular ethical principles and has just 1.3 billion parameters, roughly 100 times fewer than the previous version. The architecture of this GPT-3.5 model can be seen in Figure 4. Also referred to as Instruct, this model was developed using datasets identical to GPT-3's, but with extra customization that included a notion dubbed "reinforcement learning with human feedback," or RLHF.

Artificial intelligence (AI) has a subfield, RLHF, that focuses on incorporating human input to enhance machine learning methods. In RLHF, the machine learning

algorithm receives input from the human and uses it to modify the behavior of the model. The restrictions of supervised and unsupervised training, where machine learning algorithms are only able to learn from labelled or unlabelled data, are addressed by this method.

Input from humans may take many different forms, including praising or penalizing the model's actions, labeling unidentified information, or changing parameter values. The purpose of RLHF is to enhance the performance and problem-solving capabilities of machine learning algorithms by incorporating human experience and knowledge.

The models in the GPT-3.5 series were trained using a combination of text and code that was present before Q4 2021. The GPT-3.5 series includes the following models: Code-davinci-002 is a rudimentary model that is well-suited for tasks that solely require code completion. The Instruct model text-davinci-002 is based on code-davinci-002. Text-davinci-003 enhances text-davinci-002 in many ways.

A refined variation of GPT-3.5, or Instruct, which is itself a tweaked variation of GPT-3, can participate in conversations about a range of subjects. This is the very well-known ChatGPT interaction architecture, with an average zero-shot accuracy between 77.40% and 82.89% for certain use cases [12].

Fig. 4. GPT-3.5 Architecture

A. Speech Recognition

Figure 5 illustrates the main parts of a conventional voice recognition system, which comprise the lexicon, language model, acoustic front-end, and decoder. The task of transforming the voice signal into relevant characteristics is handled by the acoustic front-end, which offers information that is helpful for recognition. Feature extraction is a method by which a source sound signal from a recording device is transformed into a series of fixed-size acoustic vectors. The audio vectors of the training data are used to estimate the characteristics of word and phone models. The decoder works by examining every potential word combination in order to identify the one that is most probable to have produced the signal. An acoustic model defines the probability P(O|W), and a language model determines P(W).

A variety of speech features are extracted from the audio voice stream for every single word or syllable unit as part of the automated speech identification system's capability. The speech parameters create a pattern that characterizes the word or sub-word by describing its fluctuation over time. The operator reads every word in the application's lexicon during a training session. The word models are saved, and when a word has to be recognized later on, its structure is compared to the saved models, and the word with the most similarities is chosen. This method is often referred to as pattern identification.

Fig. 5. Speech Recognition Architecture

Speech Synthesis

NLP and Digital Signal Processing (DSP) components make up the two parts of the TTS engine, as demonstrated in Figure 6. The NLP module includes prosodic, phonetic, and text analysis [13].

Text analysis: consists of three separate tasks. The content structure identification component, which sets the stage for the remaining components, is the first. The text normalization process, the second component, transforms raw text that includes representations like numbers and acronyms into the corresponding written-out words. The third task is linguistic evaluation, which identifies the meaning and grammatical characteristics of words.

Phonetic analysis: splits and classifies the written material into prosodic components, such as clauses, phrases, and sentences, and provides a phonetic transcription for every term. There are two methods for phonetic transformation: rule-based and dictionary-based. For unfamiliar terms, a rule-driven method is utilized, while for known words, a dictionary-based approach is used.

Prosody analysis: this branch of linguistic analysis examines the pitch and melodic elements of speech. Feelings, state of mind, and speaker disposition may all have an impact on prosody [14]. It establishes the tone, loudness, and length of speech models. The Digital Signal Processing (DSP) component includes data-driven (concatenative) and rule-driven voice generation techniques [15].

The general architecture of a text-to-speech module is shown in Figure 6.

The speech recognition module provided by the React framework for getting the user input is able to accurately recognize the user's speech, convert it into text, and send it as input to the OpenAI GPT-3.5 API.

The response received from the OpenAI GPT-3.5 API as the output is displayed as text on the web application and is spoken aloud by the React Speech Synthesis module in a clear, easy-to-understand voice, making the application a great voice assistant for everyday use.
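The pattern-identification step described in the speech recognition section, comparing an incoming word's acoustic feature vector against stored word models and choosing the closest match, can be illustrated with a minimal sketch. Real recognizers use far richer acoustic and language models than this; the Euclidean nearest-template comparison below is a simplification for illustration only, and the function names are our own:

```javascript
// Euclidean distance between two fixed-size acoustic feature vectors.
function distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Compare an observed feature vector against stored word templates
// and return the label of the most similar one (pattern identification).
function recognize(observed, templates) {
  let bestWord = null;
  let bestDist = Infinity;
  for (const [word, template] of Object.entries(templates)) {
    const d = distance(observed, template);
    if (d < bestDist) {
      bestDist = d;
      bestWord = word;
    }
  }
  return bestWord;
}
```

A production system would compare whole sequences of vectors (e.g., with dynamic time warping or an HMM) rather than single vectors, but the "choose the stored model with the most similarity" principle is the same.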

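The voice, pitch, pace, and loudness controls discussed in the speech synthesis sections correspond to properties of the Web Speech API's SpeechSynthesisUtterance, whose documented ranges are pitch 0–2, rate 0.1–10, and volume 0–1. A small helper that clamps caller-supplied values into those ranges (a defensive convention of ours, not part of the API itself) might look like:

```javascript
// Clamp a value into [min, max].
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Normalize utterance settings to the ranges documented for the
// Web Speech API: pitch 0-2, rate 0.1-10, volume 0-1 (defaults are 1).
function utteranceSettings({ pitch = 1, rate = 1, volume = 1 } = {}) {
  return {
    pitch: clamp(pitch, 0, 2),
    rate: clamp(rate, 0.1, 10),
    volume: clamp(volume, 0, 1),
  };
}

// In the browser, the settings would be applied like this
// (SpeechSynthesisUtterance exists only in browser environments):
//   const u = new SpeechSynthesisUtterance(answerText);
//   Object.assign(u, utteranceSettings({ rate: 1.2 }));
//   window.speechSynthesis.speak(u);
```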
V. RESULTS

The web application was built using the React framework, implementing the pre-trained GPT-3.5 model, so that the chatbot responds to and answers user queries as much like a human companion as possible. The model implemented through the OpenAI GPT-3.5 API is capable of responding quickly and as expected (Table I).

TABLE I.

Fig. 6. General Text-to-Speech Architecture

The overall user interface of the web application is easy to use and intuitive, with an optional dark mode for when the user needs a night theme on the application. A few sample screenshots of the web application's functionalities are given in Figures 7 to 9.

Fig. 7. Sample Output - 1

Fig. 8. Sample Output - 2

Fig. 9. Sample Output - 3

VI. CONCLUSION

As the demand for voice assistants and artificial intelligence based chatbots has increased drastically in recent times, there has been a need for applications that integrate the functionalities and practicality of both.

With the aid of React's voice recognition and speech synthesis plugins, the web application we created can interpret user inquiries quite well and provide understandable answers in response.

The average accuracy of the GPT-3.5 model has been shown to be between 77.40% and 82.89% for certain language-related use cases, which is good enough for the general-purpose questions that users need answered by chatbots.

This application could further be developed to work with Internet of Things (IoT) based devices, such as smart home appliances, and control them from the application, similar to how voice assistants that do not integrate artificial intelligence work.

The application could also be implemented with the OpenAI GPT-4.0 model, which would improve the accuracy of the responses from the model much further.

REFERENCES

[1] T. Ammari, J. Kaye, J. Y. Tsai, and F. Bentley, "Music, Search, and IoT: How people (really) use voice assistants," ACM Trans. Comput.-Hum. Interact., vol. 26.3, pp. 1-28, 2019.
[2] A. M. Rahman, Abdullah Al Mamun, and Alma Islam, "Programming challenges of chatbot: Current and future prospective," 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), 2017.
[3] Mobile Fact Sheet, 2021, [online] Available: https://www.pewresearch.org/internet/fact-sheet/mobile/.
[4] Gangadevi, E., Rani, R. S., Dhanaraj, R. K., & Nayyar, A. (2023). Spot-out fruit fly algorithm with simulated annealing optimized SVM for detecting tomato plant diseases. In Neural Computing and
Applications. Springer Science and Business Media LLC. https://doi.org/10.1007/s00521-023-09295-1.
[5] The Smart Audio Report, Aug. 2018, [online] Available: https://www.nationalpublicmedia.com.
[6] Sutcliffe A, Gault B. Heuristic evaluation of virtual reality applications. Interact Comput. 2004;16(4):831–49. https://doi.org/10.1016/j.intcom.2004.05.00.
[7] Sharif K, Tenbergen B. Smart home voice assistants: a literature survey of user privacy and security vulnerabilities. Complex Syst Inform Model Quart. 2020;24:15–30.
[8] Lugano G. Virtual assistants and self-driving cars. In: 2017 15th International Conference on ITS Telecommunications (ITST) (pp 1–5). IEEE; 2017.
[9] K. Sujigarasharma, M. L. Shri, E. Gangadevi, R. K. Dhanaraj, N. C and B. Balusamy, "Detection and Classification of Speech Disorder using FOA-SCNet," 2023 3rd International Conference on Computing and Information Technology (ICCIT), Tabuk, Saudi Arabia, 2023, pp. 391-395, doi: 10.1109/ICCIT58132.2023.10273910.
[10] Bérubé C, Schachner T, Keller R, Fleisch E, Wangenheim F, Barata F, Kowatsch T. Voice-based conversational agents for the prevention and management of chronic and mental health conditions: systematic literature review. J Med Internet Res. 2021;23(3):e25933.
[11] Anirudh Khanna, Bishwajeet Pandey, Kushagra Vashishta, and Kartik Kalia, "A study of Today's A.I through chatbots and Rediscovery of Machine Intelligence," International Journal of u- and e-Service, Science and Technology, vol. 8, no. 7, 2015.
[12] Moar JS. Covid-19 and the Voice Assistants Market. Juniper Research. Retrieved November 25, 2021, from https://www.juniperresearch.com/blog/august-2021/covid-19-and-the-voice-assistants-market.
[13] Vailshery LS. Topic: Smart speakers. Statista. Retrieved November 25, 2021, from https://www.statista.com/topics/4748/smart-speakers/#:~:text=As%20of%202019%20an%20estimated,increase%20to%20around%2075%20percent.
[14] M. Lawanya Shri, E. Ganga Devi, Balamurugan Balusamy, and Jyotir Moy Chatterjee, Ontology Based Information Retrieval and Matching in IoT Applications, Apple Academic Press and CRC Press, Hard ISBN: 9781771888646.
[15] Penteado, Maria & Perez, Fábio. (2023). Evaluating GPT-3.5 and GPT-4 on Grammatical Error Correction for Brazilian Portuguese.
[16] I. Isewon, O. Oyelade, and O. Oladipupo, "Design and implementation of text to speech conversion for visually impaired people," International Journal of Applied Information Systems, vol. 7, no. 2, pp. 26–30, 2012.
[17] P. Taylor, Text-to-Speech Synthesis. Cambridge University Press, 2009.
[18] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.

