Artificial Intelligence Powered Voice To Text and Text To Speech Recognition Model - A Powerful Tool For Student Comprehension of Tutor Speech
Fig. 1: CRISP-ML(Q) Methodological Framework, outlining its key components and steps.
(Source: Mind Map - 360DigiTMG)
Data plays a paramount role in the development of any ML or deep learning solution. Since the goal is to solve the accent problem faced by students, the data was collected from various sources on the internet. The dataset was constructed by combining audio files available online, focusing mainly on audio with thick accents. Diversity was maintained in order to obtain generalized transcriptions for different audio files. While diversity in the dataset can make the model more robust, this was not the case for transcription: on initial assessment, the transcripts generated for thick accents were of poor or below-average quality. To overcome this issue, several preprocessing and filtering techniques were used to boost the transcript quality.

Students travel to other places for quality education, which will eventually boost their financial status. However, being unable to understand a course because the tutor comes from a different native place is a major issue most students face, and it reduces their understanding.
II. METHODS AND TECHNIQUES
A. Data Dimensions
The data is constructed from various sources on the internet. The main focus is on Asian accents, including Chinese, Singaporean, South Asian and other accents. This results in a diverse dataset covering a broad audio distribution from various environments. However, while diversity in the audio helps produce a robust model, diversity in the transcripts would only lead to misleading outputs. The details of the data used to train the algorithm are given below.

Table 1: Data used to train the algorithm.
    Number of audio hours : 20
    Number of text files  : 20
    Size of all the files : 401 MB

B. Model Architecture
Since the focus of the research is to produce a good transcript for better student understanding, a simple architecture is maintained by using pre-trained models[6]. The main reason for using pre-trained models is that the model can be deployed easily on a local computer without the hassle of dealing with a huge number of parameters.

At a high level, the user provides a link and the model automatically produces the corresponding transcript. The generated transcript is passed to Google Text-to-Speech, which converts the text into a spoken audio file.
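To make this flow concrete, the sketch below chains the two pre-trained components. It is a minimal illustration assuming the openai-whisper, gTTS and requests Python packages; the helper name, model size and file names are ours and not taken from the deployed system.

```python
# Minimal end-to-end sketch: download the linked audio, transcribe it with a
# pre-trained Whisper model, and re-synthesize neutral speech with gTTS.
# Function name, model size and file names are illustrative assumptions.
import requests
import whisper          # pip install openai-whisper
from gtts import gTTS   # pip install gTTS

def url_to_clear_audio(audio_url: str, out_path: str = "clear_speech.mp3") -> str:
    """Return the transcript and write an accent-neutral audio file."""
    # 1. Fetch the audio file the user linked to.
    local_audio = "input_audio.mp3"
    with open(local_audio, "wb") as f:
        f.write(requests.get(audio_url, timeout=60).content)

    # 2. Speech-to-text with a small pre-trained Whisper model (CPU or GPU).
    model = whisper.load_model("base")
    transcript = model.transcribe(local_audio)["text"]

    # 3. Text-to-speech with gTTS, producing a neutral synthetic voice.
    gTTS(text=transcript, lang="en").save(out_path)
    return transcript
```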
Since the model uses Whisper for transcription, its architecture uses the encoder-decoder Transformer[7]. The audio is resampled to 16,000 Hz and an 80-channel log-magnitude Mel spectrogram is computed on 25-millisecond windows with a 10-millisecond stride. Once the log-Mel spectrogram[8] is computed, the input passes through two convolution layers with a filter width of 3 and the GELU[9] activation function; the two convolution layers have strides of 1 and 2, respectively. The output is combined with a positional encoding, after which the Transformer encoder follows. The encoder output is normalized before being passed to the decoder blocks. The decoder uses position embeddings and token representations. The encoder and decoder have the same number of blocks.
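For reference, the front end described above can be approximated with librosa under the stated framing parameters; this is a sketch, not Whisper's internal feature extractor.

```python
# Sketch of the 80-channel log-Mel front end described above, using librosa.
# This approximates, but is not identical to, Whisper's own implementation.
import librosa
import numpy as np

SR = 16000                  # target sample rate (Hz)
N_FFT = int(0.025 * SR)     # 25 ms window  -> 400 samples
HOP = int(0.010 * SR)       # 10 ms stride  -> 160 samples

def log_mel_spectrogram(path: str) -> np.ndarray:
    y, _ = librosa.load(path, sr=SR)                 # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=80
    )
    return librosa.power_to_db(mel)                  # log scale, shape (80, frames)
```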
Once the transcript is generated using the above model, it is passed through Google Text-to-Speech (gTTS)[10], a Python library and unofficial Google API for converting text to speech. Once a request has been made for the input text, the API returns audio data in the requested format; the supported formats are mp3 and wav. The language of the audio file can be set using the parameter 'lang'. The whole flow is automated using pipelines.[Fig.2]
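As an illustration of this step, a minimal gTTS call looks like the following; the transcript string and output filename are placeholders.

```python
# Minimal gTTS usage: convert a generated transcript into an mp3 file.
from gtts import gTTS

transcript = "Today we will cover the basics of machine learning."
tts = gTTS(text=transcript, lang="en")   # 'lang' selects the speech language
tts.save("lecture_audio.mp3")            # write the synthesized speech to disk
```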
C. Data Preprocessing
For a simpler approach, the data preprocessing is done using NLP techniques[11]. Once a transcription is generated, preprocessing follows multiple steps, starting with tokenization, stemming and a bag-of-words representation. Each sentence is broken down into word tokens and special characters are removed using the Regex library[12] in Python, a powerful library for normalizing text. Once the tokens are generated, the base word of each token is found using stemming. To understand the patterns in the text, the words are represented visually using a word cloud.
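A compact sketch of this preprocessing chain is shown below; it assumes NLTK for tokenization and stemming and the wordcloud package for visualization, which stand in for whichever exact tools were used.

```python
# Sketch of the transcript preprocessing chain: regex cleanup, tokenization,
# stemming and a word cloud. NLTK and the wordcloud package are assumed.
import re
import nltk
from nltk.stem import PorterStemmer
from wordcloud import WordCloud

nltk.download("punkt", quiet=True)       # tokenizer model used by word_tokenize

def preprocess(transcript: str) -> list[str]:
    cleaned = re.sub(r"[^a-zA-Z\s]", " ", transcript.lower())   # remove special characters
    tokens = nltk.word_tokenize(cleaned)                        # sentence -> word tokens
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in tokens]                # reduce tokens to base words

def save_word_cloud(tokens: list[str], path: str = "word_cloud.png") -> None:
    WordCloud(width=800, height=400).generate(" ".join(tokens)).to_file(path)
```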
Since the aim is to convert the text back into audio, some simple audio preprocessing is also performed to understand the nature of the sound visually. For this, Librosa[13] is used: a powerful package for music and audio analysis, widely used in fields such as music transcription, speech recognition and information retrieval. Once the audio signals are loaded with Librosa, it converts the audio files into NumPy arrays, making further computation easy and fast. After the audio signals are converted into this numerical representation, a waveform can be plotted to visually compare the original audio and the generated audio. A spectrogram is also plotted to show how the frequency content of a signal changes over time. For feature extraction, MFCC[14] (Mel-Frequency Cepstral Coefficients) is used, which is mainly applied in speech recognition and speaker identification tasks. Derived from the short-time Fourier transform (STFT), it captures the spectral characteristics of the signal. In the MFCC plot, the x-axis represents the time index, the y-axis represents the MFCC coefficients, and the color intensity represents the value of the corresponding coefficient: higher intensity means higher MFCC values and lower intensity means lower values. Overall, the MFCC graph helps to identify patterns in the MFCC values for further interpretation.[Fig.3]

Fig. 3: MFCC coefficients chart, showing the spectral information and power spectrum of an analyzed sound.
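The plots described above can be produced with a few librosa calls; the following sketch assumes the original and the gTTS-generated audio are available as files (the filenames are the placeholders used in the earlier sketches).

```python
# Sketch: compare the original and the generated audio visually with librosa.
import librosa
import librosa.display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 6))
for col, (label, path) in enumerate(
    [("original", "input_audio.mp3"), ("generated", "clear_speech.mp3")]
):
    y, sr = librosa.load(path, sr=16000)                        # audio as a NumPy array
    librosa.display.waveshow(y, sr=sr, ax=axes[0, col])         # waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # MFCC features
    librosa.display.specshow(mfcc, sr=sr, x_axis="time", ax=axes[1, col])
    axes[0, col].set_title(f"{label} waveform")
    axes[1, col].set_title(f"{label} MFCC coefficients")
plt.tight_layout()
plt.savefig("audio_comparison.png")
```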
D. Deployment
For easy usability and accessibility, organizations use cloud deployments. Here, AWS is used as it provides a wide variety of tools. One major aspect of deployment is to avoid data leakage; AWS comes with powerful data security and removes the need to invest in hardware.[Fig.2]
E. Implementation
The model is a solution that translates speech based on an understanding of the local accent. It generates output and also learns from user input and user-reported errors. The product is deployed on Amazon Web Services (AWS), which gives users the freedom to choose the deployment region, scale, security, isolation and so on. The model can run on both CPU and GPU, which offers different options for end users as well as cost-saving measures.

The product caters to students, teachers, educational institutions and other interested groups. Such models also learn from a constant feedback loop, which makes them better over time. This allows students to focus on studying and improving their grades and allows for much better communication between students and teachers, leading to overall development in society. The models used in the paper are open source, which reduces the overall time and cost, since we train only on a small dataset using transfer learning. The main components of the system are as follows; a sketch of the front end follows the list.

Frontend: a user interface designed using a Streamlit application in which the user uploads either text or audio.
Middleware: Amazon Web Services (AWS), where all the compute takes place; it is the heart of the project.
Backend: the place where all the files are uploaded, including Python scripts, models and their weights, as well as Streamlit files.
Database: used to store both structured and unstructured data; SQL stores structured data while NoSQL handles unstructured data.
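As an illustration of the frontend component, a minimal Streamlit page might look like the following; the widget labels and the call into url_to_clear_audio (the assumed helper from the earlier pipeline sketch, saved here as pipeline.py) are illustrative, not the deployed application.

```python
# Minimal Streamlit front-end sketch; not the deployed application.
import streamlit as st
from pipeline import url_to_clear_audio   # assumed: earlier sketch saved as pipeline.py

st.title("Tutor Speech Clarifier")
audio_url = st.text_input("Paste a link to the lecture audio")

if st.button("Transcribe and re-synthesize") and audio_url:
    transcript = url_to_clear_audio(audio_url)   # download, Whisper STT, gTTS TTS
    st.subheader("Transcript")
    st.write(transcript)
    st.subheader("Accent-neutral audio")
    st.audio("clear_speech.mp3")                 # file written by the helper
```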
III. RESULTS AND DISCUSSION

A good model is one that generates a high-quality transcription. A detailed transcript helps the student grasp a good understanding of a particular topic. The end result is a transcription along with an audio file in which any accent is stripped away. To boost the student's confidence in the reliability of the system, an accuracy figure is shown at the end.

The overall workflow starts by asking the user to enter the URL of an audio file. Once the corresponding file is received, transcription takes place. Here, the Whisper API is used to access the speech-to-text model; the API key used is a paid key and hence poses a usability challenge. To deal with video files, FFmpeg is used, a powerful open-source tool known for its audio extraction capabilities. The extracted audio file is then sent to the Whisper model to get the transcription.[Fig.2]
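Where the input is a video, the audio track can be pulled out with an FFmpeg call like the one sketched below; the filenames and the choice of a 16 kHz mono WAV output are assumptions, not the exact command used.

```python
# Sketch: extract a 16 kHz mono audio track from a video file with FFmpeg.
import subprocess

def extract_audio(video_path: str, audio_path: str = "extracted_audio.wav") -> str:
    subprocess.run(
        [
            "ffmpeg", "-y",      # overwrite the output file if it exists
            "-i", video_path,    # input video
            "-vn",               # drop the video stream
            "-ac", "1",          # mono
            "-ar", "16000",      # 16 kHz, matching the model front end
            audio_path,
        ],
        check=True,
    )
    return audio_path
```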
Fig. 4: The formula for word error rate (WER) used for ASR.
(Source: https://sonix.ai/articles/word-error-rate)
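For reference, the word error rate shown in Fig. 4 is the standard definition

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference transcript.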
Once the transcription is generated, the next aim is to take the transcription and generate audio from it. For a seamless flow, gTTS is used. The output shows the transcription, followed by an audio file and an accuracy value; both the transcription and the audio are provided for better understanding.

The quality of the generated synthetic audio is measured with the Mean Opinion Score (MOS)[16]. It is the arithmetic mean of the ratings given by human listeners who, after listening to a sample of the audio, give their opinion on its naturalness and intelligibility. It uses a scale from 1 to 5, where 1 means bad and 5 means excellent.[Fig.5]

Fig. 5: The formula for the Mean Opinion Score.
(Source: https://en.wikipedia.org/wiki/Mean_opinion_score)
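The score in Fig. 5 is the usual arithmetic mean over individual ratings,

```latex
\mathrm{MOS} = \frac{1}{N} \sum_{n=1}^{N} R_n
```

where R_n is the rating given by listener n on the 1 (bad) to 5 (excellent) scale and N is the number of ratings.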
A panel of four listeners rated audio samples from different parts of the output on naturalness, clean accent, intelligibility, pleasantness, articulation, overall impression and similar criteria[17]. We achieved a MOS of 4.2.[Fig.5]

IV. FUTURE SCOPE

To ensure the model works well in the real world, it has to be trained and fine-tuned on a larger and much more diverse dataset; we estimate that at least 1,000 hours of training data are required. More training will help the model reduce the WER[Fig.4] as close to zero as possible. For a robust model that delivers high-quality content, further fine-tuning is required to meet the performance standards.
To optimize model performance for real-world applications, we recognize the need for further training on a larger and more diverse dataset, ideally encompassing at least 1,000 hours of data. By expanding the training data, we aim to enhance the model's accuracy, targeting a Word Error Rate (WER) as close to zero as possible. Additionally, we plan to fine-tune subsequent models to ensure they meet our performance standards. This approach ensures that our models are robust and capable of delivering high-quality results in various contexts.

V. CONCLUSION

This work aims at a future where children do not have to experience hardship when they pursue education, as more of the youth around the globe start to pursue higher studies at the best available educational institutions and travel both locally and abroad to do so. There are also online courses through which the best education can be accessed without language barriers. Our paper aims to bridge the skill gap with the use of artificial intelligence and to transform the way students learn. This paper is only a first step towards a world where differences of language and accent do not matter.

REFERENCES
[1]. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.
[2]. gTTS — gTTS documentation.
[3]. Chang, J., & Nam, H. (2023). Exploring the feasibility of fine-tuning large-scale speech recognition models for domain-specific applications: A case study on Whisper model and KsponSpeech dataset. Phonetics Speech Sci., 15(3), 83-88. https://doi.org/10.13064/KSSS.2023.15.3.083
[4]. Boyd, S. (2003). Foreign-born Teachers in the Multilingual Classroom in Sweden: The Role of Attitudes to Foreign Accent. International Journal of Bilingual Education and Bilingualism, 6(3-4), 283-295. https://doi.org/10.1080/13670050308667786
[5]. Studer, S., Bui, T. B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., & Müller, K.-R. (2021). Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Mach. Learn. Knowl. Extr., 3, 392-413. https://doi.org/10.3390/make3020020
[6]. Qian, Y., Bian, X., Shi, Y., Kanda, N., Shen, L., Xiao, Z., & Zeng, M. (2021). Speech-Language Pre-Training for End-to-End Spoken Language Understanding. 7458-7462. https://doi.org/10.1109/ICASSP39728.2021.9414900
[7]. Verma, P., & Berger, J. (2021). Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions. arXiv preprint arXiv:2105.00335.
[8]. Meghanani, A., A. C. S., & Ramakrishnan, A. G. (2021). An Exploration of Log-Mel Spectrogram and MFCC Features for Alzheimer's Dementia Recognition from Spontaneous Speech. 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 670-677. https://doi.org/10.1109/SLT48900.2021.9383491
[9]. Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
[10]. Mankar, S., Khairnar, N., Pandav, M., Kotecha, H., & Ranjanikar, M. (2023). A Recent Survey Paper on Text-To-Speech Systems. International Journal of Advanced Research in Science, Communication and Technology, 77-82. https://doi.org/10.48175/IJARSCT-7954
[11]. Falessi, D., Cantone, G., & Canfora, G. (2010). A comprehensive characterization of NLP techniques for identifying equivalent requirements. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '10), Article 18, 1-10. Association for Computing Machinery. https://doi.org/10.1145/1852786.1852810
[12]. Uzun, E., Yerlikaya, T., & Kirat, O. (2018). Comparison of Python Libraries used for Web Data Extraction. 24, 87-92.
[13]. McFee, B., Raffel, C., Liang, D., McVicar, M., Battenberg, E., Ellis, D., & Nieto, O. (2015). librosa: Audio and Music Signal Analysis in Python. 18-24. https://doi.org/10.25080/Majora-7b98e3ed-003
[14]. Tiwari, V. (2010). MFCC and its applications in speaker recognition. Int. J. Emerg. Technol., 1.
[15]. Ali, A., & Renals, S. (2018). Word Error Rate Estimation for Speech Recognition: e-WER. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 20-24, Melbourne, Australia. Association for Computational Linguistics.
[16]. Streijl, R. C., Winkler, S., & Hands, D. S. (2016). Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives. Multimedia Systems, 22, 213-227. https://doi.org/10.1007/s00530-014-0446-1
[17]. Seufert, M. (2019). Fundamental Advantages of Considering Quality of Experience Distributions over Mean Opinion Scores. 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), Berlin, Germany, 1-6. https://doi.org/10.1109/QoMEX.2019.8743296