0% found this document useful (0 votes)

18 views10 pages

Iaesarticle

Uploaded by

midare

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

18 views10 pages

Iaesarticle

Uploaded by

midare

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

IAES International Journal of Artificial Intelligence (IJ-AI)

Vol. 99, No. 1, Month 2099, pp. 1∼1x

ISSN: 2252-8938, DOI: 10.11591/ijai.v99.i1.pp1-1x ❒ 1

Application of Deep Learning in Speech-to-Text Systems for

Touchscreen Keyboards
Abraham K.S. Lenson1 , Gregorius Airlangga1
1 Department of Information System, Faculty of Engineering, Atma Jaya Catholic University of Indonesia, Tangerang, Indonesia

Article Info ABSTRACT

Article history: In the realm of modern machine learning, speech recognition technology has
emerged as a paramount area of interest, driven by its vast potential applica-
Received month dd, yyyy
tions. However, there exists a noticeable gap in literature specifically addressing
Revised month dd, yyyy comparative analyses of various speech recognition models. This paper provides
Accepted month dd, yyyy a detailed comparison of multiple leading speech recognition systems, evaluat-
ing them based on their performance and accuracy. OpenASR Librispeech was
Keywords: used as the analysis dataset. To translate our research into a tangible outcome,
we have developed a proof-of-concept application. This application is created
Speech recognition in the form of an android keyboard and utilizes the most promising model iden-
Machine Learning tified in our study.
Audio
OpenAI Whisper This is an open access article under the CC BY-SA license.

Corresponding Author:
Abraham K.S. Lenson
Department of Information System, Faculty of Engineering, Atma Jaya Catholic University of Indonesia
Tangerang, Indonesia
Email: [email protected]

1. INTRODUCTION
Deep Learning is a subset of Machine Learning [1] that uses an artificial neural network to learn from
data [2]. A properly trained deep learning model is very robust, being able to complete tasks such as speech
recognition, language identification, translation, and many more [3]. This has been used to increase accessibil-
ity and is particularly useful to those with disabilities [4]. Studies previously done in regards to application are
often related to relatively specialised usecases including education and disabilities. Examples include Explo-
ration of Automatic Speech Recognition for Deaf and Hard of Hearing Students in Higher Education Classes
[5] and Speech disabilities in adults and the suitable speech recognition software tools - a review [6]. In this
paper we are going to focus on the usage of deep learning for speech to text solutions in regard to general usage
typing.
When looking at the current trajectory of previous trends, it has been noted that there is an increase in
the trend of using End-to-End (E2E) models for speech recognition [7]. E2E models differs from the ”conven-
tional” speech recognition pipeline. Whilst the ”conventional” method splits the process into multiple different
algorithms (acoustic model, language model, etc), E2E model transforms the output from input using a single
continuous process[8].
Industry giants including Nvidia has also shifted focus to the future of performance optimised machine
learning hardware [9]. Said factors have resulted in the explosive growth of consumer level deep learning
models. With models, often open source, being able to run not only on server farms but also consumer grade
hardware. Examples of models are OpenAI Whisper [3], Mozilla DeepSpeech [10], Kaldi [11] Vosk, Coqui
STT, PocketSphinx, PaddleSpeech [12], and many more.

Journal homepage: http://ijai.iaescore.com

2 ❒ ISSN: 2252-8938

The reasoning behind the creation of this project is due to the lack of comprehensive analysis compar-
ing the difference between available models. The current available research are either relatively outdated, such
as in the research paper ”Comparing open-source speech recognition toolkits” [13], or very specific in scope.
The aim of this project is twofold. The first goal being analysing and comparing current available
models, especially from a performance standpoint as end-users have relatively limited hardware capabilities
when compared to enterprise data centres. The second goal is to do a proof of concept implementation on
low-end devices, namely a virtual keyboard on mobile phones.

2. RELATED WORK AND BACKGROUND CONCEPTS

Speech-to-text, typically known under an umbrella term called speech recognition, is by itself not an
entirely new concept. Multiple implementations has already been made, example includes Google implement-
ing the feature in 2008 with Google Search by Voice[14]. While implementations between different models
vary, models typically follows the Input-Process-Output (IPO) model. The input being user speech and the
output being the transcript.
The research paper by Jinyu Li [7] documents the advances of speech recognition ever since the
adoption of deep neural network (DNN) based hybrid modeling approach. The paper dives deep into the
technical implementation of multiple speech models. Back when speech recognition system was first improved
upon at scale, it relies on heavily on the traditional hybrid approach. The hybrid approach consists of multiple
different parts and algorithms doing different things to achieve the desired result (ie. acoustic model, language
model, lexicon model, etc.). This model requires specialised implementation and expert knowledge. The
recent breakthrough in end-to-end (E2E) modeling changes the landscape by allowing the IPO process to be
done using a single network. This in turn (1) optimise the whole network stack, (2) Simplifies the translation
pipeline, and typically (3) more compact and easily deployable compared to the traditional hybrid model.

2.1. End-to-end speech recognition

End-to-end (E2E) speech recognition model transcribes speech input into human-readable output in a
single network. According to Yanzhang, it is a ”good candidates for on-device speech recognition”[15]. The
comparative analysis and implementation of E2E models will be the main focus of this research paper. The
speech models that will be compared are:
1. OpenAI Whisper
2. Mozilla Deepspeech
3. Kaldi Vosk
4. Coqui STT
5. PocketSphinx

2.2. Performance testing

Performance Testing is considered a non-functional type of testing and is used to know how well
the system performs [16]. The test will be a quantitative test where the performance of each models will be
compared against one another. Multiple factors are tested including Word Error Rate (WER), Real-Time Factor
(RTF), Memory (RAM) usage, processing speed, accuracy, and CPU usage.

2.3. Virtual keyboard

A virtual keyboard is a keyboard displayed on a screen, where the user typically taps on the ”virtual
keys” with either a finger or a stylus[17]. Virtual keyboard are currently the de-facto input method used in most
mobile devices. It has replaced the once ubiquitous 3x4 numeric keypad and eliminated the need for physical
keys.

3. METHOD
3.1. Model testing variables
Each model will be tested and compared against one another. The criterias consists of Word Error
Rate (WER) and Real-Time Factor (RTF), which is used in ”Comparison of Speech Recognition Performance
Between Kaldi and Google Cloud Speech API” [18]. Memory (RAM) usage, CPU usage, and Storage Re-
quirements will also be added as additional supporting factors.

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

Int J Artif Intell ISSN: 2252-8938 ❒ 3

Word Error Rate is calculated with the equation [19]:

S+D+I
W ER = (1)
N
where
S = number of substitutions
D = number of deletions
I = number of insertions
N = number of words in the reference
Real-Time Factor is calculated with [20]:
time(decode(a))
RT F = (2)
length

The closer the RTF is to 1, the closer the model is to real-time speed (Note that RTF of < 1 may only
be achieved in tests with pre-recorded data, as it is not bottlenecked by data creation speed).
The usage of RAM and CPU will be calculated by monitoring the system using top, the built in
Linux system performance monitoring tool. The data will be snapshotted every 2 seconds and averaged to get
the mean value. Lastly, the storage requirements will be done by calculating the size of the algorithm software
and the pre-trained model.
The analysis is done inside a Virtual Machine (VM) running:
a. OS: Fedora Linux 38 (x86 64)
b. CPU: Intel i7-9750H (6 cores allocated)
c. RAM: 16GB DDR4 2666MHz

3.2. Testing method

3.2.1. Model Acquisition
This research tests 5 models which includes Mozilla Deepspeech, PocketSphinx, Coqui STT, Vosk,
and OpenAI Whisper. The model itself is downloaded based on availability following this order: Sourced from
the official project method ⇒ Package maintained by distribution (Fedora) ⇒ Built from source file. Note
that there may be variances on the file size with respect to platform architecture (ie. x86, ARM) and operating
system.
3.2.2. Testing Data
The data used to test the models is test-clean.tar.gz, sourced from OpenASR LibriSpeech
ASR Corpus [21]. The dataset is picked due to it being used in most speech recognition models for training and
testing. Of all the models tested, only PocketSphinx excludes LibriSpeech from the training data, instead opting
to use pronounciation based detection system named Carnegie Mellon Pronouncing Dictionary (CMUdict).
We opted to limit the test data to 5 out of the 40 folders inside the dataset for efficiency reasons,
resulting in around 40 minutes of testing data. We picked the folders randomly using the command ls |
shuf -n 5. shuf is a random permutation generator built into most Linux distributions through GNU
coreutils. In this case, it resulted in folder 672, 1221, 2830, 4446, and 7176. The audio files are then merged
together with sox and converted to 16-bit wav files using ffmpeg[22]:
ffmpeg -i MIXED.flac -ar 16000 -ac 1 -c:a pcm\_s16le MIXED16.wav
The resulting audio is inputted into each model for testing. Each model is monitored with top -b
-d | grep $PROCESSNAME. The data is finally cleaned up using regex scripts and processed in Python.
3.3. Proof of concept implementation
The model with the best performance/accuracy compromise is then implemented as an android key-
board. The implementation is based on the adapted model by usefulsensors [23] and implementation by nyadla-
sys[24]. With the main UI consisting of 3 parts, that is 2 buttons and a single text field. The ”Record” button is
used to trigger the audio recording backend. The Audio is then processed by pressing the ”Transcribe” button,
with the resulting data outputted inside the text field. Tapping the text field will result in the text being outputted
as a keyboard input.

Application of Deep Learning in Speech-to-Text Systems for Touchscreen Keyboards (Abraham K.S. Lenson)
4 ❒ ISSN: 2252-8938

Figure 1. Proof of Concept Implementation

3.4. Proof of concept comparison method

This implementation is compared to the current most widely used implementation of Android Speech
to text system, Google Gboard. The comparison data used is a single folder picked from the previously unused
OpenASR LibriSpeech Corpus. Using shuf resulted in folder 908 being used. Note that the data used in this
section is relatively small, it is only used as a comparison and does not represent the overall accuracy of each
model.
For this test, we used Edifier R1700BT as our speaker. The phone used in this test is a Xiaomi Poco
X3 Pro, a midrange phone from 2021.

Figure 2. Illustrated Test Environment

To emulate a real world scenario while minimising variables, the phone is mounted on a tripod posi-
tioned 40cm away from the speaker. The speaker volume is set to maximum, with the system volume set to
50%, and the software volume set to either 50% of 70%, corresponding to 60dB (average conversation volume)
and 70dB (loudness of a classroom chatter) respectively[25].

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

Int J Artif Intell ISSN: 2252-8938 ❒ 5

4. RESULT & DISCUSSION

4.1. Model Comparison

Figure 3. CPU Usage Comparison

Out of all tested models, only Whisper is able to use more than a single CPU core for the Speech
Recognition Process (excluding software overhead).

Figure 4. RAM Usage Comparison

The base Vosk model used the most RAM, rising to near 7GB before dropping back to under 5.5GB.
Deepspeech and Coqui STT RAM usage slowly increases over time, using a scorer increases the early and con-
sistent RAM usage by around 600-800MB on both model. Whisper RAM usage spikes early before dropping
to a flat RAM allocation. Vosk small RAM usage slowly rises before stabilising, except for the final small rise
at the end also seen in the base model Vosk. PocketSphinx has the lowest RAM usage, stabilising at just under
200MB.

Application of Deep Learning in Speech-to-Text Systems for Touchscreen Keyboards (Abraham K.S. Lenson)
6 ❒ ISSN: 2252-8938

Table 1. Performance comparison of each model

Model Time (s) RTF WER CPU Memory Model Training Data Total
Usage %Core Usage (MB) Size (MB) Size (MB) Storage (MB)
1 DeepSpeech 1395.51 0.58 0.11 105.7 1335.6 9.2 1100.0 1109.2
2 DeepSpeech (No Scorer) 2900.99 1.2 0.15 102.5 937.4 9.2 180.2 189.4
3 PocketSphinx 506.66 0.21 0.29 98.9 139.5 1.04 36.11 37.15
4 Coqui STT 1082.22 0.45 0.08 100.6 1457.2 3.57 978.2 981.77
5 Coqui STT (No Scorer) 2362.74 0.97 0.11 100.0 1024.4 3.57 45.1 48.67
6 Vosk Small 217.46 0.09 0.1 98.0 388.0 2.4 67.6 70.0
7 Vosk 686.05 0.28 0.05 98.6 5099.7 2.4 2700.0 2702.4
8 Whisper (Tiny 1C) 418.86 0.17 0.06 98.5 448.5 0.92 74.1 75.02
9 Whisper (Tiny 4C) 122.22 0.05 0.06 335.6 458.1 0.92 74.1 75.02
10 Whisper (Base 1C) 941.32 0.39 0.04 98.5 575.8 0.92 141.1 142.01
11 Whisper (Base 4C) 334.56 0.14 0.04 357.4 580.1 0.92 141.1 142.01

The main contenders for implementation when considering the RTF are PocketSphinx, Vosk, Vosk Small,
and Whisper (Tiny). With Whisper leading in multicore environment, Vosk Small in single core, and Pocket-
Sphinx being slower than the two but using the least amount of RAM. However, Vosk is unsuitable for low-end
devices due to requiring large amount of memory.
When viewed from a WER perspective, PocketSphinx has the worst result at 29% WER. Whisper with
the tiny training data is 40% more accurate than Vosk small (6% and 10% WER respectively) at the cost of
double the processing time. This issue is alleviated by Whisper being able to process data with multiple CPU
cores. This resulted in Whisper being better in terms of accuracy while Vosk being able to work in heavily
resource restricted scenario.

Figure 5. Real-Time Factor Comparison

Figure 6. Word Error Rate Comparison

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

Int J Artif Intell ISSN: 2252-8938 ❒ 7

4.2. Proof of concept comparison

Figure 7. Real-Time Factor Comparison

The difference in Real-Time Factor are minimal, with Whisper being faster by 0.02 points in 60dB
and Google being faster by 0.03 points in 70dB testing.

Figure 8. Word Error Rate Comparison

Word Error Rate is where the result diverges. Starting with Whisper, it achieves a 22% WER for
60dB and 20% WER for 70dB. Google struggles in 60dB testing, failing to detect more than 60% of the total
sentences, resulting in a massive WER of 75%. At 70dB the detection rate improves to 75% resulting in a
WER of 33%, still 65% less accurate than Whisper at 70dB.

Table 2. Proof of Concept Performance Comparison

model Average Time (s) Average RTF (s) Word Error Rate
1 Whisper (60dB) 10.2 1.27 22%
2 Whisper (70dB) 10.29 1.29 20%
3 Google (60dB) 10.53 1.3 75%
4 Google (70dB) 10.49 1.26 33%

We speculate that the massive difference in error rate is caused by Google Gboard failing to detect
spoken speech, resulting in the input not getting processed. Meanwhile, the Whisper proof of concept imple-
mentation allows the user to directly control which part is considered speech, directly bypassing the issue.

Application of Deep Learning in Speech-to-Text Systems for Touchscreen Keyboards (Abraham K.S. Lenson)
8 ❒ ISSN: 2252-8938

5. CONCLUSION
From the analysis and testing that have been done, multiple conclusions could be reached. The con-
clusions are:
1. The current best open source local speech recognition system to date is Whisper by OpenAI. Although
Kaldi Vosk may work better in specific situations with restrictive performance constraints at the cost of
accuracy.
2. When compared to Google Voice, Whisper achieves comparable processing time while having higher
overall accuracy.
3. Local on-device speech recognition is a viable alternative to online speech recognition system.
In future projects, researchers could focus on user experience and accessibility testing, as this project
has shown that local speech recognition is no longer an issue when considering technical implementation.

REFERENCES
[1] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deep learning for visual
understanding: A review,” vol. 187, pp. 27–48. [Online]. Available: https://linkinghub.elsevier.com/
retrieve/pii/S0925231215017634
[2] P. S. Kumar, H. Behera, A. K. K, J. Nayak, and B. Naik, “Advancement from neural networks to
deep learning in software effort estimation: Perspective of two decades,” vol. 38, p. 100288. [Online].
Available: https://linkinghub.elsevier.com/retrieve/pii/S1574013720303889
[3] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition
via large-scale weak supervision.” [Online]. Available: http://arxiv.org/abs/2212.04356
[4] M. Oberembt, “The effects of text-to-speech on students with reading disabilities.” [Online]. Available:
https://scholarworks.uni.edu/grp/945
[5] J. Butler, B. Trager, and B. Behm, “Exploration of automatic speech recognition for deaf
and hard of hearing students in higher education classes,” in The 21st International ACM
SIGACCESS Conference on Computers and Accessibility. ACM, pp. 32–42. [Online]. Available:
https://dl.acm.org/doi/10.1145/3308561.3353772
[6] Balaji V and G. Sadashivappa, “Speech disabilities in adults and the suitable speech recognition software
tools - a review,” in 2015 International Conference on Computing and Network Communications
(CoCoNet). IEEE, pp. 559–564. [Online]. Available: http://ieeexplore.ieee.org/document/7411243/
[7] J. Li, “Recent advances in end-to-end automatic speech recognition.” [Online]. Available:
http://arxiv.org/abs/2111.01690
[8] S. Wang and G. Li, “Overview of end-to-end speech recognition,” vol. 1187, no. 5, p. 052068. [Online].
Available: https://iopscience.iop.org/article/10.1088/1742-6596/1187/5/052068
[9] B. Keller, R. Venkatesan, S. Dai, S. G. Tell, B. Zimmer, W. J. Dally, C. Thomas Gray, and B. Khailany,
“A 17–95.6 TOPS/w deep learning inference accelerator with per-vector scaled 4-bit quantization for
transformers in 5nm,” in 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and
Circuits). IEEE, pp. 16–17. [Online]. Available: https://ieeexplore.ieee.org/document/9830277/
[10] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta,
A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition.” [Online]. Available:
http://arxiv.org/abs/1412.5567
[11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek,
Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recognition toolkit,” in
IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing
Society, place: Hilton Waikoloa Village, Big Island, Hawaii, US.

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

Int J Artif Intell ISSN: 2252-8938 ❒ 9

[12] H. Zhang, T. Yuan, J. Chen, X. Li, R. Zheng, Y. Huang, X. Chen, E. Gong, Z. Chen, X. Hu, D. Yu,
Y. Ma, and L. Huang, “PaddleSpeech: An easy-to-use all-in-one speech toolkit.” [Online]. Available:
http://arxiv.org/abs/2205.12007
[13] C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy, and D. Suendermann-Oeft, “Comparing open-
source speech recognition toolkits.” [Online]. Available: http://suendermann.com/su/pdf/oasis2014.pdf
[14] Google. Now you can speak to google mobile app on your iPhone. [Online]. Available:
https://googleblog.blogspot.com/2008/11/now-you-can-speak-to-google-mobile-app.html
[15] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu,
R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao,
and A. Gruenstein, “Streaming end-to-end speech recognition for mobile devices.” [Online]. Available:
http://arxiv.org/abs/1811.06621
[16] TestingXperts. Performance testing – a complete guide. [Online]. Available: https://www.testingxperts.
com/blog/performance-testing-guide/

[17] M. Silfverberg, “Historical overview of consumer text entry technologies,” in Text Entry Systems. Else-
vier, pp. 3–25. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/B9780123735911500012
[18] T. Kimura, T. Nose, S. Hirooka, Y. Chiba, and A. Ito, “Comparison of speech recognition performance
between kaldi and google cloud speech API,” in Recent Advances in Intelligent Information Hiding and
Multimedia Signal Processing, J.-S. Pan, A. Ito, P.-W. Tsai, and L. C. Jain, Eds. Springer International
Publishing, vol. 110, pp. 109–115, series Title: Smart Innovation, Systems and Technologies. [Online].
Available: http://link.springer.com/10.1007/978-3-030-03748-2 13
[19] S. F. Chen, D. Beeferman, and R. Rosenfeld, “Evaluation metrics for language models,” publisher:
Carnegie Mellon University.
[20] O. Plátek, “Speech recognition using kaldi.”

[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public
domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 5206–5210. [Online]. Available: http://ieeexplore.ieee.org/document/7178964/
[22] S. Tomar, “Converting video formats with FFmpeg,” vol. 2006, no. 146, p. 10, publisher: Belltown Media
Houston, TX.

[23] N. Yadla, G. Nicholson, and M. Kudlur, “openai-whisper.” [Online]. Available: https://github.com/

usefulsensors/openai-whisper
[24] N. Yadla, V. Ninawe, S. mukthaparapu, and W. A. Khan, “Whisper.tflite.” [Online]. Available:
https://github.com/nyadla-sys/whisper.tflite

[25] OSHA. Occupational noise exposure. [Online]. Available: https://www.osha.gov/noise#loud

Application of Deep Learning in Speech-to-Text Systems for Touchscreen Keyboards (Abraham K.S. Lenson)
10 ❒ ISSN: 2252-8938

BIOGRAPHIES OF AUTHORS

Abraham K.S. Lenson Studied Information System in Atma Jaya Catholic University of Indone-
sia. In 2022, he earned a national scholarship and enrolled as an exchange student in University of
Galway, Ireland. His technical interests includes software development, Linux system management,
automation, and reverse engineering. He can be contacted at email:[email protected].

Patricia Melin received the D.Sc. degree (Doctor Habilitatus D.Sc.) in computer science
from the Polish Academy of Sciences, Warsaw, Poland, with the Dissertation “Hybrid Intelligent
Systems for Pattern Recognition using Soft Computing”. She is a Professor of Computer Science in
the Graduate Division, Tijuana Institute of Technology, Tijuana, Mexico since 1998. In addition, she
is serving as Director of Graduate Studies in computer science and Head of the research group on
Computational Intelligence (2000–present). Her research interests are in Type-2 Fuzzy Logic, Modu-
lar Neural Networks, Pattern Recognition, Neuro-Fuzzy and Genetic-Fuzzy hybrid approaches., She
is currently the President of Hispanic American Fuzzy Systems Association (HAFSA) and is the
founding Chair of the Mexican Chapter of the IEEE Computational Intelligence Society. She can be
contacted at email: [email protected].

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

The Day I Picked Up Dazai Side B
100% (1)
The Day I Picked Up Dazai Side B
32 pages
The Day I Picked Up Dazai Side A
100% (3)
The Day I Picked Up Dazai Side A
45 pages
Internal and External Data Sources For MIS
No ratings yet
Internal and External Data Sources For MIS
2 pages
Learning Together Is Fun!: Learning English Through Sharing Picture Books
No ratings yet
Learning Together Is Fun!: Learning English Through Sharing Picture Books
11 pages
DL Based Speech To Text Converter For Audio Visual Applications
No ratings yet
DL Based Speech To Text Converter For Audio Visual Applications
4 pages
Main (pt2)
No ratings yet
Main (pt2)
13 pages
7sem Projectreport
No ratings yet
7sem Projectreport
33 pages
Wa0002.
No ratings yet
Wa0002.
10 pages
6.python Text To Speech
No ratings yet
6.python Text To Speech
2 pages
1 s2.0 S0957417424009850 Main
No ratings yet
1 s2.0 S0957417424009850 Main
11 pages
Ai Project Sona-1 (1) - 250630 - 194118
No ratings yet
Ai Project Sona-1 (1) - 250630 - 194118
10 pages
Basepaperssummary
No ratings yet
Basepaperssummary
6 pages
Evaluation of State of Art Open-Source ASR Engines With Local Inferencing
No ratings yet
Evaluation of State of Art Open-Source ASR Engines With Local Inferencing
81 pages
Speech Recognition
No ratings yet
Speech Recognition
13 pages
Introduction to Programming Languages
From Everand
Introduction to Programming Languages
IntroBooks Team
4/5 (1)
NCSA SPIN Summer 2020 Research Report
No ratings yet
NCSA SPIN Summer 2020 Research Report
5 pages
Presentation 1
No ratings yet
Presentation 1
22 pages
CODING FOR ABSOLUTE BEGINNERS: How to Keep Your Data Safe from Hackers by Mastering the Basic Functions of Python, Java, and C++ (2022 Guide for Newbies)
From Everand
CODING FOR ABSOLUTE BEGINNERS: How to Keep Your Data Safe from Hackers by Mastering the Basic Functions of Python, Java, and C++ (2022 Guide for Newbies)
Eric Vargas
No ratings yet
Project in DSP Using Python
No ratings yet
Project in DSP Using Python
3 pages
(IJCST-V9I2P18) :swati, Harpreet Kaur
No ratings yet
(IJCST-V9I2P18) :swati, Harpreet Kaur
6 pages
Synopsis
No ratings yet
Synopsis
5 pages
DL Proj Rep
No ratings yet
DL Proj Rep
11 pages
Development of Multilingual Speech
No ratings yet
Development of Multilingual Speech
13 pages
LOTED: a semantic web portal for the management of tenders from the European Community
From Everand
LOTED: a semantic web portal for the management of tenders from the European Community
Francesco Valle
No ratings yet
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
No ratings yet
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
10 pages
Lightweight End-To-End Text-To-Speech Synthesis Fo
No ratings yet
Lightweight End-To-End Text-To-Speech Synthesis Fo
6 pages
KY DSV
No ratings yet
KY DSV
7 pages
Onine Speech To Text Engine For Delimited Context
No ratings yet
Onine Speech To Text Engine For Delimited Context
90 pages
Speech Recognition On Mobile Devices
No ratings yet
Speech Recognition On Mobile Devices
27 pages
Final Synopsis PANS
No ratings yet
Final Synopsis PANS
14 pages
Speech Recognition System Using Python Report
No ratings yet
Speech Recognition System Using Python Report
7 pages
Mad Lab Report
0% (2)
Mad Lab Report
27 pages
IJCRT2204469
No ratings yet
IJCRT2204469
5 pages
Image Collection Exploration: Unveiling Visual Landscapes in Computer Vision
From Everand
Image Collection Exploration: Unveiling Visual Landscapes in Computer Vision
Fouad Sabry
No ratings yet
Performance Evaluation of Offline Speech Recogniti
No ratings yet
Performance Evaluation of Offline Speech Recogniti
16 pages
Basics of Programming: A Comprehensive Guide for Beginners: Essential Coputer Skills, #1
From Everand
Basics of Programming: A Comprehensive Guide for Beginners: Essential Coputer Skills, #1
DG. Junior
No ratings yet
133-138, Tesma0810, IJEAST
No ratings yet
133-138, Tesma0810, IJEAST
6 pages
Literature Survey1
No ratings yet
Literature Survey1
4 pages
Scaling Up Online Speech Recognition Using ConvNets
No ratings yet
Scaling Up Online Speech Recognition Using ConvNets
8 pages
Paper 5728
No ratings yet
Paper 5728
3 pages
Math El
No ratings yet
Math El
17 pages
The Newbie’s Guidebook to ChatGPT: A Beginner's Tutorial: The Newbie’s Guidebook
From Everand
The Newbie’s Guidebook to ChatGPT: A Beginner's Tutorial: The Newbie’s Guidebook
Timothy King
No ratings yet
Conversionof Image, Value, and Text To Speech by Using Machine Learning
No ratings yet
Conversionof Image, Value, and Text To Speech by Using Machine Learning
16 pages
Speech Representation Models For Speech Synthesis and Multimodal Speech Recognition
No ratings yet
Speech Representation Models For Speech Synthesis and Multimodal Speech Recognition
63 pages
Anurag Synop
No ratings yet
Anurag Synop
9 pages
1.1 Background To The Study: Chapter One: Introduction
No ratings yet
1.1 Background To The Study: Chapter One: Introduction
4 pages
Text To Speech Speech To Text Using Translations (Mini Project)
No ratings yet
Text To Speech Speech To Text Using Translations (Mini Project)
46 pages
Speech To Text Conversion
No ratings yet
Speech To Text Conversion
7 pages
AI Based Reading System For Blind Using OCR
No ratings yet
AI Based Reading System For Blind Using OCR
4 pages
Speech To Text
No ratings yet
Speech To Text
17 pages
Transformer-Transducer End-to-End Speech Recognition With Self-Attention
No ratings yet
Transformer-Transducer End-to-End Speech Recognition With Self-Attention
5 pages
Speechrecogn
No ratings yet
Speechrecogn
15 pages
From Zero to Market with Flutter
From Everand
From Zero to Market with Flutter
Viachaslau Lyskouski
No ratings yet
From Zero to Market with Flutter: Desktop, Mobile, and Web Distribution
From Everand
From Zero to Market with Flutter: Desktop, Mobile, and Web Distribution
Viachaslau Lyskouski
No ratings yet
Case Study: Speech Recognition For Virtual Assistants: 1. Problem Identification
No ratings yet
Case Study: Speech Recognition For Virtual Assistants: 1. Problem Identification
8 pages
Presentation 3
No ratings yet
Presentation 3
24 pages
74 Revised Manuscript
No ratings yet
74 Revised Manuscript
9 pages
Deep Speech - Scaling Up End-To-End Speech Recognition
No ratings yet
Deep Speech - Scaling Up End-To-End Speech Recognition
12 pages
Learning .NET High-performance Programming
From Everand
Learning .NET High-performance Programming
Antonio Esposito
No ratings yet
Learn C++
From Everand
Learn C++
Aishik Dutta
No ratings yet
Speech To Image Conversion: Shaik Karishma, Siddu Devi Naga Susmitha, Nanditha Katari, G. Sirisha
No ratings yet
Speech To Image Conversion: Shaik Karishma, Siddu Devi Naga Susmitha, Nanditha Katari, G. Sirisha
5 pages
Session 5 - Speech Recognition
No ratings yet
Session 5 - Speech Recognition
20 pages
Raspberry Pi-Based Ai System For Speech Transcription
No ratings yet
Raspberry Pi-Based Ai System For Speech Transcription
5 pages
Speech Recognition Report
No ratings yet
Speech Recognition Report
46 pages
1.3 Capstone Project Overview - 26.07.2023
No ratings yet
1.3 Capstone Project Overview - 26.07.2023
7 pages
3.2 Game Design Documentation - 26.07.2023
No ratings yet
3.2 Game Design Documentation - 26.07.2023
4 pages
Sorcerer (Alternate) - Sorcerous Origins (Archmage)
No ratings yet
Sorcerer (Alternate) - Sorcerous Origins (Archmage)
15 pages
Breeding Scheme
No ratings yet
Breeding Scheme
15 pages
Interview Evaluation Sheet - V3 - Jatin Bansal
No ratings yet
Interview Evaluation Sheet - V3 - Jatin Bansal
3 pages
DJI FPV Goggles Disclaimer and Safety Guidelines
No ratings yet
DJI FPV Goggles Disclaimer and Safety Guidelines
54 pages
A320 Limitations
No ratings yet
A320 Limitations
19 pages
Full Charm SLD
0% (1)
Full Charm SLD
31 pages
Understanding Operating Systems 7th Edition by Ida Flynn, Ann McIver McHoes 128509655X 978-1285096551instant Download
100% (4)
Understanding Operating Systems 7th Edition by Ida Flynn, Ann McIver McHoes 128509655X 978-1285096551instant Download
85 pages
Vestibular Neuritis and Labyrinthitis - UpToDate PDF
No ratings yet
Vestibular Neuritis and Labyrinthitis - UpToDate PDF
18 pages
Identification: Vulnerable Individual (Assessment)
No ratings yet
Identification: Vulnerable Individual (Assessment)
20 pages
Internal Energy Change Equations
No ratings yet
Internal Energy Change Equations
2 pages
MUX74HC4067 - Codebender
No ratings yet
MUX74HC4067 - Codebender
8 pages
Haas ST-10 Series Lathes: The High-Performance Turning Centers
No ratings yet
Haas ST-10 Series Lathes: The High-Performance Turning Centers
2 pages
Rehan
No ratings yet
Rehan
1 page
Businesses Proposal
No ratings yet
Businesses Proposal
9 pages
Functional Level Strategy of Starbucks
No ratings yet
Functional Level Strategy of Starbucks
25 pages
Farm Life By: Jenny Rose R. Santos
No ratings yet
Farm Life By: Jenny Rose R. Santos
6 pages
Files Reviewer
No ratings yet
Files Reviewer
24 pages
3) Unemployment and Types of Unemployment
No ratings yet
3) Unemployment and Types of Unemployment
4 pages
MSDS Kay
No ratings yet
MSDS Kay
9 pages
MESOPOTAMIA
No ratings yet
MESOPOTAMIA
22 pages
Aroon Kumar: "Award Winning Global Marketer and Digital Business Leader"
No ratings yet
Aroon Kumar: "Award Winning Global Marketer and Digital Business Leader"
6 pages
Team and Team Building
No ratings yet
Team and Team Building
12 pages
B 8145 C 694
No ratings yet
B 8145 C 694
42 pages
High Level Design Service For Prisma SD Wan
No ratings yet
High Level Design Service For Prisma SD Wan
7 pages
SMBTA43-Siemens Semiconductor Group
No ratings yet
SMBTA43-Siemens Semiconductor Group
4 pages
T SC 425 All About Jupiter Powerpoint - Ver - 3
100% (1)
T SC 425 All About Jupiter Powerpoint - Ver - 3
17 pages
MGT610 Objective File For Final Term
No ratings yet
MGT610 Objective File For Final Term
175 pages

Iaesarticle

Uploaded by

Iaesarticle

Uploaded by

IAES International Journal of Artificial Intelligence (IJ-AI)

Vol. 99, No. 1, Month 2099, pp. 1∼1x

Application of Deep Learning in Speech-to-Text Systems for

Article Info ABSTRACT

Journal homepage: http://ijai.iaescore.com

2. RELATED WORK AND BACKGROUND CONCEPTS

2.1. End-to-end speech recognition

2.2. Performance testing

2.3. Virtual keyboard

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

Word Error Rate is calculated with the equation [19]:

3.2. Testing method

Figure 1. Proof of Concept Implementation

3.4. Proof of concept comparison method

Figure 2. Illustrated Test Environment

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

4. RESULT & DISCUSSION

Figure 3. CPU Usage Comparison

Figure 4. RAM Usage Comparison

Table 1. Performance comparison of each model

Figure 5. Real-Time Factor Comparison

Figure 6. Word Error Rate Comparison

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

4.2. Proof of concept comparison

Figure 7. Real-Time Factor Comparison

Figure 8. Word Error Rate Comparison

Table 2. Proof of Concept Performance Comparison

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

[23] N. Yadla, G. Nicholson, and M. Kudlur, “openai-whisper.” [Online]. Available: https://github.com/

[25] OSHA. Occupational noise exposure. [Online]. Available: https://www.osha.gov/noise#loud

Int J Artif Intell, Vol. 99, No. 1, Month 2099: 1–1x

You might also like