Corresponding Author:
Abraham K.S. Lenson
Department of Information System, Faculty of Engineering, Atma Jaya Catholic University of Indonesia
Tangerang, Indonesia
Email: [email protected]
1. INTRODUCTION
Deep Learning is a subset of Machine Learning [1] that uses artificial neural networks to learn from
data [2]. A properly trained deep learning model is very robust and can perform tasks such as speech
recognition, language identification, translation, and many more [3]. This capability has been used to increase
accessibility and is particularly useful to people with disabilities [4]. Previous application studies often target
relatively specialised use cases such as education and disability support; examples include "Exploration of
Automatic Speech Recognition for Deaf and Hard of Hearing Students in Higher Education Classes" [5] and
"Speech disabilities in adults and the suitable speech recognition software tools - a review" [6]. In this paper
we focus on the use of deep learning in speech-to-text solutions for general-purpose typing.
Looking at recent trends, there has been a clear shift towards End-to-End (E2E) models for speech
recognition [7]. E2E models differ from the "conventional" speech recognition pipeline: whereas the
conventional approach splits the process into several separate components (acoustic model, language model,
etc.), an E2E model transforms the input into the output in a single continuous process [8].
Industry giants, including Nvidia, have also shifted their focus towards performance-optimised machine
learning hardware [9]. These factors have resulted in the explosive growth of consumer-level deep learning
models, many of them open source, that can run not only on server farms but also on consumer-grade
hardware. Examples include OpenAI Whisper [3], Mozilla DeepSpeech [10], Kaldi [11] Vosk, Coqui
STT, PocketSphinx, PaddleSpeech [12], and many more.
This project was motivated by the lack of comprehensive analyses comparing the currently available
models. Existing research is either relatively outdated, such as the paper "Comparing open-source speech
recognition toolkits" [13], or very narrow in scope.
The aim of this project is twofold. The first goal is to analyse and compare currently available
models, especially from a performance standpoint, as end users have relatively limited hardware capabilities
compared to enterprise data centres. The second goal is a proof-of-concept implementation on
low-end devices, namely a virtual keyboard on mobile phones.
3. METHOD
3.1. Model testing variables
Each model will be tested and compared against the others. The criteria consist of Word Error
Rate (WER) and Real-Time Factor (RTF), which are also used in "Comparison of Speech Recognition Performance
Between Kaldi and Google Cloud Speech API" [18]. Memory (RAM) usage, CPU usage, and storage
requirements will also be recorded as supporting factors.
The closer the RTF is to 1, the closer the model is to real-time speed (note that an RTF below 1 can only
be achieved in tests with pre-recorded data, since live input is bottlenecked by the speed at which the audio is produced).
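For clarity, WER is the word-level edit distance between the reference transcript and the model output divided by the number of reference words, and RTF is the processing time divided by the audio duration. The following is a minimal sketch of both metrics in Python; the transcribe() call in the usage comment is a placeholder, not part of any tested toolkit.

import time

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (insertions, deletions, substitutions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: time spent transcribing divided by the duration of the audio."""
    return processing_seconds / audio_seconds

# Usage (transcribe() stands for any of the tested models):
# start = time.time()
# hypothesis = transcribe("clip.flac")
# print(wer(reference_text, hypothesis), rtf(time.time() - start, 60.0))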
RAM and CPU usage will be measured by monitoring the system with top, the built-in
Linux performance monitoring tool. The readings will be snapshotted every 2 seconds and averaged to obtain
the mean value. Lastly, the storage requirement will be determined by measuring the combined size of the
recognition software and its pre-trained model.
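The sampling-and-averaging step can be sketched as follows. The paper's measurements were taken with top, so the psutil-based snippet below is only an equivalent illustration, and the process ID is a placeholder.

import time
import psutil

def monitor(pid: int, duration_s: float = 60.0, interval_s: float = 2.0) -> None:
    """Snapshot RSS memory and CPU usage of one process every 2 s and print the means."""
    proc = psutil.Process(pid)
    mem_mib, cpu_pct = [], []
    proc.cpu_percent(None)                                # prime the CPU counter
    for _ in range(int(duration_s / interval_s)):
        time.sleep(interval_s)
        mem_mib.append(proc.memory_info().rss / 2**20)    # resident memory in MiB
        cpu_pct.append(proc.cpu_percent(None))            # % of a single core since last call
    print(f"mean RAM: {sum(mem_mib) / len(mem_mib):.1f} MiB")
    print(f"mean CPU: {sum(cpu_pct) / len(cpu_pct):.1f} %")

# monitor(pid=12345)   # placeholder PID of the running recognition process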
The analysis is done inside a Virtual Machine (VM) running:
a. OS: Fedora Linux 38 (x86_64)
b. CPU: Intel i7-9750H (6 cores allocated)
c. RAM: 16GB DDR4 2666MHz
This implementation is compared against the most widely used Android speech-to-text implementation,
Google Gboard. The comparison data is a single folder picked from the previously unused OpenASR
LibriSpeech corpus; selecting it with shuf resulted in folder 908 being used. Note that the data used in this
section is relatively small; it serves only as a comparison and does not represent the overall accuracy of each
model.
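The selection step was done with the coreutils shuf command; an equivalent sketch in Python is shown below, with the corpus path as a placeholder.

import os
import random

# Equivalent of `ls "$CORPUS" | shuf -n 1`: pick one speaker folder at random.
corpus_root = "/data/LibriSpeech/test-clean"   # placeholder path to the unused corpus split
speaker_folders = sorted(os.listdir(corpus_root))
print(random.choice(speaker_folders))          # the run reported in the paper yielded folder 908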
For this test, we used an Edifier R1700BT as the playback speaker. The phone used is a Xiaomi Poco
X3 Pro, a mid-range phone from 2021.
To emulate a real-world scenario while minimising variables, the phone is mounted on a tripod
positioned 40 cm away from the speaker. The speaker volume is set to maximum, the system volume to
50%, and the software volume to either 50% or 70%, corresponding to 60 dB (average conversation volume)
and 70 dB (the loudness of classroom chatter) respectively [25].
Out of all the tested models, only Whisper is able to use more than a single CPU core for the speech
recognition process itself (excluding software overhead).
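For reference, the number of CPU threads Whisper uses can be set through PyTorch. The snippet below is a minimal sketch assuming the reference openai-whisper Python package; the thread count and file name are illustrative only.

import torch
import whisper  # reference openai-whisper package

torch.set_num_threads(6)                    # allow PyTorch to use all six allocated cores
model = whisper.load_model("tiny", device="cpu")
result = model.transcribe("sample.flac")    # LibriSpeech audio files are FLAC
print(result["text"])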
The base Vosk model used the most RAM, rising to nearly 7 GB before dropping back to under 5.5 GB.
DeepSpeech and Coqui STT RAM usage slowly increases over time, and using a scorer raises the early and
steady-state RAM usage by around 600-800 MB on both models. Whisper's RAM usage spikes early before
settling at a flat allocation. Vosk Small's RAM usage rises slowly before stabilising, apart from a final small
rise at the end that is also seen in the base Vosk model. PocketSphinx has the lowest RAM usage, stabilising
at just under 200 MB.
Considering RTF, the main contenders for implementation are PocketSphinx, Vosk, Vosk Small,
and Whisper (Tiny), with Whisper leading in multi-core environments, Vosk Small leading in single-core
environments, and PocketSphinx being slower than both while using the least RAM. Vosk, however, is
unsuitable for low-end devices because it requires a large amount of memory.
From a WER perspective, PocketSphinx has the worst result at 29% WER. Whisper with the
tiny model is 40% more accurate than Vosk Small (6% and 10% WER respectively) at the cost of
double the processing time, an issue alleviated by Whisper's ability to process data on multiple CPU
cores. Overall, Whisper is better in terms of accuracy, while Vosk Small can still operate in heavily
resource-restricted scenarios.
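The 40% figure is the relative reduction in WER, with Vosk Small as the baseline:
\[
\frac{\mathrm{WER}_{\text{Vosk Small}} - \mathrm{WER}_{\text{Whisper tiny}}}{\mathrm{WER}_{\text{Vosk Small}}}
= \frac{0.10 - 0.06}{0.10} = 0.40
\]
The 65% figure quoted in the Gboard comparison below is computed analogously, relative to Whisper's 20% WER: (0.33 - 0.20)/0.20 = 0.65.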
The differences in Real-Time Factor are minimal, with Whisper faster by 0.02 points in the 60 dB test
and Google faster by 0.03 points in the 70 dB test.
Word Error Rate is where the results diverge. Whisper achieves a 22% WER at 60 dB and a 20% WER
at 70 dB. Google struggles in the 60 dB test, failing to detect more than 60% of the sentences and ending
with a massive WER of 75%. At 70 dB its detection rate improves to 75%, resulting in a WER of 33%,
still 65% less accurate than Whisper at 70 dB.
We speculate that the large difference in error rate is caused by Google Gboard failing to detect
spoken speech, so the input never gets processed. The Whisper proof-of-concept implementation, by contrast,
lets the user directly control which part of the audio is treated as speech, bypassing this issue entirely.
5. CONCLUSION
From the analysis and testing that has been carried out, several conclusions can be drawn:
1. The best open-source local speech recognition system currently available is Whisper by OpenAI, although
Kaldi Vosk may work better in specific situations with restrictive performance constraints, at the cost of
accuracy.
2. Compared to Google Gboard's voice typing, Whisper achieves comparable processing time while delivering
higher overall accuracy.
3. Local on-device speech recognition is a viable alternative to online speech recognition systems.
In future work, researchers could focus on user experience and accessibility testing, as this project
has shown that the technical implementation of local speech recognition is no longer an obstacle.
REFERENCES
[1] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deep learning for visual
understanding: A review,” vol. 187, pp. 27–48. [Online]. Available: https://linkinghub.elsevier.com/
retrieve/pii/S0925231215017634
[2] P. S. Kumar, H. Behera, A. K. K, J. Nayak, and B. Naik, “Advancement from neural networks to
deep learning in software effort estimation: Perspective of two decades,” vol. 38, p. 100288. [Online].
Available: https://linkinghub.elsevier.com/retrieve/pii/S1574013720303889
[3] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition
via large-scale weak supervision.” [Online]. Available: http://arxiv.org/abs/2212.04356
[4] M. Oberembt, “The effects of text-to-speech on students with reading disabilities.” [Online]. Available:
https://scholarworks.uni.edu/grp/945
[5] J. Butler, B. Trager, and B. Behm, “Exploration of automatic speech recognition for deaf
and hard of hearing students in higher education classes,” in The 21st International ACM
SIGACCESS Conference on Computers and Accessibility. ACM, pp. 32–42. [Online]. Available:
https://dl.acm.org/doi/10.1145/3308561.3353772
[6] Balaji V and G. Sadashivappa, “Speech disabilities in adults and the suitable speech recognition software
tools - a review,” in 2015 International Conference on Computing and Network Communications
(CoCoNet). IEEE, pp. 559–564. [Online]. Available: http://ieeexplore.ieee.org/document/7411243/
[7] J. Li, “Recent advances in end-to-end automatic speech recognition.” [Online]. Available:
http://arxiv.org/abs/2111.01690
[8] S. Wang and G. Li, “Overview of end-to-end speech recognition,” vol. 1187, no. 5, p. 052068. [Online].
Available: https://iopscience.iop.org/article/10.1088/1742-6596/1187/5/052068
[9] B. Keller, R. Venkatesan, S. Dai, S. G. Tell, B. Zimmer, W. J. Dally, C. Thomas Gray, and B. Khailany,
“A 17–95.6 TOPS/w deep learning inference accelerator with per-vector scaled 4-bit quantization for
transformers in 5nm,” in 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and
Circuits). IEEE, pp. 16–17. [Online]. Available: https://ieeexplore.ieee.org/document/9830277/
[10] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta,
A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition.” [Online]. Available:
http://arxiv.org/abs/1412.5567
[11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek,
Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The kaldi speech recognition toolkit," in
IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing
Society, Hilton Waikoloa Village, Big Island, Hawaii, US.
[12] H. Zhang, T. Yuan, J. Chen, X. Li, R. Zheng, Y. Huang, X. Chen, E. Gong, Z. Chen, X. Hu, D. Yu,
Y. Ma, and L. Huang, “PaddleSpeech: An easy-to-use all-in-one speech toolkit.” [Online]. Available:
http://arxiv.org/abs/2205.12007
[13] C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy, and D. Suendermann-Oeft, “Comparing open-
source speech recognition toolkits.” [Online]. Available: http://suendermann.com/su/pdf/oasis2014.pdf
[14] Google. Now you can speak to google mobile app on your iPhone. [Online]. Available:
https://googleblog.blogspot.com/2008/11/now-you-can-speak-to-google-mobile-app.html
[15] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu,
R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao,
and A. Gruenstein, “Streaming end-to-end speech recognition for mobile devices.” [Online]. Available:
http://arxiv.org/abs/1811.06621
[16] TestingXperts. Performance testing – a complete guide. [Online]. Available: https://www.testingxperts.
com/blog/performance-testing-guide/
[17] M. Silfverberg, “Historical overview of consumer text entry technologies,” in Text Entry Systems. Else-
vier, pp. 3–25. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/B9780123735911500012
[18] T. Kimura, T. Nose, S. Hirooka, Y. Chiba, and A. Ito, "Comparison of speech recognition performance
between kaldi and google cloud speech API," in Recent Advances in Intelligent Information Hiding and
Multimedia Signal Processing, J.-S. Pan, A. Ito, P.-W. Tsai, and L. C. Jain, Eds. Springer International
Publishing, vol. 110, pp. 109–115, Series: Smart Innovation, Systems and Technologies. [Online].
Available: http://link.springer.com/10.1007/978-3-030-03748-2_13
[19] S. F. Chen, D. Beeferman, and R. Rosenfeld, "Evaluation metrics for language models," Carnegie
Mellon University.
[20] O. Plátek, “Speech recognition using kaldi.”
[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public
domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 5206–5210. [Online]. Available: http://ieeexplore.ieee.org/document/7178964/
[22] S. Tomar, "Converting video formats with FFmpeg," vol. 2006, no. 146, p. 10, Belltown Media,
Houston, TX.
BIOGRAPHIES OF AUTHORS
Abraham K.S. Lenson studied Information Systems at Atma Jaya Catholic University of Indone-
sia. In 2022, he earned a national scholarship and enrolled as an exchange student at the University of
Galway, Ireland. His technical interests include software development, Linux system management,
automation, and reverse engineering. He can be contacted at email: [email protected].