Speech Recognition of Accented Mandarin Based on Improved Conformer
Abstract
1. Introduction
- (1) Embedding an SE block into the Conformer model enables the network to recalibrate the extracted channel features (a minimal sketch of the mechanism follows this list);
- (2) Without changing the parallelism of the Conformer model, temporal information is modeled with a temporal convolutional network (TCN), enhancing the model's acquisition of positional information and reducing the loss of positional information in later layers;
- (3) State-of-the-art performance is achieved on four public datasets, especially in terms of the character error rate.
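Contribution (1) refers to squeeze-and-excitation recalibration [13]. The following is a minimal, illustrative PyTorch sketch of that mechanism; the module name, tensor layout, and the default reduction ratio of 16 (the best value in the sensitivity sweep of Section 5.2) are our assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation over channels (Hu et al. [13]).

    Illustrative sketch: learns a per-channel gate in [0, 1] and
    rescales each channel by it; reduction=16 matches the best
    ratio reported in Section 5.2.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g., the output of the
        # Conformer convolution module with frames on the last axis.
        s = x.mean(dim=-1)          # squeeze: global average over time
        w = self.fc(s)              # excitation: per-channel gates
        return x * w.unsqueeze(-1)  # recalibrate each channel
```

A gate near 1 passes a channel through unchanged while a gate near 0 suppresses it, which is the channel recalibration referred to in contribution (1).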
2. Related Works
2.1. Temporal Convolutional Network
2.2. Squeeze-Excitation Block
3. Materials and Methods
3.1. Dataset
3.2. Experimental Environment
3.3. Model Construction and Speech Recognition
3.3.1. SE-Conv
3.3.2. MHSA-TCN
3.4. Audio Augmentation and Loss Function
4. Experiments
4.1. Evaluation Metrics and Hyperparameter Settings
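The experiments are reported in terms of character error rate (CER) and sentence error rate (SER). For reference, here is a small Python sketch of the standard definitions (not code from the paper): CER is the character-level edit distance between hypothesis and reference divided by the total reference length, and SER is the fraction of utterances containing at least one error.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (standard dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer_ser(refs: list[str], hyps: list[str]) -> tuple[float, float]:
    """CER: summed edit distance over summed reference length.
    SER: fraction of sentences with any error."""
    dist = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    wrong = sum(r != h for r, h in zip(refs, hyps))
    return dist / chars, wrong / len(refs)
```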
4.2. Baseline Preparation
- GMM-HMM: Phoneme recognition model based on a hidden Markov model (HMM) and a Gaussian mixture model (GMM).
- TDNN [39]: Time-delay neural network, whose phoneme-recognition design enables the network to discover acoustic-phonetic features and the temporal relationships between them, independently of their position in time.
- DFSMN-T [40]: Lightweight speech recognition system consisting of a DFSMN acoustic model and a Transformer language model, with fast decoding speed.
- CTC/Attention [41]: End-to-end system based on CTC/attention multi-task learning that uses a hybrid of syllables, Chinese characters, and subwords as modeling units (a sketch of the weighted objective follows this list).
- TCN-Transformer [42]: Transformer-based fusion of temporal convolutional networks and connectionist temporal classification.
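The CTC/attention baseline [41], like hybrid training in WeNet-style recipes [19], interpolates a CTC loss and an attention (cross-entropy) loss with a weight; the CTC-weight sweep tabulated later in this article peaks at 0.4. Below is a generic sketch of the weighted objective, with tensor shapes and the padding index as our assumptions rather than the authors' training code:

```python
import torch.nn as nn

# Hypothetical shapes: `log_probs` is the log-softmax output of a CTC
# head, shape (T, N, V); `dec_logits` are decoder logits, shape
# (N, L, V). The padding index -1 is assumed for illustration.
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)

def hybrid_loss(log_probs, input_lens, targets, target_lens,
                dec_logits, dec_targets, ctc_weight=0.4):
    """L = w * L_CTC + (1 - w) * L_attention: the multi-task objective
    whose weight w is swept as "CTC Weight" in the results tables."""
    l_ctc = ctc_loss_fn(log_probs, targets, input_lens, target_lens)
    # CrossEntropyLoss expects class logits on dim 1, i.e., (N, V, L).
    l_att = att_loss_fn(dec_logits.transpose(1, 2), dec_targets)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att
```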
5. Results
5.1. Recognition Results of the Model
5.2. Parameter Sensitivity
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Trans. Signal Inf. Process. 2022, 11, e8.
2. Yi, J.; Wen, Z.; Tao, J.; Ni, H.; Liu, B. CTC regularized model adaptation for improving LSTM RNN based multi-accent Mandarin speech recognition. J. Signal Process. Syst. 2018, 90, 985–997.
3. Wang, Z.; Schultz, T.; Waibel, A. Comparison of acoustic model adaptation techniques on non-native speech. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), Hong Kong, China, 6–10 April 2003; p. I.
4. Zheng, Y.; Sproat, R.; Gu, L.; Shafran, I.; Zhou, H.; Su, Y.; Jurafsky, D.; Starr, R.; Yoon, S.-Y. Accent detection and speech recognition for Shanghai-accented Mandarin. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005.
5. Chen, M.; Yang, Z.; Liang, J.; Li, Y.; Liu, W. Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
6. Senior, A.; Sak, H.; Shafran, I. Context dependent phone models for LSTM RNN acoustic modelling. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 4585–4589.
7. Yi, J.; Ni, H.; Wen, Z.; Tao, J. Improving BLSTM RNN based Mandarin speech recognition using accent dependent bottleneck features. In Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, Republic of Korea, 13–16 December 2016; pp. 1–5.
8. Fung, P.; Liu, Y. Effects and modeling of phonetic and acoustic confusions in accented speech. J. Acoust. Soc. Am. 2005, 118, 3279–3293.
9. Gao, Y.; Parcollet, T.; Zaiem, S.; Fernandez-Marques, J.; de Gusmao, P.P.; Beutel, D.J.; Lane, N.D. End-to-end speech recognition from federated acoustic models. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7227–7231.
10. Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–12 June 2021; pp. 21–25.
11. Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100.
12. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271.
13. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
14. Aidatatang-200zh. Beijing DataTang Technology Co., Ltd. 2022. Available online: https://www.datatang.com/opensource (accessed on 8 September 2022).
15. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Republic of Korea, 1–3 November 2017; pp. 1–5.
16. CASIA-1: Southern Accent Speech Corpus. Chinese Academy of Sciences. 2003. Available online: http://www.chineseldc.org/doc/CLDC-SPC-2004-016/intro.htm (accessed on 1 April 2003).
17. CASIA-2: Northern Accent Speech Corpus. Chinese Academy of Sciences. 2003. Available online: http://www.chineseldc.org/doc/CLDC-SPC-2004-015/intro.htm (accessed on 1 April 2003).
18. Sarangi, S.; Sahidullah, M.; Saha, G. Optimization of data-driven filterbank for automatic speaker verification. Digit. Signal Process. 2020, 104, 102795.
19. Yao, Z.; Wu, D.; Wang, X.; Zhang, B.; Yu, F.; Yang, C.; Peng, Z.; Chen, X.; Xie, L.; Lei, X. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv 2021, arXiv:2102.01547.
20. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
21. Wu, Z.; Liu, Z.; Lin, J.; Lin, Y.; Han, S. Lite transformer with long-short range attention. arXiv 2020, arXiv:2004.11886.
22. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
23. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
24. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086.
25. Luong, M.-T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; Volume 30.
27. Wang, W.; Sun, Y.; Qi, Q.; Meng, X. Text sentiment classification model based on BiGRU-attention neural network. Appl. Res. Comput. 2019, 36, 3558–3564.
28. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
30. Liu, R.; Yuan, Z.; Liu, T.; Xiong, Z. End-to-end lane shape prediction with transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3694–3702.
31. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
32. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2019; Volume 32.
33. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
34. Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio augmentation for speech recognition. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
35. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779.
36. SoX, Audio Manipulation Tool. 2015. Available online: https://sox.sourceforge.net/ (accessed on 25 March 2015).
37. Lee, J.; Watanabe, S. Intermediate loss regularization for CTC-based speech recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–12 June 2021; pp. 6224–6228.
38. Romero-Fresco, P. Respeaking: Subtitling through speech recognition. In The Routledge Handbook of Audiovisual Translation; Routledge: Abingdon, UK, 2018; pp. 96–113.
39. Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 328–339.
40. Hu, Z.; Jian, F.; Tang, S.; Ming, Z.; Jiang, B. DFSMN-T: Mandarin speech recognition with language model transformer. Comput. Eng. Appl. 2022, 58, 187–194.
41. Chen, S.; Hu, X.; Li, S.; Xu, X. An investigation of using hybrid modeling units for improving end-to-end speech recognition system. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–12 June 2021; pp. 6743–6747.
42. Xie, X.; Chen, G.; Sun, J.; Chen, Q. TCN-Transformer-CTC for end-to-end speech recognition. Appl. Res. Comput. 2022, 39, 699–703.
| Dataset | Duration (h) | Male Speakers | Female Speakers | Speaker Regions | Sentences |
|---|---|---|---|---|---|
| Aidatatang [14] | 200 | 299 | 301 | 34 | 150,000 |
| Aishell-1 [15] | 178 | 188 | 212 | 12 | 120,098 |
| CASIA-1 [16] | 80 | 205 | 202 | 8 | 100,000 |
| CASIA-2 [17] | 55 | 150 | 150 | 9 | 75,000 |
| CTC Weight | SER (%) | CER (%) |
|---|---|---|
| 0.1 | 39.9 | 6.9 |
| 0.2 | 41.6 | 7.5 |
| 0.3 | 39.7 | 6.8 |
| 0.4 | 38.6 | 6.2 |
| 0.5 | 39.1 | 6.6 |
| 0.6 | 41.1 | 7.0 |
| 0.7 | 40.9 | 6.7 |
| 0.8 | 42.1 | 7.6 |
| 0.9 | 42.5 | 7.7 |
| Dataset | Metric | GMM-HMM | TDNN | DFSMN-T | CTC/Attention | TCN-Transformer | SE-Conformer-TCN |
|---|---|---|---|---|---|---|---|
| Aidatatang_200zh | CER (%) | 12.3 | 7.4 | 7.8 | 6.3 | 6.2 | 4.9 |
| Aidatatang_200zh | SER (%) | 43.2 | 39.6 | 38.7 | 37.1 | 37.5 | 36.5 |
| CASIA-1 | CER (%) | 15.8 | 8.1 | 7.9 | 6.3 | 6.8 | 5.6 |
| CASIA-1 | SER (%) | 47.1 | 39.9 | 40.1 | 37.3 | 38.2 | 37.3 |
| CASIA-2 | CER (%) | 16.6 | 8.2 | 8.4 | 7.2 | 6.9 | 7.1 |
| CASIA-2 | SER (%) | 46.7 | 40.3 | 40.7 | 38.3 | 37.8 | 37.2 |
| Dataset | Model | SER (%) | CER (%) |
|---|---|---|---|
| Aishell-1 | Conformer (baseline) | 37.4 | 5.7 |
| Aishell-1 | SE-Conformer-TCN | 35.3 | 4.5 |
| Reduction Ratio | SER (%) | CER (%) |
|---|---|---|
| 4 | 37.9 | 5.7 |
| 8 | 37.6 | 5.8 |
| 16 | 37.4 | 5.6 |
| 32 | 38.3 | 6.0 |
| Parameter | Value | SER (%) | CER (%) |
|---|---|---|---|
| TCN Unit | 1 | 40.7 | 6.3 |
| TCN Unit | 2 | 38.5 | 6.0 |
| TCN Unit | 3 | 38.3 | 6.0 |
| TCN Unit | 4 | 39.1 | 6.2 |
| Filter | 128 | 40.6 | 6.1 |
| Filter | 256 | 38.3 | 6.0 |
| Kernel | 3 | 38.3 | 6.0 |
| Kernel | 5 | 39.6 | 6.1 |
| Kernel | 7 | 40.3 | 6.1 |
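The TCN hyperparameters swept above (number of TCN units, filter count, kernel size) are the knobs of the dilated causal residual block of Bai et al. [12]. The following is a hedged illustration of what they control, not the authors' exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNBlock(nn.Module):
    """Dilated causal residual block in the style of Bai et al. [12].

    `channels` and `kernel` mirror the Filter and Kernel rows swept
    above; stacking several blocks (the TCN Unit row) with dilations
    1, 2, 4, ... grows the receptive field exponentially.
    """
    def __init__(self, channels: int = 256, kernel: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel - 1) * dilation  # left-pad only: no future leakage
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        y = self.conv(F.pad(x, (self.pad, 0)))  # causal convolution
        return self.relu(y) + x                 # residual identity path
```

Stacking three such blocks with 256 filters and kernel size 3 (dilations 1, 2, 4) corresponds to the best row of each sweep in the table above.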
Dataset | Model | SER (%) | CER (%) |
---|---|---|---|
Aidatatang_200zh | Conformer(baseline) | 38.6 | 6.2 |
SE-Conformer | 37.4 | 5.6 | |
Conformer-TCN | 37.9 | 6.0 | |
SE-Conformer-TCN | 36.5 | 4.9 |