
Received: 23 December 2024 | Revised: 22 January 2025 | Accepted: 2 February 2025 | Published online: 6 February 2025

International Journal of Advance in Applied Science Research, Volume 4, Issue 2, 2025


https://h-tsp.com/

Modeling of Speech Recognition Based on Deep Learning

Min Zhang*

School of Computer Science, Xianyang Normal University, Xianyang, Shaanxi 712000, China
*Author to whom correspondence should be addressed.

Abstract: As technology continues to advance, speech recognition applications are becoming increasingly
pervasive, and intelligent speech recognition is correspondingly important. This article examines the working
mechanism and classification of speech recognition systems and outlines the design of the system's
development environment and framework. It charts the course from collecting speech datasets, through
preprocessing and feature extraction, to constructing the acoustic and language models for deep learning-based
Chinese speech recognition. The resulting system can record speech autonomously or upload pre-recorded
speech to a server for Chinese recognition, and it can also translate the recognized Chinese speech into English.
This work lays a foundation for further in-depth exploration and advancement in speech recognition.

Keywords: Deep learning; Speech recognition; Feature extraction; DFSMN model.

1. Introduction

Speech recognition technology, also known as Automatic Speech Recognition (ASR), has gradually
become closely tied to people's lives. The goal of speech recognition is to convert human speech into a
representation that computers can understand and process. Like text classification and machine
translation, speech recognition is a subfield of natural language processing (NLP) in artificial
intelligence. In the current era of widespread artificial intelligence, intelligent voice assistants such as
Siri, Xiaodu, Xiaobing, Xiaona, and Xiaoai Tongxue are integrating into people's lives. The application
fields of speech recognition technology are very broad, including smart homes, mobile devices,
intelligent customer service, in-car systems, intelligent healthcare, industrial control, and intelligent
toys. Its core is to interact with machines through voice and enable them to complete related tasks.

In recent years, advancements in various fields have led to significant breakthroughs in technology and
research. Long et al. [1] presented a study in September 2024 at the IEEE 7th International Conference
on Information Systems and Computer Aided Education (ICISCAE), enhancing educational content
matching using Transformer models and InfoNCE loss. Their work aimed to improve the accuracy and
efficiency of content matching in educational platforms. Meanwhile, Huang et al. [2] conducted research
on a multi-agency collaboration medical images analysis and classification system based on federated
learning, discussed at the 2024 International Conference on Biomedicine and Intelligent Technology.
This study explored the potential of federated learning in facilitating secure and efficient collaboration

© The Author(s) 2025.


Published by High-Tech Science Press. This is an open access article under the CC BY License (https://creativecommons.org/licenses/by/4.0/).

among medical institutions. Ukey et al. [3] published an article in the World Wide Web journal in 2023,
introducing an efficient continuous kNN join algorithm for dynamic high-dimensional data. Their work
addressed the challenges associated with processing and querying large-scale, high-dimensional
datasets. Chen et al. [4], at the 20th USENIX Symposium on Networked Systems Design and
Implementation (NSDI 23), presented a study on channel-aware 5G RAN slicing with customizable
schedulers, contributing to the advancement of network slicing technologies in 5G networks. Peng et al.
[5] proposed a dual-augmentor framework for domain generalization in 3D human pose estimation at
the IEEE/CVF Conference on Computer Vision and Pattern Recognition in 2024. Their framework aimed
to improve the robustness and generalization ability of 3D human pose estimation models. Yan et al.
[6], in a study published in Sustainability in 2024, examined the impact of CEO power on green
innovation and organizational performance in manufacturing firms, adopting a mediational approach.
Ren et al. [7], in the Alexandria Engineering Journal in 2025, presented an IoT-based system for 3D pose
estimation and motion optimization for athletes, leveraging C3D and OpenPose technologies. Their
work demonstrated the potential of IoT in sports training and performance improvement. Fan et al. [8],
in an arXiv preprint, conducted research on the online update method for retrieval-augmented
generation (RAG) models with incremental learning.

2. Speech Recognition System

2.1 Working mechanism of speech recognition system

The task of speech recognition is to convert speech sequences into text sequences. There are two
conversion approaches: one converts speech directly into text, and the other first converts speech into
phonemes (or pinyin) and then converts those into text. Chinese contains a large number of
homophones, so the same sound can represent different characters, which places heavy demands on the
context of the input speech. Training text output directly must also account for contextual coherence,
which greatly increases the difficulty of training and disperses the training effort, ultimately yielding
low recognition accuracy. The first method is therefore not very feasible. Compared to the first method,
the second allocates the training effort sensibly: one stage learns only to convert speech sequences into
phonemes, and the other learns only to convert phonemes into text. This greatly improves the speech
recognition rate. The speech recognition design in this article is developed around the second
method [2-3].

2.2 Classification of Speech Recognition Systems

Speech recognition technology can be divided into three categories based on the application scenario:
systems that restrict the way users speak, systems that limit the range of words users may use, and
systems that limit who may use them [4-5].

Limiting the way users speak. According to how a speech recognition system constrains users' speaking
styles, systems can be divided into isolated word recognition systems, continuous speech recognition
systems, and spontaneous (improvisational) speech recognition systems. A continuous speech
recognition system is one that uses a medium to large vocabulary but takes subwords as its basic
recognition units. In spontaneous speech, the input is random, so the speech content is also random and
accompanied by many random events such as swallowing, stuttering, repetition, hesitation, coughing,
and wheezing; these characteristics make spontaneous speech recognition challenging.

Limiting the range of words users use. By vocabulary size, systems can be divided into four types: small
vocabulary, medium vocabulary, large vocabulary, and unlimited vocabulary.


2.3 Main issues of speech recognition

(1) Human factors:

Different people differ in speaking style, accent, speed, and volume, and the same person's speech also
changes with emotional state and physical condition. An angry speaker talks faster and at a higher pitch,
while a sick speaker talks more slowly and quietly. Each person's speaking style also changes over time,
all of which can reduce the speech recognition rate.

(2) Environmental noise:

In actual recording scenarios, there are often different environmental noises such as car horns, wind
and rain, white noise, and the sound of others speaking. When these noises are recorded into the audio,
they have a serious impact on speech recognition, leading to a decrease in recognition rate.

(3) Hardware factors:

Different recording devices have different recording performance and recording methods, so the
sampling frequency and signal characteristics of the collected audio also differ, which affects the correct
recognition of speech.

(4) The understanding of semantics by speech recognition systems:

Human speech contains varied vocabulary and homophones, and the same pronunciation can
correspond to different words in different contexts. It is therefore necessary to establish rules for
understanding semantics.

3. Design Of Chinese Speech Recognition Framework Based On Deep Learning

The deep learning-based Chinese speech recognition architecture is divided into two parts: acoustic
model training and model use. Training the acoustic model first requires collecting a dataset, then
preprocessing it, and then extracting feature vectors from the speech data one by one as input to the
acoustic model. The parameters of the neural network are computed with the error backpropagation
(BP) algorithm, finally yielding a model with low acoustic loss. Once trained, the acoustic model accepts
user-supplied speech data; it produces a pinyin recognition result and passes that result to the language
model. The language model computes the most likely word combination from simple word frequency
statistics, queries the word frequency dictionary, converts the pinyin into Chinese characters, and finally
outputs the full text. The architecture of the Chinese speech recognition system is shown in Figure 1.


Figure 1: Framework structure of Chinese speech recognition system
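To make the flow in Figure 1 concrete, the sketch below traces one utterance through the two stages in Python. Every name here is an illustrative assumption rather than the paper's actual API; the paper's own code lives in the files described in Section 5.3.

```python
# Hypothetical end-to-end flow of the two-stage pipeline; all callables are
# passed in as parameters because the paper's actual interfaces are not given.
def recognize(wav_path, extract_features, acoustic_model, language_model):
    feats = extract_features(wav_path)     # feature vectors (Section 4.3)
    pinyin = acoustic_model(feats)         # stage 1: speech -> pinyin, e.g. ["ni3", "hao3"]
    return language_model(pinyin)          # stage 2: pinyin -> characters, e.g. "你好"
```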

4. Collection And Processing Of Voice Data

4.1 Collection of Voice Data

All speech discussed in this article, including the speech datasets, device-recorded audio, and
self-uploaded audio, has a sampling rate of 16 kHz. Collecting speech datasets is the first step in speech
recognition, and many commercial speech recognition datasets are not open to the public. However,
some publicly available high-quality datasets can be used for training. Common Chinese speech
datasets include THCHS-30, AISHELL, Magicdata, Primewords Chinese Corpus Set 1,
Aidatatang_200zh, and ST-CMDS; together these six datasets total approximately 1385 hours. Due to
limited computing resources, the Chinese speech recognition system uses three of them, THCHS-30,
ST-CMDS, and Primewords, to train the acoustic model.

4.2 Preprocessing of Voice Data

Speech data preprocessing mainly involves dividing the dataset and calibrating label data. While
preparing the voice data, label data must be calibrated for it. Each training result is compared with the
correct result, and the backpropagation algorithm propagates the error of the loss function from the
output layer through the hidden layers back to the input layer, layer by layer. The error is distributed to
the units of each layer, thereby updating the weight values W and bias values b. When training the
acoustic model, the input speech data is accompanied simply by txt text containing the labels: the
directory of a speech file corresponds to its pinyin sequence and its Chinese text sequence. In the data
tags, one piece of data references two txt tag texts, so there are twelve txt tag texts in total. The data tags
are shown in Figure 2.


Figure 2: Sample of Original Label
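The layer-by-layer propagation of error and the update of W and b described above can be made concrete with a toy example. The sketch below performs gradient-descent updates for a single linear layer under a mean squared error loss; the data, shapes, and learning rate are arbitrary stand-ins, not the paper's training setup.

```python
# Minimal numpy illustration of backpropagation updates for one linear layer.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 samples, 3 input features
y = rng.normal(size=(4, 2))        # 4 targets, 2 outputs
W = rng.normal(size=(3, 2)) * 0.1  # weight value W
b = np.zeros(2)                    # bias value b
lr = 0.1

for step in range(100):
    pred = x @ W + b               # forward pass
    err = pred - y                 # output-layer error
    dW = x.T @ err / len(x)        # error distributed to each weight
    db = err.mean(axis=0)
    W -= lr * dW                   # update W ...
    b -= lr * db                   # ... and b
```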


The article's improved calibration label format takes the specific form
wav_data_path \t pinyin_list \t hanzi_list, as shown in Figure 3. Note that the pinyin characters in the
tags carry tone marks, whereas the acoustic model actually outputs a series of serial numbers. A JSON
dictionary, the model/model_language/pinyin_dict.json file, can be created to map these serial numbers
to letter strings with a trailing tone number, such as na4 or ni3, instead of pinyin characters with
embedded tone marks.

Figure 3: Sample Label in the Text
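A minimal sketch of reading one such tab-separated label line, and of mapping the acoustic model's output serial numbers back to toned pinyin. The sample line and the tiny dictionary are illustrative stand-ins for the real label files and pinyin_dict.json.

```python
import json

# Parse one label line of the form: wav_data_path \t pinyin_list \t hanzi_list
line = "data/A2_0.wav\tlv4 shi4\t绿 是"                  # illustrative sample
wav_path, pinyin_str, hanzi_str = line.rstrip("\n").split("\t")
pinyin_list = pinyin_str.split()                        # ["lv4", "shi4"]

# pinyin_dict.json maps the model's output serial numbers to toned pinyin;
# this two-entry dictionary is a stand-in for the real file.
pinyin_dict = json.loads('{"0": "lv4", "1": "shi4"}')
serial_numbers = [0, 1]                                 # acoustic model output
print([pinyin_dict[str(i)] for i in serial_numbers])    # -> ['lv4', 'shi4']
```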


Each line of label text represents the label for one piece of voice data. There are nine label files in total:
thchs_train.txt, thchs_dev.txt, and thchs_test.txt for THCHS-30; stcmd_train.txt, stcmd_dev.txt, and
stcmd_test.txt for ST-CMDS; and prime_train.txt, prime_dev.txt, and prime_test.txt for Primewords.
Each txt file contains the corresponding number of data labels. The WAV speech dataset files and label
texts are placed together, and during acoustic model training the root directory of the dataset, datasets,
can be passed in directly.

4.3 Speech data feature processing

When training on any kind of content, the features of that content must be extracted, and speech
recognition is no exception. The speech features are extracted first, and during training the neural
network judges and classifies based on these features. In terms of feature information, Fbank features
retain more than MFCC: MFCC adds further steps such as the Discrete Cosine Transform (DCT), which
requires more computation and actually discards speech information, causing a significant loss of
sound detail. This Chinese speech recognition system therefore uses the Fbank feature extraction
method.
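As a sketch, log-Mel filterbank (Fbank) extraction can be done with librosa; the paper does not name its feature library, so librosa and the parameter values below (80 Mel bands, 25 ms windows, 10 ms hops) are assumptions rather than the paper's settings.

```python
import librosa
import numpy as np

def extract_fbank(wav_path, sr=16000, n_mels=80, frame_ms=25, hop_ms=10):
    """Return log-Mel filterbank (Fbank) features, shape (frames, n_mels)."""
    y, _ = librosa.load(wav_path, sr=sr)       # load and resample to 16 kHz
    n_fft = int(sr * frame_ms / 1000)          # 25 ms window -> 400 samples
    hop = int(sr * hop_ms / 1000)              # 10 ms hop -> 160 samples
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return np.log(mel + 1e-6).T                # log compression, time-major
```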

5. Modeling Of Chinese Speech Recognition Based On Deep Learning

5.1 Building Acoustic Models

Acoustic models are used in automatic speech recognition systems to represent the relationship between
audio signals and the phonemes or other language units that make up speech. Modern speech
recognition systems use an acoustic model together with a language model to represent the statistical
properties of speech. The acoustic model captures the functional relationship between the processed
speech features and the spoken language; the language model then yields the most likely word sequence
for the given audio clip. Building the acoustic model is the most critical part of the entire speech
recognition system, and the quality of a speech recognition system largely depends on the quality of its
acoustic model. Most modern speech recognition systems operate on small pieces of audio, called
frames, each approximately 10 ms long. The raw audio signal of each frame can be transformed by
feature extraction methods such as mel-frequency cepstral analysis; the resulting coefficients, commonly
called Mel Frequency Cepstral Coefficients (MFCCs), serve as inputs to the acoustic model along with
other features. The audio for an acoustic model can use different sampling rates, but the audio used to
train the model should ideally share the same sampling rate and bit depth as the speech to be recognized
in order to achieve the best recognition performance.
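The paper's cnn_dfsmn_ctc acoustic model (named in Section 5.3) is not listed in full, so the sketch below substitutes a deliberately small CNN trained with CTC loss in PyTorch. The layer sizes and the toned-pinyin vocabulary size are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy CNN acoustic model mapping Fbank frames to per-frame pinyin scores."""
    def __init__(self, n_mels=80, n_pinyin=1424):   # vocab size is an assumption
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * n_mels, n_pinyin)

    def forward(self, feats):                  # feats: (batch, frames, n_mels)
        x = self.conv(feats.unsqueeze(1))      # -> (batch, 32, frames, n_mels)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(x).log_softmax(-1)    # per-frame pinyin log-probs

model = TinyAcousticModel()
log_probs = model(torch.randn(2, 100, 80))     # 2 utterances, 100 frames each
ctc_loss = nn.CTCLoss(blank=0)                 # CTC training, per the folder name
```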


5.2 Building Language Models

To build a statistical language model, one must first collect a sufficiently large body of word frequency
statistics, covering single-character words, two-character words, and so on. Using probability and
statistics, the language model takes a pinyin sequence as input and produces the Chinese character
sequence with the highest probability of occurrence, output as the most plausible sentence. In a
statistical language model, each word's appearance is considered to depend only on the preceding
words. Usually, conditioning on the previous one or two words already yields sufficiently high
accuracy; these are called unigram and bigram statistical language models, respectively. Trigram,
four-gram, and higher-order models are considered only in rare cases, because the higher the order, the
higher the computational time complexity: on long pinyin texts, an ordinary computer struggles to
complete the calculation, incurring unavoidable time costs. For this language model we collected
unigram and bigram word frequency dictionaries, with 6880 and 568647 entries, respectively.
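The two frequency dictionaries turn directly into bigram probabilities by dividing pair counts by single-character counts. The counts below are toy stand-ins for the paper's language-word1.txt and language-word2.txt data.

```python
# Toy stand-ins for the language-word1.txt / language-word2.txt counts.
word1 = {"你": 5000, "尼": 300}                # single-character counts
word2 = {("你", "好"): 1200, ("尼", "好"): 4}   # character-pair counts

def p_bigram(prev, cur, floor=1e-6):
    """P(cur | prev) = count(prev, cur) / count(prev), floored when unseen."""
    return max(word2.get((prev, cur), 0) / word1.get(prev, 1), floor)

print(p_bigram("你", "好"))                     # -> 0.24
```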

The conversion from pinyin to text is achieved with a Markov chain, implemented with a dynamic
programming algorithm similar to shortest-path search. Matching Chinese characters to pinyin can be
viewed as a correspondence between homophones and pinyin, with matching performed from left to
right, as shown in Figure 4.

Figure 4: Directed Pinyin to Chinese Character Conversion Diagram
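A minimal dynamic programming (Viterbi-style) sketch of the left-to-right matching depicted in Figure 4. The candidate characters and probabilities are toy values; a real run would draw them from the unigram and bigram frequency dictionaries described above.

```python
# Toy candidate sets per pinyin and toy bigram transition probabilities.
candidates = {"ni": {"你": 0.9, "尼": 0.1}, "hao": {"好": 0.8, "号": 0.2}}
bigram = {("你", "好"): 0.9, ("你", "号"): 0.05,
          ("尼", "好"): 0.02, ("尼", "号"): 0.01}

def pinyin_to_hanzi(pinyins, floor=1e-6):
    """Return the most probable character sequence for a pinyin sequence."""
    # paths maps each candidate final character to (score, best sequence)
    paths = {h: (p, h) for h, p in candidates[pinyins[0]].items()}
    for py in pinyins[1:]:
        new_paths = {}
        for h, p_h in candidates[py].items():
            # pick the best predecessor under the bigram transition
            score, seq = max(
                (ps * bigram.get((prev, h), floor) * p_h, pseq + h)
                for prev, (ps, pseq) in paths.items())
            new_paths[h] = (score, seq)
        paths = new_paths
    return max(paths.values())[1]

print(pinyin_to_hanzi(["ni", "hao"]))  # -> 你好
```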

5.3 Instructions for Using Training Model Code Files


The folder "modelCode" for training models contains the following files:

cnn_dfsmn_ctc: Stores the trained acoustic model.

datasets: Stores the voice datasets and label files.

model: Contains the model_language folder and the acoustic model scripts AcousticVNet.py and
LanguageVNet.py. The model_language folder holds the text files required by the acoustic and language
models, including the pinyin sequence dictionary pinyin_dict.json, the single-word frequency text
language-word1.txt, the two-word frequency text language-word2.txt, and the pinyin dictionary
dict.txt.

plain.py: Contains functions for drawing time-domain, frequency-domain, and spectrogram graphs.

train_and_test.py: Contains functions for training and loading acoustic models; speech recognition
testing can be done by calling the load-acoustic-model function.

wav_speech_recorder.py: Contains functions for recording WAV speech.

6. Conclusion

The DFSMN framework used in this article can model dependencies on both the preceding and
following context of a sequence, which improves the model's recognition rate. The paper first introduces
the speech recognition system, explaining its working mechanism and classification, and then designs
the system development environment and overall framework. Finally, it details the steps and methods
of the design process: collecting speech datasets, preprocessing speech data, extracting speech features,
building the acoustic model, and building the language model. From an application perspective, the
system can record voice on the web or upload voice to the server for Chinese recognition, and it supports
translating the recognized Chinese into English. From a research perspective, although speech
recognition technology involves complex disciplines and techniques, the overall architecture, spanning
dataset collection, feature extraction, acoustic model framework selection, neural network design, and
language model construction, is sound and lays a foundation for further in-depth research.

References

[1] Long, Y., Gu, D., Li, X., Lu, P., & Cao, J. (2024, September). Enhancing Educational Content
Matching Using Transformer Models and InfoNCE Loss. In 2024 IEEE 7th International Conference
on Information Systems and Computer Aided Education (ICISCAE) (pp. 11-15). IEEE.
[2] Huang, S., Diao, S., Wan, Y., & Song, C. (2024, August). Research on multi-agency collaboration
medical images analysis and classification system based on federated learning. In Proceedings of
the 2024 International Conference on Biomedicine and Intelligent Technology (pp. 40-44).
[3] Ukey, N., Zhang, G., Yang, Z., Li, B., Li, W., & Zhang, W. (2023). Efficient continuous kNN join
over dynamic high-dimensional data. World Wide Web, 26(6), 3759-3794.
[4] Chen, Y., Yao, R., Hassanieh, H., & Mittal, R. (2023). Channel-aware 5G RAN slicing with
customizable schedulers. In 20th USENIX Symposium on Networked Systems Design and
Implementation (NSDI 23) (pp. 1767-1782).


[5] Peng, Q., Zheng, C., & Chen, C. (2024). A Dual-Augmentor Framework for Domain Generalization
in 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (pp. 2240-2249).
[6] Yan, Q., Yan, J., Zhang, D., Bi, S., Tian, Y., Mubeen, R., & Abbas, J. (2024). Does CEO power affect
manufacturing firms’ green innovation and organizational performance? A mediational approach.
Sustainability, 16(14), 6015.
[7] Ren, F., Ren, C., & Lyu, T. (2025). IoT-based 3D pose estimation and motion optimization for athletes:
Application of C3D and OpenPose. Alexandria Engineering Journal, 115, 210-221.
[8] Fan, Y., Wang, Y., Liu, L., Tang, X., Sun, N., & Yu, Z. (2025). Research on the Online Update Method
for Retrieval-Augmented Generation (RAG) Model with Incremental Learning. arXiv preprint
arXiv:2501.07063.

