Neural Computing and Applications (2024) 36:2371–2383
https://doi.org/10.1007/s00521-023-08959-2

S.I.: MACHINE LEARNING AND BIG DATA ANALYTICS FOR IOT SECURITY AND PRIVACY (SPIOT 2022)

Intelligent speech recognition algorithm in multimedia visual interaction via BiLSTM and attention mechanism

Yican Feng, Advanced Film School, Chung-Ang University, Seoul 156756, Korea ([email protected])

Received: 17 February 2023 / Accepted: 15 August 2023 / Published online: 29 November 2023
© The Author(s) 2023

Abstract
With the rapid development of information technology in modern society, the application of multimedia integration platforms has become increasingly extensive, and speech recognition has become an important subject in multimedia visual interaction. The accuracy of speech recognition depends on a number of elements, two of which are the acoustic characteristics of speech and the speech recognition model. Speech data is complex and changeable. Most methods extract only a single type of feature to represent the speech signal, and such a single feature cannot express the hidden information. In addition, an excellent speech recognition model can better learn the informative characteristics of speech and thereby improve performance. This work proposes a new method for speech recognition in multimedia visual interaction. First of all, considering the problem that a single feature cannot fully represent complex speech information, this paper proposes three kinds of feature fusion structures that extract speech information from different angles, yielding three different fusion features based on the low-level features and a higher-level sparse representation. Secondly, this work relies on the strong learning ability of the neural network and the weight distribution mechanism of the attention model: the fusion features are combined with a bidirectional long short-term memory (BiLSTM) network with attention. The extracted fusion features contain more speech information with strong discrimination, and increasing the attention weights of informative features further strengthens their influence on the predicted value and improves performance. Finally, this paper carries out systematic experiments on the proposed method, and the results verify its feasibility.

Keywords: Speech recognition · Multimedia visual interaction · BiLSTM · Attention

1 Introduction

Speech is the most natural way of interaction for human communication. People get a lot of information from the outside world through speech interaction every day. According to statistics, about 75% of communication in human daily life is completed by voice [1, 2]. At the beginning of the invention of the computer, human–computer interaction could only be realized through keyboard, mouse and buttons. With the development of smart tablets, smart phones and other devices, human–computer interaction has developed from keyboard input to touch input [3, 4]. However, these human–computer interaction methods have gradually failed to meet users' needs, and the development of speech recognition technology can make people and machines interact better [5].

Letting the machine understand the world is the initial purpose of studying speech recognition. In the process of human and machine interaction, the first step is to let the machine hear the human voice. The second step is to let the machine translate the sound into words to achieve the purpose of accurate listening. The third step is to let the machine accurately understand the emotion expressed by the human voice to achieve the purpose of understanding [6, 7]. Speech recognition technology includes near-field speech technology and far-field speech technology, which are bounded by 30 cm. Due to the limitations of near-field
speech, the research focus of speech recognition is on far-field speech interaction technology. This technology includes voice wake-up, recognition and understanding [8, 9]. With the development of artificial intelligence, speech recognition technology is also improving. Deep learning has been used for modeling tasks in speech recognition and has achieved good results. These technologies are also applied to intelligent products [10].

Using voice to realize multimedia human–computer interaction is a key research direction of artificial intelligence. This mainly includes automatic speech recognition, natural language processing, speech synthesis and other technologies. Automatic speech recognition technology, also known as speech recognition technology, is responsible for the conversion of speech to text [11, 12]. This is an interdisciplinary task, involving multiple disciplines such as signal processing, pattern recognition, physiology, psychology, computer science and linguistics. Automatic speech recognition technology is the key first step to realize multimedia human–computer interaction using speech and is the core of this paper. Speech recognition technology is divided into two main development directions according to different computing platforms. One is the large-vocabulary continuous speech recognition system based on the computer platform. The other is the specific control instruction recognition system based on a special speech recognition chip [13, 14]. The processor of the special speech recognition chip is a low-power, low-cost and small-size intelligent chip. Compared with computers, its processing speed and storage capacity are very limited. It is mainly used to identify short voice control commands and some fixed voice data of specific people. The large-vocabulary continuous speech recognition system on a computer platform can adjust the model scale according to the use scenario and has more powerful functions [15].

Although automatic speech recognition has developed rapidly for more than half a century, it still has many application problems in real life [16]. On the one hand, this is caused by the defects of speech recognition itself. Yet, academics have great aspirations for voice recognition, despite the fact that the research in this area is inconsistent. The use of deep learning in automatic voice recognition has made significant strides as we have entered the third stage of human evolution [17, 18]. The recognition ability of machine systems has been substantially improved under ideal settings as a result of the rapid development of deep learning technology, which has sparked a technological revolution in the field of voice recognition. The recognition rate of many advanced technologies can even exceed 95% [19]. Under this technical background, various kinds of smart voice products such as mobile phone voice assistants and smart speakers have rapidly opened the market. At present, intelligent voice has become one of the most important multimedia human–computer interaction modes in the consumer market [20].

This work proposes a new method for speech recognition in multimedia visual interaction. The contributions are:

• First, this work proposes three kinds of feature fusion structures to extract speech information from different angles. This extracts three different fusion features based on the low-level features and higher-level sparse representation.
• Secondly, the fusion feature is combined with BiLSTM with an attention mechanism to extract more discriminative features.
• Finally, this work has carried out systematic experiments on the proposed method, and the experimental results verify the feasibility of this work.

The second section of this paper carries out the corresponding literature analysis. The third section elaborates the proposed method. The fourth section presents related experiments and analysis. The fifth section is the conclusion.

2 Related work

Reference [21] applied GMM-HMM to the speech recognition model, and the speech recognition model based on neural networks also originated at this time. However, due to the influence of factors such as computing power and data at that time, this did not achieve particularly good results. Literature [22] proposed the connectionist temporal classification model. This used a single model to directly model the speech frame sequence, and the one-way model could be applied to the field of streaming speech recognition. Literature [23] proposed LAS based on the attention mechanism. This translated the speech recognition task into a Seq2Seq modeling task, which directly mapped the audio sequence to the corresponding text sequence. With the continuous improvement of computing performance, the neural network framework was also iteratively optimized. In the aspect of speech recognition based on the DNN-HMM framework, there were various structural optimizations for acoustic models. Literature [24] proposed a CTC extended structure, which was a kind of recurrent neural network transducer. This integrated the input sequence and the historical output sequence, which could optimize the acoustic and language models at the same time. This solved the defects of the CTC output independence assumption, and the use of a one-way coding structure could be applied to streaming speech recognition. CTC, LAS and RNN-T were the three main end-to-end speech recognition frameworks. Their basic
assumptions, training objectives, training processes and reasoning processes were different. Literature [25] proposed the time-delay neural network, which applied the convolutional neural network, which was brilliant in the field of image recognition, to speech recognition. It used dilated convolution for selective input, which reduced the amount of computation and increased the field of view of the top input. Literature [26] proposed LSTMP, which was based on the traditional LSTM with an affine layer introduced between the memory block and the output layer. This effectively reduced the number of model parameters while retaining the advantages of LSTM for serialization modeling. Literature [27] proposed the maximum mutual information training criterion to take into account the correct path and the confused paths at the same time. This increased the scoring difference between the correct path and other paths, which could improve the correct path score and better match the recognition target.

The end-to-end speech recognition framework was relatively simple. The unified optimization goal avoided the cumulative error problem caused by multiple modules and did not require forced alignment. At the same time, single words could be used as the modeling unit, without an additional pronunciation dictionary, and the model had good extensibility. Literature [28] proposed an encoder–decoder structure, which was first widely used in the field of machine translation and achieved good experimental results. Literature [29] extended the application of the encoder–decoder structure to the field of speech recognition. Therefore, in addition to CTC, the sequence-to-sequence end-to-end speech recognition framework had also become one of the mainstream methods. It was not necessary to assume the alignment of input and output sequences in advance, and this could learn encoding, decoding and how to align at the same time. Recurrent neural networks needed to calculate the response of each time frame in sequence at training time. Their main defect was that it was easy to lose the historical information of long words. This had disadvantages for speech recognition tasks with a monotonic relationship between audio and text. In addition, the training had to calculate each time frame in time sequence. Literature [30] proposed the Transformer model, which replaced the RNN with a self-attention mechanism to model the global timing signal. At the same time, it could train in parallel and improve the calculation speed of the model. Literature [31] applied the Transformer model to specific Mandarin speech recognition tasks and achieved results beyond the LAS model. Literature [32] combined CTC with the LSTM model and inserted the vector output from the top of the LSTM into the CTC model. The LSTM-CTC model was constructed using the CTC decoding method. The model not only reduced the word error rate, but also shortened the training time of the model. Reference [33] proposed LFR-DFSMN, which added jump connections between memory layers. This alleviated the problem of gradient disappearance when building a deep model structure. Literature [34] proposed a streaming multi-level truncated attention model, which introduced a special multi-level attention mechanism.

3 Speech recognition via BiLSTM and attention mechanism

First, this work proposes three kinds of feature fusion structures to extract speech information from different angles. This extracts three different fusion features based on the low-level features and higher-level sparse representation. Secondly, the fusion feature is combined with BiLSTM with an attention mechanism. The extracted fusion features contain more speech information with strong discrimination.

3.1 Speech sparse representation extraction

When extracting fusion features, it is necessary to extract the corresponding sparse representation features from the underlying features. As a high-level feature, the sparse representation feature has good uniqueness and distinctiveness. In this paper, the KSVD dictionary learning algorithm and the BPDN sparse decomposition algorithm are used to extract sparse features. The KSVD algorithm is improved to solve the problem of slow learning of the dictionary learning algorithm under a large amount of voice data.

In this paper, the KSVD algorithm is used to learn the dictionary. The dictionary learned by this algorithm from the training data can be regarded as a reduced-dimension representation of the training set. Therefore, it can well represent the characteristics of the training data. In this algorithm, first of all, data are randomly selected from the training data as the initial dictionary atoms. Then, the OMP algorithm is used to solve the sparse representation while the dictionary is left unmodified. Then, the SVD algorithm is used to update the dictionary atoms while maintaining the sparsity. As long as there is significant discrepancy between the sample data and the data rebuilt using the created dictionary, the procedure iterates until a minimal error is achieved.

The speech training data chosen in this research at the frame level is substantial, and the system requires frequent iteration. Due to the sluggish acquisition of the sparse representation from the learned dictionary, this paper randomly selects the training dictionary from each voice training sample. This study refines the method of training a dictionary in an effort to address this issue. The enhanced KSVD chooses a specified amount of data from each type of voice data to create a short training dataset that is then
used to train the dictionary's first sample. The dictionary is then used to obtain the sparse representation of all the voice data.

The basic idea of the algorithm is to update all atoms until the overall reconstruction error is minimal. The target function of KSVD is as follows:

J_K = \min_{D,A} \| X - DA \|_F^2    (1)

where X is the training data, D is the initial dictionary, and A is the sparse sample.

The KSVD algorithm is divided into three stages. First is the initialization stage, which generates initial MFCC feature training samples according to the improved method proposed in this paper. These samples contain different phoneme categories. Each category takes several samples from the training data to form a small training sample, so the initial training data covers all phoneme types. Then, some samples are selected from the initial samples as the atoms of the initial dictionary. Next is the sparse coding stage, which fixes the dictionary obtained in the previous step and normalizes each column of atoms in the dictionary. This uses the sparse decomposition algorithm OMP to solve the sparse data. Finally comes the dictionary update stage, in which the dictionary atoms and sparse features need to be updated at the same time. Generally, it is assumed that one optimization variable is fixed, the other variable is optimized, and the variables are updated alternately.
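The following is a minimal sketch of this three-stage loop (initialization, OMP sparse coding, SVD-based atom update) in Python with NumPy and scikit-learn. It is illustrative rather than the paper's exact implementation: the initial atoms are drawn at random from the training columns, whereas the improved KSVD described above would first assemble a small per-phoneme training subset; the function name and parameter values are assumptions.

import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(X, n_atoms, sparsity, n_iter=30, seed=0):
    # X: training matrix of shape (n_features, n_samples), one column per speech frame.
    rng = np.random.default_rng(seed)
    # Initialization stage: pick training columns as initial atoms and normalize them.
    D = X[:, rng.choice(X.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    A = np.zeros((n_atoms, X.shape[1]))
    for _ in range(n_iter):
        # Sparse coding stage: OMP with the dictionary held fixed.
        A = orthogonal_mp(D, X, n_nonzero_coefs=sparsity)
        # Dictionary update stage: rank-1 SVD update of each atom and its coefficients.
        for k in range(n_atoms):
            users = np.flatnonzero(A[k])            # frames that currently use atom k
            if users.size == 0:
                continue
            A[k, users] = 0.0
            E = X[:, users] - D @ A[:, users]       # residual with atom k removed
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                       # new atom direction
            A[k, users] = s[0] * Vt[0]              # matching sparse coefficients
    return D, A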
For solving the sparse representation problem, the OMP algorithm among the greedy algorithms is adopted in document [35]. The flaw of this type of algorithm is that during the iterative process, only a locally optimal solution is guaranteed. At the same time, the sparsity is pre-fixed by the matching pursuit algorithm. Complex speech data is represented with varying degrees of sparsity across distinct speech areas. Consequently, it is not appropriate to apply the same sparsity to the sparse representation of each voice frame when dealing with voice data. In light of these issues, this article employs the basis pursuit denoising algorithm in order to address the sparse representation. The optimization problem of conventional sparse representation is as follows:

A = \arg\min_A \| a_i \|_0    (2)

where a_i is the sparse parameter.

In the actual solution process, the above problem is usually converted into the problem of solving the L1 norm. When noise is considered, the sample signal becomes a superimposed signal. If the BP algorithm, namely the BPDN algorithm [36], is used for noise suppression, the optimization problem is:

A = \arg\min_A \| a_i \|_1    (3)

where a_i is the sparse parameter.

The BPDN algorithm can find the globally optimal solution by minimizing the reconstruction error while obtaining the sparsest solution of the signal. After a learned dictionary is obtained by using the KSVD algorithm, all training samples are processed by the BPDN algorithm to obtain the corresponding sparse representations. Finally, the sparse representation features are sent into the recognition model for classification.
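A compact way to realize this BPDN-style coding step, again only as a sketch under the same assumptions, is scikit-learn's SparseCoder with an L1 (lasso) solver; the alpha value below is an illustrative choice, not a setting reported in the paper.

import numpy as np
from sklearn.decomposition import SparseCoder

def bpdn_codes(D, X, alpha=0.1):
    # D: learned dictionary of shape (n_features, n_atoms); X: signals (n_features, n_samples).
    coder = SparseCoder(dictionary=D.T,                  # SparseCoder expects (n_atoms, n_features)
                        transform_algorithm="lasso_lars",
                        transform_alpha=alpha)           # weight of the L1 penalty
    return coder.transform(X.T).T                        # sparse codes, shape (n_atoms, n_samples)

# Usage: D, _ = ksvd(X_train, n_atoms=512, sparsity=10)
#        A_train = bpdn_codes(D, X_train)   # features passed on to the recognition model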
3.2 Structural design of multi-level feature fusion

This paper proposes several structures of feature fusion, which can extract the features of speech signals from different angles. When one mode's feature is damaged, the other mode's features can help fill in the missing information to enhance the robustness of the signal. Finally, the interaction between the characteristics of different patterns can be considered as an enhancement factor.

For the same segment of voice data, feature extraction is performed from the frequency domain and the cepstrum domain, respectively. First, the CQT spectrum corresponding to the voice data can be obtained by CQT transformation of the initial audio data. The CQT transform is different from the Fourier transform in that its spectrum is nonlinear and the length of the filtering window is variable. This can be changed according to the change of spectral line frequency to obtain better performance. In addition, this paper calculates the Mel cepstrum coefficients of the speech to obtain MFCC features, so as to achieve the extraction of underlying features from different angles. The two extracted low-level features are fused to form a low-level fusion feature. Then, the sparse representation of the fused features is extracted by dictionary learning to obtain the fused sparse features. The fusion structure is shown in Fig. 1.

The specific fusion expression can be expressed as:

F_{ll} = \min_{D_{mc}, a_{ll}^{(i)}} \| x_{mc}^{(i)} - D_{mc} a_{ll}^{(i)} \|_2^2 + \lambda \| a_{ll}^{(i)} \|_1    (4)

where x_{mc} is the fused sample and D_{mc} is the fusion dictionary obtained by dictionary learning after the fusion of the two low-level features.

This method fuses two kinds of low-level features extracted from different angles and then extracts the high-level sparse representation. The advantage of this method is that it combines the information of the two underlying modes before learning together. The sparse representation after learning can capture the relationship between the two modal features. This can also play a complementary role and make up for the lack of single-type features.
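As an illustration of the low-level fusion path in Fig. 1, the sketch below extracts MFCC and CQT features for one utterance and concatenates them frame by frame into x_mc; librosa is assumed as the feature-extraction toolkit (the paper does not name one), and the sampling rate, coefficient count and bin count are illustrative.

import numpy as np
import librosa

def low_level_fusion(wav_path, sr=16000, n_mfcc=13, n_cqt_bins=84):
    # Load the utterance and compute the two low-level representations.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # shape (n_mfcc, T)
    cqt = np.abs(librosa.cqt(y=y, sr=sr, n_bins=n_cqt_bins))      # shape (n_cqt_bins, T')
    # Align the frame counts and stack the two feature matrices -> fused sample x_mc.
    T = min(mfcc.shape[1], cqt.shape[1])
    return np.concatenate([mfcc[:, :T], cqt[:, :T]], axis=0)

A dictionary D_mc learned on these fused frames (for example, with the K-SVD sketch above) would then yield the fused sparse feature a_ll of Eq. (4).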
[Fig. 1 Fusion of low-level feature and low-level feature: the MFCC feature x_m and the CQT feature x_c are fused into x_mc and coded over the dictionary D_mc to give the sparse feature a_ll]

Fusion of a high-level feature with a high-level feature adopts the method of directly sparse coding the underlying features. First of all, after a series of preprocessing steps on the original data of the same voice segment, two different underlying features, MFCC and CQT, are extracted. The method then implements parallel sparse coding after extracting the underlying features. This is to learn a dictionary for the low-level features of each mode and then conduct sparse coding to obtain two different high-level feature representations. Finally, the high-level sparse representations obtained separately are fused in parallel. Figure 2 depicts this modal parallel sparse coding scheme. The scheme learns two dictionaries for the two underlying modal features in parallel. Finally, the two high-level sparse feature representations are fused according to the following structure to obtain new fusion features.

Finally, the sparse representations obtained in parallel are fused according to the above structure to obtain the fused feature representation. The specific expression is as follows:

F_{hhm} = \min_{D_m, a_m^{(i)}} \| x_m^{(i)} - D_m a_m^{(i)} \|_2^2 + \lambda \| a_m^{(i)} \|_1    (5)

F_{hhc} = \min_{D_c, a_c^{(i)}} \| x_c^{(i)} - D_c a_c^{(i)} \|_2^2 + \lambda \| a_c^{(i)} \|_1    (6)

a_{hh} = [a_m, a_c]    (7)

where a_m and a_c are the high-level sparse representations extracted from the two features, respectively.

The method of fusing two high-level features is a simple method of encapsulating features from two modes. This increases the effective information contained in the feature to a certain extent, but completely discards the underlying effective information. It cannot conduct joint training on voice data like the fusion between the bottom layer and the bottom layer. Therefore, it cannot capture the correlation between the two features that may be beneficial to the recognition task. And, the same as the fusion between the bottom layers, the extracted features are the fused higher-level features, which will cause some beneficial information at the bottom layer to be completely discarded.

For the high-level fusion and the low-level fusion, the two feature fusion structures are finally transformed to obtain a fused sparse representation feature, which will be sent into the classification model. However, this completely discards the underlying features, which also discards a lot of important information contained in the underlying features. In view of the problems of the above two structures, this paper has made further improvements on the structures designed previously. This paper designs and adopts a structure that integrates the low-level features and high-level features and extracts features, as shown in Fig. 3.

[Fig. 2 Fusion of high-level feature and high-level feature: x_m and x_c are coded separately over the dictionaries D_m and D_c, and the resulting sparse codes a_m and a_c are concatenated into a_hh]
[Fig. 3 Fusion of low-level feature and high-level feature: the MFCC feature x_m is kept as-is and concatenated with its sparse code a_m obtained over the dictionary D_m, giving a_lh]

This structure first extracts MFCC features directly from the voice data. On the other hand, a sparse high-level feature is obtained by sparse coding after extracting the low-level feature. Finally, the two different modal features are fused through the above fusion structure to obtain a new fusion feature. The sparse representation extraction formula is as follows:

F_{lh} = \min_{D_m, a_m^{(i)}} \| x_m^{(i)} - D_m a_m^{(i)} \|_2^2 + \lambda \| a_m^{(i)} \|_1    (8)

a_{lh} = [x_m, a_m]    (9)

where x_m is the MFCC feature and a_m is the sparse feature.

As a low-level feature, MFCC contains a lot of original information; the more original information, the more complete the representation. It accurately describes the nonlinear frequency characteristics of the human ear. As underlying features, however, these features usually contain redundant information, which will interfere with the expression of effective information. The features after sparse representation have good distinguishability and supplement the representation when the underlying information is ambiguous. This makes it easier for the classifier to obtain the effective information contained in the signal. Therefore, compared with the previous two structures, the fusion structure of the bottom layer and the top layer can more easily and accurately mine the effective information in the voice signal.
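Putting the three structures side by side, the sketch below shows one way the three fusion features of Eqs. (4)-(9) could be assembled; it reuses the hypothetical bpdn_codes() helper from the sketch in Sect. 3.1, and D_mc, D_m and D_c stand for dictionaries learned on the fused, MFCC and CQT features, respectively.

import numpy as np

def fusion_features(x_m, x_c, D_mc, D_m, D_c):
    # x_m: MFCC frames (d_m, T); x_c: CQT frames (d_c, T).
    # bpdn_codes() is the sparse-coding helper sketched in Sect. 3.1.
    x_mc = np.concatenate([x_m, x_c], axis=0)                   # low-level fusion input
    a_ll = bpdn_codes(D_mc, x_mc)                               # Eq. (4): code the fused low-level features
    a_hh = np.concatenate([bpdn_codes(D_m, x_m),                # Eqs. (5)-(7): code each mode separately
                           bpdn_codes(D_c, x_c)], axis=0)       # and stack the two sparse codes
    a_lh = np.concatenate([x_m, bpdn_codes(D_m, x_m)], axis=0)  # Eqs. (8)-(9): raw MFCC plus its sparse code
    return a_ll, a_hh, a_lh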
3.3 Speech recognition based on BiLSTM and feature fusion

Speech recognition is a typical signal processing problem with time characteristics. The hidden layer neurons in DNN and CNN are only affected by the current voice signal. RNN is a neural network structure that allows hidden layer neurons to have self-feedback loops. Its hidden layer input not only comes from the voice signal at the current moment, but also receives the memory information from the hidden layer at the previous moment. This structure mimics the speech perception ability of the human brain, that is, to predict the speech information of the next moment through historical speech information. In theory, the RNN model can model speech signals of any time sequence length. However, in the process of long sequence training, the RNN network produces high powers of the weight matrix, which leads to gradient disappearance or explosion. If the gap between the previous voice information and the current frame becomes very large, the RNN loses the ability to learn to connect such long-distance information.

Unit state, forgetting gate, input gate and output gate are all components of LSTM, a unique subclass of RNNs. The unit state is where all of the long-term memories from before this moment are kept. The forgetting gate eliminates unwanted or unnecessary information from long-term memory. Short-term memory is used by the input gate to update the permanent storage. The output gate combines the information stored in both the short- and long-term memories to produce a final output. The issue of long-term dependency of RNN has been resolved by LSTM, which can handle both long- and short-term memory. The traditional LSTM model is a one-way transmission structure, while speech recognition is a context-dependent phoneme distribution problem, so this paper uses BiLSTM to process speech. Two independent forward and backward LSTM networks process the front and back information, respectively, and the two outputs are then spliced to obtain the acoustic representation at that moment.

In this paper, an attention mechanism is added to the model, which searches for the inputs related to the current prediction output through the calculation of attention weights in the network. The closer the relationship, the greater the weight value assigned to the input vector. When the relevant input features contain more effective information, the impact of those features, once given greater weight, on the current output is greater. This can improve the recognition accuracy. Therefore, this section sends the fusion features with different levels of information extracted above into the BiLSTM network via the attention mechanism. This makes use of the characteristics of attention to strengthen the influence of relevant input characteristics on the current
prediction output, so as to achieve the effect of reducing the recognition error rate. The feature fusion speech recognition model based on attention and BiLSTM (FF-ATT-BiLSTM) is shown in Fig. 4.

[Fig. 4 FF-ATT-BiLSTM pipeline: the fusion features a_ll, a_lh and a_hh are embedded into e_1, ..., e_n, encoded by BiLSTM into h_1, ..., h_n, and pooled by attention to produce the output]

The structure of the BiLSTM model based on the attention mechanism is composed of a coding layer, an attention end and a decoding layer. In terms of model building, BiLSTM is selected for classification. Compared with the traditional GMM-HMM and support vector machine models, the specific structure of BiLSTM can connect information far away in the past to the current task. This can make full use of the voice information at the front and back ends and also avoid the problem of gradient disappearance. Moreover, this kind of serialized network structure is suitable for processing voice data.

When forecasting the current time step, the hidden layer features of all the current input times are matched with the predicted state of the previous time to calculate a score. The matching score can be calculated by a small neural network, which is trained to learn the relationship between the higher-order features and the predicted state at the last time. The key step is to use the softmax function, which passes the scores at all times through the softmax function to ensure that the sum of all weight values is 1. Finally, the weighted sum of the hidden layer features and the corresponding weights output by the coding end is the context vector. The specific relationship of the context vector at the first moment can be expressed as follows:

c_1 = \sum_{t=1}^{T} \xi_{1t} h_t    (10)

where \xi_{1t} is the attention weight assigned to the hidden feature h_t.

The attention mechanism gives greater weight to the parts of the input sequence most relevant to the current prediction output. According to this principle, when the input feature contains more effective information, the proportion of information in the input feature that is more relevant to the current output increases as the corresponding weight increases, and the impact of the relevant input characteristics, after being given greater weight, on the current output is greater. Therefore, this paper puts the multimodal features into the model based on the attention mechanism. This can increase the effective input information and also strengthen the effect of effective input on the output through the model.
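A minimal PyTorch sketch of this attention-BiLSTM classifier is given below; the embedding size, hidden size and the single-layer attention scoring network are assumptions made for illustration, since the paper only describes the pipeline of Fig. 4 (fusion features, embedding, BiLSTM, attention, output).

import torch
import torch.nn as nn

class FFATTBiLSTM(nn.Module):
    # Sketch of the FF-ATT-BiLSTM classifier: embedding of the fused features,
    # a BiLSTM encoder, attention pooling, and a linear output layer.
    def __init__(self, feat_dim, hidden_dim, n_classes, dropout=0.5):
        super().__init__()
        self.embed = nn.Linear(feat_dim, hidden_dim)
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)       # scores each time step
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):                              # x: (batch, T, feat_dim)
        e = torch.tanh(self.embed(x))
        h, _ = self.bilstm(e)                          # (batch, T, 2*hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)         # attention weights sum to 1 over T
        c = (w * h).sum(dim=1)                         # context vector, cf. Eq. (10)
        return self.out(self.dropout(c))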
In order to alleviate the over-fitting phenomenon in the training process, this paper uses the Dropout method, which randomly deletes hidden layer units in the model with a certain probability to improve the generalization ability of the model. In addition, this paper also sets up an Early Stopping mechanism during the training process. Training is stopped in time when the validation accuracy has not
improved for a certain number of validation rounds, so as to prevent the training process from over-fitting.
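The following training-loop sketch illustrates how these two measures could be combined; the optimizer, learning rate and patience value are illustrative assumptions rather than settings reported in the paper.

import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader,
                              max_epochs=100, patience=5, lr=1e-3):
    # Training sketch with Dropout (inside the model) and Early Stopping:
    # stop once validation accuracy has not improved for `patience` epochs.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_acc, wait = 0.0, 0
    best_state = copy.deepcopy(model.state_dict())
    for epoch in range(max_epochs):
        model.train()                                  # enables Dropout
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()                                   # disables Dropout for validation
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, wait = acc, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:                       # early stopping triggers here
                break
    model.load_state_dict(best_state)                  # keep the best validated weights
    return model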
4 Experimental results

4.1 Dataset and experimental environment

This work collects voice data during multimedia visual interaction to build two datasets, SRA and SRB. The two datasets contain different amounts of data. SRA has 20,318 training samples and 8026 test samples. SRB has 52,189 training samples and 19,053 test samples. The model is a deep learning model. The experimental environment is listed in Table 1.

Table 1 Experimental environment

Option                     Configuration
Operating system           Windows 10
CPU                        Intel(R) Core(TM) i7-9700
GPU                        NVIDIA RTX 3070
Deep learning framework    PyTorch 1.6
Development language       Python

The speech recognition task is essentially a classification task. The performance evaluation indicators selected for this work are the accuracy and the F1 score.

Acc = (TP + TN) / (TP + TN + FP + FN)    (11)

F1 = 2RP / (R + P)    (12)

where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative samples, and R and P denote recall and precision.
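In code, both indicators can be computed with scikit-learn as sketched below; Eq. (12) is the per-class definition, and macro averaging over classes is an assumption here, since the paper does not state how the scores are averaged.

from sklearn.metrics import accuracy_score, f1_score

# Toy labels; in practice y_true / y_pred are the test-set references and model predictions.
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 2, 0, 0, 2]
print("ACC =", accuracy_score(y_true, y_pred))              # Eq. (11)
print("F1  =", f1_score(y_true, y_pred, average="macro"))   # Eq. (12), macro-averaged over classes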
4.2 Analysis on FF-ATT-BiLSTM training

FF-ATT-BiLSTM is a deep learning model, so it must be trained and optimized first. This work first analyzes the training process of FF-ATT-BiLSTM, and the main analysis object is the loss during training, shown in Fig. 5. With the increase in training epochs, the loss of FF-ATT-BiLSTM on both datasets gradually decreases. When training reaches 60 epochs, the loss no longer changes significantly, which indicates that the model has converged. In addition, when the model converges, the loss of FF-ATT-BiLSTM on SRB is smaller. This is because the dataset is larger and the model training is more sufficient.

4.3 Method comparison

To further verify the progressiveness of FF-ATT-BiLSTM in the field of speech recognition, this paper compares it with other methods. The compared methods are RNN, LSTM, BiLSTM and Transformer. To ensure the consistency of the experiment, the experimental parameters are kept unchanged as much as possible; the results are in Table 2.

FF-ATT-BiLSTM proposed in this paper obtains the best performance. Specifically, FF-ATT-BiLSTM achieves 92.1% accuracy and a 90.5% F1 score on SRA, and 95.5% accuracy and a 92.8% F1 score on SRB. Compared with the currently popular Transformer algorithm, FF-ATT-BiLSTM achieves a 2.2% accuracy improvement and a 3.2% F1 score improvement on SRA. On SRB, the corresponding increases are 2.9 and 2.3%, respectively. These performances and improvements verify the feasibility and superiority of FF-ATT-BiLSTM in the field of speech recognition in the process of multimedia visual interaction.

To further verify the advantages of FF-ATT-BiLSTM, this paper compares the performance at different training stages, as shown in Fig. 6. At different stages of training, FF-ATT-BiLSTM achieves different degrees of performance improvement compared with other methods. This further verifies the superiority of this method.

4.4 Analysis on feature fusion

FF-ATT-BiLSTM proposes three different feature combinations and fuses the three sets of features. To verify the feasibility of this feature fusion, this paper carries out comparative experiments for different features. To ensure the comparability of the experiment, this work tries to keep the experimental parameters unchanged. The experimental results are shown in Table 3.

From the data shown in the table, we can see that the performance corresponding to the fusion of low-level features with low-level features is the lowest. If high-level features are combined with high-level features, the performance improves to a certain extent. The fusion of low-level features and high-level features can further improve the accuracy and F1 score. However, a single feature combination cannot achieve the best performance. Only after these three groups of features are fused can FF-ATT-BiLSTM achieve the highest speech recognition performance.
[Fig. 5 Training loss of FF-ATT-BiLSTM]

Table 2 Method comparison

Method           SRA ACC   SRA F1   SRB ACC   SRB F1
RNN              81.2%     80.1%    85.1%     83.3%
LSTM             84.1%     82.2%    86.9%     84.9%
BiLSTM           87.5%     85.9%    90.3%     88.7%
Transformer      89.9%     87.3%    92.6%     90.5%
FF-ATT-BiLSTM    92.1%     90.5%    95.5%     92.8%

Table 3 Analysis on feature fusion

Feature               SRA ACC   SRA F1   SRB ACC   SRB F1
a_ll                  88.2%     86.2%    90.3%     88.5%
a_hh                  89.3%     86.7%    92.0%     89.1%
a_lh                  90.5%     87.8%    93.6%     91.1%
a_ll + a_hh + a_lh    92.1%     90.5%    95.5%     92.8%

[Fig. 6 Training F1 score at different stages]
4.5 Analysis on improved KSVD

This work uses the improved KSVD algorithm to sparsely represent speech features. In order to verify the feasibility of this improvement measure, this paper compares the model recognition performance when the improvement measure is not used and when the improved KSVD algorithm is used. The experimental result is shown in Fig. 7.

[Fig. 7 Analysis on improved KSVD]

After the corresponding improvement of KSVD, the speech recognition performance of FF-ATT-BiLSTM is improved to a certain extent. Specifically, the accuracy and F1 score on SRA are improved by 1.7 and 1.6%, respectively. Similarly, the accuracy and F1 score on SRB are improved by 2.1 and 1.9%, respectively.

4.6 Analysis on attention

FF-ATT-BiLSTM uses attention to enhance features. To verify the role of attention in speech recognition, this work analyzes the model performance when attention is not used and when attention is used. The experimental results are shown in Table 4.

Table 4 Analysis on attention

Method           SRA ACC   SRA F1   SRB ACC   SRB F1
FF-BiLSTM        90.8%     89.1%    94.2%     91.6%
FF-ATT-BiLSTM    92.1%     90.5%    95.5%     92.8%

After embedding the attention module, the speech recognition performance of FF-ATT-BiLSTM is improved to a certain extent. Specifically, the accuracy and F1 score on SRA are improved by 1.3 and 1.4%, respectively. Similarly, the accuracy and F1 score on SRB are improved by 1.3 and 1.2%, respectively.

4.7 Analysis on BiLSTM

FF-ATT-BiLSTM uses BiLSTM to extract speech features. To verify the high-performance extraction effect of BiLSTM on speech features, this work analyzes the model performance when using LSTM and when using BiLSTM. The experimental results are shown in Fig. 8.

After embedding the BiLSTM module, the speech recognition performance of FF-ATT-BiLSTM is improved to a certain extent. Specifically, the accuracy and F1 score on SRA are improved by 1.2 and 1.0%, respectively. Similarly, the accuracy and F1 score on SRB are improved by 1.3 and 0.9%, respectively.

4.8 Analysis on Dropout

FF-ATT-BiLSTM uses Dropout to enhance the neural network. To verify the role of Dropout in speech recognition, this work analyzes the model performance when Dropout is not used and when Dropout is used. The experimental results are shown in Table 5.

After embedding the Dropout strategy, the speech recognition performance of FF-ATT-BiLSTM is improved to a certain extent. Specifically, the accuracy and F1 score on SRA are improved by 1.0 and 0.7%, respectively. Similarly, the accuracy and F1 score on SRB are improved by 0.7 and 0.6%, respectively.

4.9 Analysis on Early Stopping

FF-ATT-BiLSTM uses Early Stopping to constrain the neural network. To verify the role of Early Stopping in speech recognition, this work analyzes the model performance when Early Stopping is not used and when Early Stopping is used. The experimental results are shown in Fig. 9.
[Fig. 8 Analysis on BiLSTM]

Table 5 Analysis on Dropout

Method            SRA ACC   SRA F1   SRB ACC   SRB F1
Without Dropout   91.1%     89.8%    94.8%     92.2%
With Dropout      92.1%     90.5%    95.5%     92.8%

After embedding the Early Stopping strategy, the speech recognition performance of FF-ATT-BiLSTM is improved to a certain extent. Specifically, the accuracy and F1 score on SRA are improved by 1.8 and 2.0%, respectively. Similarly, the accuracy and F1 score on SRB are improved by 2.2 and 1.6%, respectively.

[Fig. 9 Analysis on Early Stopping]

5 Conclusion

With the continuous popularity of multimedia intelligent devices, the realization of multimedia visual interaction has gradually become the focus of research. Therefore, human–computer interaction has become a key research field. The accuracy of speech recognition technology has far-reaching significance in this field. The accuracy of a speech recognition system depends heavily on the accuracy of the classification model and the quality of the derived acoustic data. Therefore, it is required to include as much information as possible in the sound content. At the same time, this should also minimize the interference of information irrelevant to classification. Only distinguishable extracted features can make the features more conducive to recognition. This work proposes a new method for speech recognition in multimedia visual interaction. First of all,
this work considers the problem that a single feature cannot fully represent complex speech information. This paper proposes three kinds of feature fusion structures to extract speech information from different angles. This extracts three different fusion features based on the low-level features and higher-level sparse representation. Secondly, this work relies on the strong learning ability of the neural network and the weight distribution mechanism of the attention model. In this paper, the fusion feature is combined with BiLSTM and attention. The extracted fusion features contain more speech information with strong discrimination. When the weight increases, it can further improve the influence of features on the predicted value and improve the performance. Finally, this paper has carried out systematic experiments on the proposed method, and the results verify its feasibility.

Data availability The datasets used during the current study are available from the corresponding author upon reasonable request.

Declarations

Conflict of interest The authors declare no conflict of interest exists.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Zgank A (2022) Influence of highly inflected word forms and acoustic background on the robustness of automatic speech recognition for human-computer interaction. Mathematics 10(5):711
2. Liu M (2022) English speech emotion recognition method based on speech recognition. Int J Speech Technol 25(2):391–398
3. Šumak B, Brdnik S, Pušnik M (2022) Sensors and artificial intelligence methods and algorithms for human–computer intelligent interaction: a systematic mapping study. Sensors 22(1):20
4. Liu Y, Sivaparthipan CB, Shankar A (2022) Human–computer interaction based visual feedback system for augmentative and alternative communication. Int J Speech Technol 1:1–10
5. Sang Y, Chen X (2022) Human-computer interactive physical education teaching method based on speech recognition engine technology. Front Public Health 10:941083–941097
6. Markl N, Lai C (2021) Context-sensitive evaluation of automatic speech recognition: considering user experience & language variation. In: Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing, pp 34–40
7. Oh EY, Song D (2021) Developmental research on an interactive application for language speaking practice using speech recognition technology. Educ Tech Res Dev 69(2):861–884
8. Ran D, Yingli W, Haoxin Q (2021) Artificial intelligence speech recognition model for correcting spoken English teaching. J Intell Fuzzy Syst 40(2):3513–3524
9. Fu Q, Fu J, Zhang S et al (2021) Design of intelligent human-computer interaction system for hard of hearing and non-disabled people. IEEE Sens J 21(20):23471–23479
10. Pei J, Yu Z, Li J et al (2022) TKAGFL: a federated communication framework under data heterogeneity. IEEE Trans Netw Sci Eng 1:1–11
11. Weng Z, Qin Z, Tao X et al (2023) Deep learning enabled semantic communications with speech recognition and synthesis. IEEE Trans Wirel Commun 1:6227–6240
12. Subramanian AS, Weng C, Watanabe S et al (2022) Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition. Comput Speech Lang 75:101360
13. Oruh J, Viriri S, Adegun A (2022) Long short-term memory recurrent neural network for automatic speech recognition. IEEE Access 10:30069–30079
14. Fendji JLKE, Tala DCM, Yenke BO et al (2022) Automatic speech recognition using limited vocabulary: a survey. Appl Artif Intell 36(1):2095039
15. Bhangale KB, Kothandaraman M (2022) Survey of deep learning paradigms for speech processing. Wirel Pers Commun 125(2):1913–1949
16. Dua S, Kumar SS, Albagory Y et al (2022) Developing a speech recognition system for recognizing tonal speech signals using a convolutional neural network. Appl Sci 12(12):6223
17. Gupta AK, Gupta P, Rahtu E (2022) FATALRead-fooling visual speech recognition models: put words on lips. Appl Intell 1:1–16
18. Lu YJ, Chang X, Li C et al (2022) ESPnet-SE++: Speech enhancement for robust speech recognition, translation, and understanding. arXiv preprint arXiv:2207.09514
19. Agarwal P, Kumar S (2022) Electroencephalography-based imagined speech recognition using deep long short-term memory network. ETRI J 44(4):672–685
20. Shashidhar R, Patilkulkarni S, Puneeth SB (2022) Combining audio and visual speech recognition using LSTM and deep convolutional neural network. Int J Inf Technol 14(7):3425–3436
21. Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3(1):1–8
22. Graves A (2012) Connectionist temporal classification. In: Supervised sequence labelling with recurrent neural networks, pp 61–93
23. Chan W, Jaitly N, Le Q et al (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4960–4964
24. Graves A (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711
25. Waibel A, Hanazawa T, Hinton G et al (1989) Phoneme recognition using time-delay neural networks. IEEE Trans Acoust Speech Signal Process 37(3):328–339
26. Liu H, Zhao L (2019) A speaker verification method based on TDNN–LSTMP. Circ Syst Signal Process 38:4840–4854
27. Normandin Y (1996) Maximum mutual information estimation of hidden Markov models. In: Automatic speech and speaker recognition: advanced topics, pp 57–81
28. Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
29. Bahdanau D, Chorowski J, Serdyuk D et al (2016) End-to-end attention-based large vocabulary speech recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4945–4949
30. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30:1–11
31. Zhou S, Dong L, Xu S et al (2018) Syllable-based sequence-to-sequence speech recognition with the transformer in Mandarin Chinese. arXiv preprint arXiv:1804.10752
32. Zhang Y, Lu X (2018) A speech recognition acoustic model based on LSTM-CTC. In: IEEE 18th International Conference on Communication Technology (ICCT). IEEE, pp 1052–1055
33. Zhang S, Lei M, Yan Z et al (2018) Deep-FSMN for large vocabulary continuous speech recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5869–5873
34. Cheng X, Xu M, Zheng TF (2020) A multi-branch ResNet with discriminative features for detection of replay speech signals. APSIPA Trans Signal Inform Process 9:28
35. Sivaram G, Nemala SK, Elhilali M et al (2010) Sparse coding for speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp 4346–4349
36. Chen S, Saunders D (2001) Atomic decomposition by basis pursuit. SIAM Rev 43(1):129–159

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
