
ROBUST END-OF-UTTERANCE DETECTION FOR REAL-TIME SPEECH RECOGNITION APPLICATIONS

Ramalingam Hariharan, Juha Häkkinen, Kari Laurila
Nokia Research Center, Speech and Audio Systems Laboratory, Tampere, Finland
Email: {ramalingam.hariharan, juha.m.hakkinen, kari.laurila}@nokia.com

ABSTRACT
In this paper we propose a sub-band energy based end-of-utterance algorithm that is capable of detecting the time instant when the user has stopped speaking. The proposed algorithm finds the time instant at which enough of the sub-band spectral energy trajectories fall, and stay for a pre-defined fixed time, below adaptive thresholds, i.e. a non-speech period is detected after the end of the utterance. With the proposed algorithm a practical speech recognition system can give timely feedback to the user, thereby making the behaviour of the speech recognition system more predictable and more consistent across different usage environments and noise conditions. The proposed algorithm is shown to be more accurate and noise robust than previously proposed approaches. Experiments with both isolated command word recognition and continuous digit recognition in various noise conditions verify the viability of the proposed approach, with an average proper end-of-utterance detection rate of around 94% in both cases, representing a 43% error rate reduction over the most competitive previously published method.

1. INTRODUCTION

Despite recent improvements in automatic speech recognition (ASR) technology, there are still a number of challenges to be solved before ASR can be successfully utilised in various real-world environments by novice users. In 1997, Sagayama conducted an internet-based discussion of what is hindering the wider use of ASR [1]. The leading reasons found in the survey were:
- robustness against noise, utterance and channel variabilities is too poor;
- the human interface is too poor for users to utilise ASR comfortably.

One of the biggest reasons for the inadequacy of the human interface is the perceived unpredictability of ASR. Clearly the fact that speech recognisers make more errors than humans is the biggest source of unpredictability. Another important aspect of perceived predictability is the timing of the ASR system's response to the user. Sometimes the ASR system responds too early, interrupting the user's speech, and sometimes the user needs to wait too long for a response. This kind of behaviour greatly annoys users, and thus there is a clear need for ASR systems whose reply timing is more predictable and, furthermore, remains similar across a wide range of usage environments and noise conditions.

In order to give timely responses to the user, an ASR system should be able to detect when the user has stopped speaking; in our terminology, an end-of-utterance detection algorithm should be utilised. State-of-the-art end-of-utterance detectors typically rely on methods based on full-band spectral energy values, zero-crossing detectors, etc. [2]. However, as also shown in this paper, they do not seem to work well in noisy conditions. Another conventional method to detect the end-of-utterance (EOU) is to use the confidence score of the recognised utterance [3][4][5]. However, the confidence based EOU scheme also works poorly in noisy environments. The main reason is that acoustic models generally do not match noisy signals well, thus producing lower confidences. The spoken words may then not be detected, producing a long delay in the response to the user, especially in noisy conditions. The confidence measure is also likely to produce too-early detections¹ in a recognition system whose vocabulary contains words in which one utterance is part of another, longer utterance. For example, if there are utterances like "call" and "call home" and the utterance "call home" is spoken, it is likely that "call" is recognised, since it produces a very high confidence first.

In this paper, we describe a sub-band based EOU scheme which works well in various noise environments. The end-of-utterance detector is developed for both isolated and continuous word recognition applications. The detection is based on finding the time instant at which enough sub-band energies fall below adaptive thresholds; in other words, a non-speech region is detected after the utterance has been spoken. It is verified through experiments that better noise robustness can be obtained by combining the independent decisions of the different sub-bands. The description of the EOU scheme and the associated experimental details for isolated word recognition are given in the next two sections. Section 4 describes the adaptation of the algorithm for connected word recognition. Finally, conclusions are drawn in the last section.

2. END-OF-UTTERANCE FOR ISOLATED WORD RECOGNITION

The end-of-utterance algorithm for isolated word recognition tries to estimate the time instant when the user has stopped speaking and remained quiet for a pre-defined period of time.
¹ An utterance is said to be detected too early when the system stops recognition before the end of the spoken utterance; hence only part of the utterance is actually recognised.


This time instant is found by examining the sub-band energy trajectories of the utterance. The algorithm uses M sub-band energies, obtained from the mel-scaled energy values, to detect the end of utterance. The log sub-band energies are then used as input to a rank order filtering block, described later in this section, to find an end-of-utterance hypothesis for each sub-band separately. If enough individual sub-bands (n out of M) trigger their sub-band end-of-utterance hypothesis, the end-of-utterance is said to be detected and recognition is stopped.

The sub-band energy trajectories obtained from speech contain many impulse-type disturbances, especially in adverse conditions. Since it is well known that non-linear filtering techniques can be employed to reduce the effect of impulsive noise, we carried out a series of experiments and developed a rank-order filtering scheme, which is described in detail below.

The end-of-utterance is detected separately in each sub-band according to a rank order filtering based EOU scheme, described using the state machine representation shown in Figure 1. The sub-band energy at time t is denoted as p(t). At the beginning of recognition (state S0 in Figure 1), the initialisation init() is done: the counter C is set to zero, the global minimum power p_min(t=1) is set to +∞ and the global maximum power p_max(t=1) is set to -∞. This is followed by an instant transition into state S1. The function f() in states S1-S3 consists of the following actions:
1) p(t) is buffered (a FIFO of the N last frame energies is used);
2) the global minimum and maximum powers, p_min(t) and p_max(t), are computed from the sliding buffer as given by the following equations:
window_min(t) = min{ p(t-N+1), p(t-N+2), ..., p(t) }          (1)
window_max(t) = max{ p(t-N+1), p(t-N+2), ..., p(t) }

p_min(t) = min{ p_min(t-1), window_max(t) }                   (2)
p_max(t) = max{ p_max(t-1), window_min(t) }

3) the median power p(t)_m is computed as the median of the N buffered frame energy values; this median power is used for the comparison against a threshold;
4) the threshold is calculated as

thr = p_min + k (p_max - p_min), where 0 < k < 1          (3)

[Figure 1 is a state-machine diagram with states S0-S4: init() in S0 leads to S1, where f() is executed and the counter C is increased while C < BEG. Once C >= BEG, the machine moves between S2 (entered when p(t)_m >= thr; C is reset to 0) and S3 (entered when p(t)_m < thr; C is increased), and reaches the final state S4 when C >= END.]
Figure 1 State machine describing the rank order filtering scheme.

The algorithm to detect the end-of-utterance in each sub-band can be summarised as follows. First the counter C is initialised to zero. The counter value is then increased for every frame in which p(t)_m is less than the estimated threshold. If p(t)_m stays below the threshold for long enough, it is assumed that speech has ended in that sub-band, and the end-of-utterance is detected in that sub-band. If p(t)_m exceeds the threshold, the counter is reset to zero and counting starts again (this means that speech is still detected in that sub-band). As shown in Figure 1, the comparison of p(t)_m against the threshold is initiated only after BEG frames. When the counter C reaches END, the recognition is stopped and state S4 is reached.

Figure 2 shows the energy contour of the lowest sub-band for the utterance "Pekka Kapanen" up to the detection of the end-of-utterance. The figure also shows the contours of p_min(t), p_max(t) and the threshold.

[Figure 2 plots the lowest sub-band energy (in dB) against the frame number, together with p_min, p_max and the threshold; the end-of-utterance detection point is marked.]

Figure 2 An example of end-of-utterance detection in the lowest sub-band for the utterance "Pekka Kapanen".
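For concreteness, the sketch below shows one way the per-band detection logic described above could be implemented. The constants (buffer length, threshold factor k, BEG and END) are illustrative placeholders rather than values from the paper, and the class and function names are ours.

```python
from collections import deque
from statistics import median


class SubBandEOUDetector:
    """Per-band end-of-utterance detection, following equations (1)-(3)
    and the Figure 1 state machine (illustrative constants)."""

    def __init__(self, n_buf=10, k=0.3, beg=30, end=80):
        self.buf = deque(maxlen=n_buf)   # FIFO of the N last log sub-band energies
        self.k = k                       # threshold position between p_min and p_max
        self.beg = beg                   # frames before threshold comparison starts (BEG)
        self.end = end                   # consecutive non-speech frames required (END)
        self.p_min = float("inf")        # global minimum power, initialised to +inf
        self.p_max = float("-inf")       # global maximum power, initialised to -inf
        self.counter = 0                 # counter C of Figure 1
        self.frames = 0

    def update(self, p_t):
        """Feed one log sub-band energy p(t); return True once this band triggers."""
        self.buf.append(p_t)
        self.frames += 1
        # Equations (1)-(2): window extrema and rank-order style global update.
        window_min, window_max = min(self.buf), max(self.buf)
        self.p_min = min(self.p_min, window_max)
        self.p_max = max(self.p_max, window_min)
        # Steps 3-4: median of the buffered energies and the adaptive threshold (3).
        p_med = median(self.buf)
        thr = self.p_min + self.k * (self.p_max - self.p_min)
        if self.frames < self.beg:       # state S1: comparisons not yet started
            return False
        if p_med < thr:
            self.counter += 1            # state S3: possible non-speech, keep counting
        else:
            self.counter = 0             # state S2: speech still present, reset
        return self.counter >= self.end  # state S4: EOU detected in this sub-band


def utterance_ended(detectors, frame_energies, n_required):
    """n-out-of-M fusion: feed one frame of M sub-band energies and report EOU."""
    triggered = sum(det.update(p) for det, p in zip(detectors, frame_energies))
    return triggered >= n_required
```

In a full system, M such detectors would run on the mel sub-band log energies of every frame, and recognition would be stopped as soon as n of them have triggered, as described above.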

3. EXPERIMENTS FOR ISOLATED WORD RECOGNITION

Experiments were initially carried out to test the end-of-utterance detection for isolated word recognition. The results were compared with baseline confidence score based and full-band energy based EOU detection. Some of the terminology used in the result tables is explained here. The target of the end-of-utterance detection system is to detect the time instant when speech ends with a delay of exactly 0.80 seconds (80 frames in the test system). As shown in Figure 3, an early detection in this context means that the end-of-utterance has been detected earlier than 400 ms (40 frames) after the actual end of speech. Recognition errors due to end-of-utterance failures can occur only when the end-of-utterance is detected before the actual speech ends. A late detection means that the delay is significantly longer than 0.8 seconds (greater than 1.2 seconds for isolated word detection and 1.35 seconds for continuous speech detection), which is unpleasant for the user but does not necessarily deteriorate the recognition accuracy. The utterances for which the EOU algorithm fails to stop the recognition at all are termed algorithm failures; here the processing continues until the end of the speech file is reached. It is to be noted that all the utterance files used in the experiments had a sufficiently long non-speech region at the end of the utterance. All other utterances, i.e. those that do not fall into any of the above three categories, produced proper end-of-utterance detections and are classified as proper detections. Hand-marked label files were used to determine the actual end of speech for these measurements.

[Figure 3 shows a timeline of the number of frames after the end of the utterance: detections before 40 frames are early detections, detections between 40 and 120 frames (135 for continuous speech) are proper detections, later detections are late detections, and reaching the end of the file without detection is an algorithm failure.]

Figure 3 Explanation of the terminology used in the result tables.
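As a compact restatement of this taxonomy (the function below is ours; it assumes 10 ms frames, i.e. 80 frames = 0.8 s, which matches the frame counts quoted above):

```python
def classify_detection(frames_after_speech_end, detected, proper_limit=120):
    """Bucket an EOU result by its delay (in frames) after the hand-marked
    end of speech; use proper_limit=135 for the continuous-speech tests."""
    if not detected:
        return "algorithm failure"      # end of the speech file reached without EOU
    if frames_after_speech_end < 40:
        return "early detection"        # earlier than 400 ms after speech ends
    if frames_after_speech_end <= proper_limit:
        return "proper detection"       # close to the 0.8 s (80 frame) target
    return "late detection"             # noticeably longer than the target delay
```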

3.1 Isolated Command Word Recognition


The isolated command word experiments were carried out for the following schemes:
- experiments using the confidence measure;
- experiments using single frequency band energy;
- experiments using the multi-channel sub-band energy based end-of-utterance scheme.

The database contained 60 confusable single- and two-word Finnish isolated phrases, spoken by a single speaker, with each utterance repeated 12 times. The tests were conducted by artificially adding both car noise (a Volkswagen driving at 115 km/h) and music (pop music by Bruce Springsteen) simultaneously to clean speech at different SNR combinations.

The first baseline experiments used the confidence measure for detecting the end-of-utterance. This confidence was based on the difference between the word model and garbage model log-likelihood scores [3]. The confidence score of each model was normalised by the length of the recognised word and compared against a pre-defined confidence threshold, which was optimised to obtain the best results. Table 1 gives the results of the experiments with the optimised confidence threshold. The results show that this scheme works well in clean and mildly noisy conditions, but fails at low SNRs. On average, about 48% of the utterances are never detected by the EOU algorithm.
Noise (dB)  Music (dB)  Proper (%)  Early (%)  Late (%)  Failure (%)
-           -           91.7        0.2        5.9       2.3
0           -           68.6        3.0        0.8       27.6
-5          -           23.5        2.4        0.2       73.9
-           10          65.9        0.9        2.1       31.1
0           10          31.5        1.5        0.5       66.5
-5          10          10.9        1.1        0.6       87.4
Average                 48.7        1.5        1.7       48.1

Table 1 Results with the confidence based EOU scheme ("Proper", "Early" and "Late" denote proper, early and late detections, "Failure" denotes algorithm failure; "-" denotes that no noise was added).

The results using the energy-based scheme with a single frequency band are given in Table 2, while Table 3 gives the results for the proposed multi-channel sub-band energy based EOU scheme. The results show that the single-channel energy based scheme performs better than the confidence based method, especially in very high car noise, but performs badly in the presence of background music. For the typical example of a person travelling in a car at 115 km/h with music in the background, represented by 0 dB car noise and 10 dB music, this scheme produces only 86.2% proper detections. The best results are obtained with the multi-channel sub-band based scheme, which works well with both car noise and background music and produces very few algorithm failures compared to the previous schemes. In the example case (0 dB car noise and 10 dB music), the proposed algorithm provides 95.2% proper detections, a relative improvement of almost 65.2% over the single channel scheme. Overall, the multi-channel sub-band based end-of-utterance scheme yields more than 93% proper detections.

Noise (dB)  Music (dB)  Proper (%)  Early (%)  Late (%)  Failure (%)
-           -           99.7        0.3        0.0       0.0
0           -           94.1        5.3        0.3       0.3
-5          -           87.6        6.8        1.8       3.8
-           10          88.2        0.6        3.9       7.3
0           10          86.2        5.0        3.3       5.5
-5          10          74.9        6.4        8.3       10.5
Average                 88.4        4.1        3.0       4.5

Table 2 Results for the single-channel energy based EOU scheme.

Noise (dB)  Music (dB)  Proper (%)  Early (%)  Late (%)  Failure (%)
-           -           100.0       0.0        0.0       0.0
0           -           97.4        2.4        0.2       0.0
-5          -           91.7        4.2        3.0       1.1
-           10          88.6        0.9        7.3       3.2
0           10          95.1        1.5        2.6       0.8
-5          10          87.3        2.3        7.0       3.5
Average                 93.4        1.9        3.3       1.4

Table 3 Results using the multi-channel sub-band energy based EOU scheme.

A comparison of the histograms of the time instants at which the end-of-utterance was detected (given as the number of frames after the real end of the speech utterance) illustrates the superior performance of the multi-channel scheme over the conventional full-band energy method. These histograms are shown in Figure 4 for the particular case of speech mixed with 0 dB car noise and 10 dB music. Ideally, both methods should detect the end of an utterance exactly 80 frames after the end of speech. The figure shows that the single-channel scheme produces too many early detections and has a higher spread than the multi-channel method, whereas the multi-channel end-of-utterance scheme shows a very high concentration around the ideal 80-frame delay.

[Figure 4 plots, for the single- and multi-channel schemes, the number of utterances against the number of frames after the end of speech at which the end-of-utterance was detected.]

Figure 4 Histograms of the time instant after the end of speech at which the end-of-utterance is detected for the single- and multi-channel based schemes, for speech mixed with 0 dB car noise and 10 dB music. The target is to detect the end-of-utterance after exactly 80 frames.
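As a sanity check (not part of the original paper), the 43% relative error rate reduction quoted in the abstract and conclusions can be reproduced from the Table 2 and Table 3 averages, reading "error" as the fraction of utterances without a proper detection:

```python
# Average proper-detection rates from Tables 2 and 3 (in %).
single_channel_proper = 88.4
multi_channel_proper = 93.4

err_single = 100.0 - single_channel_proper   # 11.6% improper detections
err_multi = 100.0 - multi_channel_proper     # 6.6% improper detections

reduction = (err_single - err_multi) / err_single
print(f"Relative error rate reduction: {reduction:.1%}")   # about 43.1%
```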

4. END-OF-UTTERANCE FOR CONTINUOUS SPEECH RECOGNITION


One of the major problems in detecting the end of continuous speech, as opposed to isolated speech, is the presence of inter-word pauses. We have also observed that people often tend to decrease their speaking effort towards the end of a multi-word utterance in real-world conditions. Because of these effects, we expected some performance degradation when testing the end-of-utterance scheme proposed in the previous section on continuous speech recognition. Indeed, the isolated-word end-of-utterance algorithm produced more than 50% early detections when applied to the recognition of spoken digit sequences.

The major change in the algorithm for continuous speech is the use of intermediate recognition results in addition to the individual sub-band energy information. The intermediate recognition result gives the number of recognised words up to the given frame and is updated as soon as a new word is recognised. Detection is disabled for a fixed number of frames after an intermediate result, which reduces the risk of false early EOU detections. This requirement is combined with the earlier condition for isolated word EOU detection (n out of M sub-bands should trigger EOU) for stopping the recognition.

To cater for the changes in speaking effort, a shorter-term adaptation of the global maximum and minimum powers, used to compute the threshold, is carried out. The change in the buffered power values is used to determine the rate of adaptation of p_max and p_min. The maximum and minimum values of the buffered sub-band energy values (win_max and win_min) are computed and compared with the current p_max and p_min. The updates of p_max and p_min are given by

p_min = (1 - α) p_min + α win_min          (4)
p_max = (1 - α) p_max + α win_max

where 0 < α < 1. The update coefficient α can be made bigger (or smaller) depending on whether the difference p_min - win_min or win_max - p_max is bigger (or smaller), to further improve the adaptation to varying speaking effort.
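A minimal sketch of this adaptation (equation (4)) is given below; the two-level choice of α and the numeric constants are our own illustration, since the paper does not specify the exact rate schedule.

```python
def adapt_envelopes(p_min, p_max, win_min, win_max,
                    alpha_slow=0.01, alpha_fast=0.05, gap_db=3.0):
    """Exponentially adapt the global envelopes towards the current window
    extrema, adapting faster when an envelope has drifted far from the
    buffered values (gap_db and alpha_* are illustrative constants)."""
    a_min = alpha_fast if (p_min - win_min) > gap_db else alpha_slow
    a_max = alpha_fast if (win_max - p_max) > gap_db else alpha_slow
    p_min = (1.0 - a_min) * p_min + a_min * win_min
    p_max = (1.0 - a_max) * p_max + a_max * win_max
    return p_min, p_max
```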

4.1 Speaker Independent Digit Dialler Experiments

The end-of-utterance detection for continuous speech recognition was tested on a speaker independent digit-dialling task. The test database consisted of Finnish connected digit utterances spoken in a clean environment. There were 375 speakers with an approximately equal number of males and females, giving a total of 3686 six-digit utterances. During active listening, when the mobile phone waits for an activation phrase (to make a call), there is likely to be both stationary noise and music in the background [6]. However, once the phone has been activated, people tend to reduce the volume of the music during name or digit dialling. Hence, unlike the previous command word recognition tests, the digit dialling experiments were conducted only in stationary car noise environments. The results of using the modified multi-channel sub-band scheme for end-of-utterance detection are shown in Table 4. The algorithm was tuned to produce very few early detections, so that inter-word pauses are not detected as the end of an utterance. The algorithm works well in both the clean environment and in car noise conditions: the results in Table 4 show that, on average, almost 94% of the connected digit utterances are detected properly by the proposed end-of-utterance scheme.

Noise Type  SNR (dB)  Proper (%)  Early (%)  Late (%)  Failure (%)
-           -         97.6        1.3        0.4       0.7
Car         10        95.8        0.6        3.3       0.4
Car         5         94.8        0.9        3.9       0.4
Car         0         88.0        4.6        6.9       0.5
Average               94.0        1.8        3.6       0.5

Table 4 Results using the proposed sub-band EOU scheme for the connected digit recognition task.

5. CONCLUSIONS

We have proposed a robust multi-channel sub-band energy based end-of-utterance detector that works well in various realistic noise environments. The end-of-utterance detector is essential for improving the predictability and usability of a practical speech recognition system, as it helps to provide a timely response to the user. The proposed multi-channel method is shown to outperform the single-band energy scheme, with a relative reduction of more than 43% in improper end-of-utterance detections for an isolated command word recognition task. The scheme is found to be robust to both stationary and non-stationary noise conditions, which is essential for speech recognition applications such as hands-free voice activation in a car environment. The algorithm was also adapted to the detection of the end of continuous speech. Experiments with a connected digit dialling task yielded an average of 94% proper end-of-utterance detections in a stationary car noise environment, including very low SNRs.

REFERENCES

[1] S. Sagayama, K. Aikawa, "Issues Relating to the Future of ASR for Telecommunications Applications", Proc. ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 75-82, France, 1997.
[2] L. Lamel, R. Rabiner, J. Rosenberg, J. Wilpon, "An Improved End-Point Detector for Isolated Word Recognition", IEEE ASSP Magazine, pp. 777-785, 1981.
[3] O. Viikki, K. Laurila, P. Haavisto, "A Confidence Measure for Detecting Recognition Errors in Isolated Word Recognition", Proc. International Conference on Speech Science and Technology, pp. 67-72, Adelaide, Australia, 1996.
[4] H. Boulard, B. D'hoore, J.-M. Boite, "Optimizing Recognition and Rejection Performance in Wordspotting Systems", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-373 - I-376, Australia, 1994.
[5] M. G. Rahim, C.-H. Lee, B.-H. Juang, "Robust Utterance Verification for Connected Digits Recognition", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 285-288, Detroit, USA, 1995.
[6] J. Iso-Sipilä, K. Laurila, R. Hariharan, O. Viikki, "Hands-Free Voice Activation in Car Environments", Proc. Eurospeech '97, pp. 2375-2378, Rhodos, Greece, 1997.
