SPECTRAL ENTROPY AS SPEECH FEATURES FOR SPEECH RECOGNITION


Aik Ming Toh, School of Electrical, Electronic, and Computer Engineering, The University of Western Australia
Roberto Togneri, School of Electrical, Electronic, and Computer Engineering, The University of Western Australia
Sven Nordholm, Western Australian Telecommunications Research Institute

Abstract— This paper presents an investigation of spectral entropy features, used for voice activity detection, in the context of speech recognition. Entropy is a measure of disorganization and can be used to measure the peakiness of a distribution. We compute the entropy features from the short-time Fourier transform spectrum, normalized as a PMF. The concept of entropy implies that voiced regions of speech have lower entropy since they contain clear formants, whereas the flat distribution of silence or noise induces high entropy values. In this paper, we investigate the use of entropy as a speech feature for speech recognition. We evaluate different sub-band spectral entropy features on the TI-DIGIT database. We have also explored the use of multi-band entropy features to create higher dimensional entropy features. Furthermore, we append the entropy features to baseline MFCC 0 and evaluate them in clean, additive babble noise and reverberant environments. The results show that entropy features improve the baseline performance and robustness in additive noise.

I. INTRODUCTION

Speech recognition systems typically use speech features based on the short-term spectrum of the speech signal. The state-of-the-art feature used in most speech recognizers is the Mel-frequency cepstral coefficients (MFCC), with enhancements such as regression features and normalization strategies. Other speech features such as perceptual linear prediction (PLP) and its variant RASTA [1] are also popular. An additional type of feature, based on entropy, has recently emerged in the context of speech recognition. It is also known as the Wiener entropy since it measures the spectral flatness of the power spectrum. Misra proposed the use of entropy features as speech features for speech recognition [2].

Entropy is usually used in the context of pattern classification and information theory. Entropy was originally defined for information sources by Shannon [3]. It is a measure of disorganization or uncertainty in a random variable; the information content of an outcome can be interpreted as the negative logarithm of its probability.

The application of the entropy concept to speech recognition is based on the assumption that the speech spectrum is more organized during speech segments than during noise segments. In addition, the spectral peaks of the spectrum are supposed to be more robust to noise. Thus a voiced region of speech would induce low entropy since there are clear formants in the region, whereas the spectra of noise or unvoiced regions would have a flatter distribution and thus higher entropy. This concept has enabled entropy to be considered for voice activity detection [4] and speech recognition [2].

In this paper we investigate the entropy feature for its performance in speech recognition. Misra evaluated entropy features for phoneme recognition; we extend the entropy features to connected digit recognition on the TI-DIGIT database and study their robustness in noisy environments. In addition, we append the entropy features to MFCC 0 features rather than the PLP features used in [5]. Our experiments in [6] showed that MFCC 0 outperformed the PLP features in both recognition performance and robustness, and we want to determine the contribution of entropy features on top of this state-of-the-art feature. We have also generated multi-band entropy features from smaller sub-band entropy features. Furthermore, we evaluate the entropy features with MFCC 0 in additive babble noise and reverberant noise for robustness.

The paper is organized as follows: Section II presents an overview of the spectral entropy features and their derivation, Section III describes the sub-band entropy features, Section IV explains the experimental setup, and Section V reports the results. The final section presents the conclusions of the work.

II. SPECTRAL ENTROPY

Entropy has been used to detect silence and voiced regions of speech in voice activity detection. The discriminatory property of this feature gives rise to its use in speech recognition. Entropy can be used to capture the formants or the peakiness of a distribution. Formants and their locations have been considered important for speech tracking; thus, the peak-capturing ability of entropy was employed for speech recognition.

We converted the spectrum into a probability mass function (PMF) by normalizing the spectrum in each sub-band. Misra also suggested computing the entropy from the full-band normalized spectrum [5]. Equation (1) is used for sub-band normalization:

    x_i = X_i / ( Σ_{j=1}^{N} X_j ),   for i = 1 to N        (1)

where X_i represents the energy of the i-th frequency component of the spectrum and x_i is the resulting PMF of the spectrum.
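As a minimal sketch (assuming numpy), the PMF normalization of Eq. (1) followed by the Shannon entropy computation of Eq. (2) for one analysis frame can be written as:

```python
import numpy as np

def spectral_entropy(power_spectrum):
    """Full-band spectral entropy of one STFT frame.

    Normalizes the spectral energies into a PMF (Eq. 1) and
    computes the Shannon entropy in bits (Eq. 2)."""
    x = np.asarray(power_spectrum, dtype=float)
    pmf = x / x.sum()                    # Eq. (1): x_i = X_i / sum_j X_j
    pmf = pmf[pmf > 0]                   # convention: 0 * log2(0) = 0
    return -np.sum(pmf * np.log2(pmf))   # Eq. (2)
```

A flat, noise-like spectrum of N points attains the maximum entropy of log2(N) bits, while a peaky spectrum with clear formants scores lower, which is the discriminative behavior the paper relies on.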
Fig. 1. The entropy contour of the connected digit utterance "1-9-8-6" in clean conditions (dashed line) and corrupted with babble noise at SNR 10dB (solid line).

Fig. 2. The entropy contour of the digit utterance "1-9-8-6" in clean conditions (dashed line) and corrupted speech with RT 0.2s (solid line).

The area under the normalized spectrum in each sub-band should sum to 1. The normalized spectra were treated as a PMF and used for the entropy computation. The entropy was computed with equation (2):

    H(X) = − Σ_{i=1}^{N} x_i · log2(x_i)        (2)

Figure 1 shows the full-band entropy contour of the speech utterance "1-9-8-6". The dashed line illustrates the reference, i.e. the entropy contour of the clean utterance, while the solid line represents the entropy contour of the utterance corrupted by babble noise at SNR 10dB. The figure shows that the entropy feature managed to track most of the formants, represented by the troughs of the contour, even at the low SNR of 10dB. The beginning and end of the entropy contour show that the entropy of the distribution increases as the level of noise increases. We have also compared the full-band contours of the utterance under babble noise at 20dB and 10dB. The plots revealed that the locations of the peaks remained largely in place and the noise-region contours were almost identical. The entropy features were thus able to discriminate the speech region from the noise region.

We have also carried out the same analysis for reverberant noise, and initial results on slight reverberation indicate that spectral entropy may not be suitable for speech recognition in reverberant conditions. Figure 2 illustrates the contours of the clean utterance (dashed line) and the utterance under reverberant influence of RT 0.2s. The contour of the reverberant utterance has been shifted. Reverberation has been shown to introduce a temporal smearing effect [6]. The formants were displaced by this temporal smearing, which is not ideal for speech recognition.

III. SUB-BANDS ENTROPY

The full-band entropy captures the gross peakiness of the spectrum. We partitioned the STFT spectrum into sub-bands for improved resolution. The distribution is separated into K non-overlapping sub-bands. We refer to these as K sub-band entropy features, whereas Misra called them multi-band entropy. Owing to the number of points in the STFT spectrum, we divided the distribution into sub-bands of equal size, with the remainder allocated to the last sub-band. The K values of interest were the full-band, 2, 3, 4, 5, 6, 8, 12 and 13 sub-bands. The performance of the 12 and 13 sub-band entropy features will be compared against the baseline MFCC and MFCC 0 features. Initially we only proposed to compute entropy features up to 13 sub-bands, but Misra [5] reports that 16, 24 and 32 sub-bands yield surprisingly good performance, so we decided to investigate those sub-bands as well.

We also appended the sub-band entropy features together to create larger dimensional entropy features. Misra [2] only evaluated the performance of a 15-dimensional entropy feature computed from the full-band, 2, 3, 4 and 5 sub-band entropy features. We computed several multi-dimensional entropy features to show the contribution of smaller sub-bands in comparison with conventional entropy features of the same size.

IV. EXPERIMENTAL SETUP

The TI-DIGIT corpus was used in the speech recognition experiments. The database comprises both isolated and connected digit utterances. The training data contained utterances of 24 male and 24 female speakers; there were 8 male and 8 female speakers in the testing data. Each subset was composed of 77 digit utterances.

The babble noise from the NOISEX 92 database was used to corrupt the testing data for evaluation in noisy environments. The babble noise represents a real-world non-stationary environmental noise. The test data was corrupted with babble noise at five different signal-to-noise ratios (SNRs) ranging from 0dB to 40dB at intervals of 10dB.

Reverberant effects were captured by estimating the impulse response of the room environment from long segments of speech. The experiment used a room impulse response designed to match the characteristics of a room 2.2m high, 3.1m wide and 3.5m long. The microphone and the speakers were located 0.5m from the walls at opposite ends. The speech was convolved with the RT60 room impulse response.
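The sub-band partitioning described above (equal-width sub-bands, with the remainder of the STFT points allocated to the last sub-band, each normalized into its own PMF) can be sketched as follows, assuming numpy:

```python
import numpy as np

def subband_entropies(power_spectrum, k):
    """Entropy of each of k non-overlapping sub-bands of one frame.

    The spectrum is split into k equal-width sub-bands, with any
    remainder points allocated to the last sub-band; each sub-band
    is normalized into its own PMF before the entropy is computed."""
    x = np.asarray(power_spectrum, dtype=float)
    n, width = len(x), len(x) // k
    feats = []
    for b in range(k):
        band = x[b * width:(b + 1) * width if b < k - 1 else n]
        pmf = band / band.sum()
        pmf = pmf[pmf > 0]                      # 0 * log2(0) = 0
        feats.append(-np.sum(pmf * np.log2(pmf)))
    return np.array(feats)
```

With the paper's 257-point STFT spectrum and k = 4, for example, the first three sub-bands hold 64 points each and the last holds 65.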
TABLE I
WORD ERROR RATES FOR SUB-BAND ENTROPY FEATURES

    Entropy        WER (%)
    Full-band      46.96
    2 sub-bands    39.52
    3 sub-bands    34.98
    4 sub-bands    31.87
    5 sub-bands    32.96
    6 sub-bands    25.67
    8 sub-bands    24.31
    12 sub-bands   23.50
    13 sub-bands   24.78

TABLE II
WORD ERROR RATES FOR DIMENSIONAL ENTROPY FEATURES

    DIM Entropy              WER (%)
    12 sub-bands (3,4,5)     21.31
    12 sub-bands (2,4,6)     18.98
    13 sub-bands (1,3,4,5)   18.96
    13 sub-bands (1,4,8)     15.79
    13 sub-bands (1,2,4,6)   17.87
    14 sub-bands (2,4,8)     16.49
    15 sub-bands (1,2,3,4,5) 20.54
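The dimensional entropy features in Table II are concatenations of several sub-band partitions; for example, "13 sub-bands (1,4,8)" stacks the full-band, 4-band and 8-band entropies into one 13-dimensional vector. A minimal sketch, assuming numpy; `subband_entropies` is a hypothetical helper implementing the equal-width partition with the remainder in the last band:

```python
import numpy as np

def subband_entropies(spectrum, k):
    """Entropy of each of k equal sub-bands (remainder in the last)."""
    x = np.asarray(spectrum, dtype=float)
    width = len(x) // k
    out = []
    for b in range(k):
        band = x[b * width:(b + 1) * width if b < k - 1 else len(x)]
        p = band / band.sum()
        p = p[p > 0]
        out.append(-np.sum(p * np.log2(p)))
    return np.array(out)

def dimensional_entropy(spectrum, partitions=(1, 4, 8)):
    """Stack several sub-band partitions into one feature vector;
    (1, 4, 8) gives a 1 + 4 + 8 = 13 dimensional feature."""
    return np.concatenate([subband_entropies(spectrum, k) for k in partitions])
```

Changing `partitions` to, e.g., (2, 4, 8) or (1, 2, 3, 4, 5) reproduces the 14- and 15-dimensional variants listed in the table.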

The number of filter coefficients was adjusted according to the reverberation time.

We chose MFCC 0 features as the baseline because of their broad use in speech research and their adoption as the state-of-the-art feature in speech recognition systems. In our previous study [6], we showed that the PLP features fell short in both recognition performance and robustness, so we adopted MFCC 0 as the baseline feature. All speech files were pre-emphasized and windowed with a Hamming window. The speech signal was analyzed every 10ms with a frame width of 25ms. The number of points of the STFT spectrum is 257. A Mel-scale triangular filterbank with 26 filterbank channels was used to generate the Mel-frequency cepstral coefficient (MFCC) features. The MFCC 0 coefficients comprise the 12 static MFCC coefficients and the zeroth cepstral coefficient. The HMM model used 15 states and 5 mixtures for the connected digit recognition. We did not use any penalty factor to optimize the recognition accuracy in these experiments.

V. EXPERIMENT RESULTS

A. Subband Entropy

Table I shows the word error rates (WERs) for spectral entropy features up to 13 sub-bands. We conjectured that entropy features with more than 5 sub-bands would contribute more to the recognition performance, which was not shown in [2]. We could observe the contribution of entropy features with better resolution: the WERs decreased as the dimension of the entropy increased, as shown in Table I. However, the 12 and 13 sub-band entropy features did not outperform the baseline MFCC and MFCC 0, whose WERs were just 2.97% and 2.05% respectively.

Misra extended their entropy computations to 32 sub-bands in [5], reporting that 16-band, 24-band and 32-band entropy features gave WERs of 15% to 18%. Our results give a different perspective, as the WERs for more than 16 sub-bands were typically above 25%. This raises the issue of redundancy in excessive sub-band entropy computation.

B. Dimensional Entropy

Table II displays the WERs for dimensional entropy computed from smaller sub-bands. The dimensional entropy features performed better than the conventional sub-band entropy features. We obtained an improvement of about 5% to 9% for the 13-dimensional entropy compared to the 13 sub-band entropy features. The 14- and 15-dimensional entropy features also outperformed the conventional sub-band features, by 10.5% and 9.7% respectively; the WERs for the conventional 14 and 15 sub-band entropy features were 27.03% and 30.30% respectively.

Our primary aim in investigating the 12- and 13-dimensional sub-bands was to evaluate their performance against the baseline MFCC and MFCC 0 features. The results showed that the dimensional entropy features were still unable to match the performance of the baseline feature, indicating that spectral entropy features are not suitable as baseline speech features. We therefore decided to use them as additional features, appending them to our baseline MFCC 0 feature for speech recognition.

C. MFCC and Subband Entropy

Since the entropy features were not competitive with the baseline MFCC 0 features on their own, we appended them to the baseline to assess their performance as additional features. The entropy features did improve the baseline recognition accuracy slightly: most of the entropy features reduced the WERs to less than 2.00% in the clean environment.

We also appended the dimensional entropy features to the baseline MFCC 0. The results did not show much contribution from the dimensional entropy features, and in some cases they caused slight degradation. The use of dimensional entropy features should enhance the performance of the baseline, but the experimental results showed otherwise.

D. Spectral Entropy in Noisy Environments

We then performed speech recognition with MFCC 0 and entropy features in additive babble noise. The NOISEX 92 noise characterizes the background noise of a real-world environment. The baseline MFCC 0 results from [6] were used as the benchmark. Table III shows the WERs for MFCC 0 appended with entropy features in the additive noise environment.
TABLE III
WORD ERROR RATES (%) FOR MFCC 0 WITH ENTROPY FEATURES IN ADDITIVE BABBLE NOISE

    Feature              Clean   SNR 40  SNR 30  SNR 20  SNR 10  SNR 0
    MFCC 0               2.05    2.33    3.09    17.97   59.93   90.20
    MFCC 0 + Full-band   2.03    2.15    4.98    29.73   64.43   91.83
    MFCC 0 + 2           1.76    1.98    4.41    22.85   55.00   86.83
    MFCC 0 + 3           1.76    1.81    4.53    24.21   61.34   93.14
    MFCC 0 + 4           1.93    2.08    3.34    17.48   51.68   88.42
    MFCC 0 + 5           1.88    2.00    3.54    19.83   59.38   94.41
    MFCC 0 + 6           1.76    1.81    3.42    17.95   52.52   89.48
    MFCC 0 + 8           2.08    2.10    2.97    19.18   55.17   94.90
    MFCC 0 + 12          1.91    2.00    2.45    14.73   51.31   88.25
    MFCC 0 + 13          2.13    2.13    2.72    13.42   50.22   88.04
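Appending entropy features to the baseline, as in Table III, amounts to frame-wise concatenation of the two feature streams before HMM training. A minimal numpy sketch with hypothetical array names (`mfcc0` of shape frames x 13, `entropies` of shape frames x K):

```python
import numpy as np

def append_features(mfcc0, entropies):
    """Frame-wise concatenation of baseline MFCC_0 vectors (rows =
    frames) with sub-band entropy features for the same frames."""
    mfcc0, entropies = np.atleast_2d(mfcc0), np.atleast_2d(entropies)
    assert mfcc0.shape[0] == entropies.shape[0], "frame counts must match"
    return np.hstack([mfcc0, entropies])
```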

All the spectral entropy features contributed to robustness, as shown by the performance at SNR 40dB. Moreover, the entropy features with more sub-bands performed better than those with fewer sub-bands as the SNR decreased. The contribution of sub-band entropy features was most evident for the 12 and 13 sub-band entropies, which showed robustness in additive noise across the different SNR levels. Other sub-bands, such as 2, 4 and 6, were also robust to additive babble noise. The results in Table III demonstrate that MFCC 0 with higher sub-band entropy features outperformed the baseline MFCC 0 features across the different levels of additive noise.

We have also evaluated speech recognition with MFCC 0 and spectral entropy features in reverberant environments. Table IV records the WERs for speech recognition with MFCC 0 and entropy features in reverberant environments. Preliminary results in light reverberation of RT 0.1s and RT 0.2s did not show any significant contribution or robustness from the spectral entropy features. The WERs for the baseline MFCC 0 in reverberant conditions were 4.48% and 7.40% for RT60 of 0.1s and 0.2s respectively. The performance of the entropy features deteriorated greatly as the reverberation level increased.

TABLE IV
WORD ERROR RATES (%) FOR MFCC 0 AND ENTROPY FEATURES IN REVERBERANT ENVIRONMENTS

    Entropy        RT 0.1s   RT 0.2s
    Full-band      3.56      8.89
    2 sub-bands    3.96      11.01
    3 sub-bands    3.42      9.95
    4 sub-bands    4.43      10.82
    5 sub-bands    4.23      9.13
    6 sub-bands    4.41      9.18
    8 sub-bands    5.07      9.53
    12 sub-bands   5.15      8.89
    13 sub-bands   5.22      8.96

One reason is the weakness of spectral entropy in capturing shifted spectral peaks. Figure 2 illustrates the effects of reverberation on the spectral entropy contour. The temporal smearing effect induced by reverberation shifted the formants and the entropy distribution. Thus, spectral entropy failed to perform for reverberant speech recognition.

VI. CONCLUSION

Spectral entropy features have been adopted in speech activity detection and speech recognition. We have investigated the use of spectral entropy as speech features and evaluated them on the TI-DIGIT connected digit database in clean and noisy environments. The spectral entropy features with better resolution, such as 12 and 13 sub-bands, performed better than the other sub-bands. The spectral entropy features alone, however, were not able to surpass the performance of cepstral features such as the baseline MFCC 0. Used as additional features, spectral entropy showed improvements in recognition accuracy and robustness against additive noise when compared with the baseline MFCC 0 features.

Both the analysis and the results showed that entropy features were less affected by additive noise such as babble noise; the entropy contours also demonstrated that the formants were less affected by noise. However, spectral entropy features were not suitable for speech recognition in reverberant environments: both temporal smearing and shifted formants contributed to the poor performance of spectral entropy there.

REFERENCES

[1] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, Oct. 1994.
[2] H. Misra, S. Ikbal, H. Bourlard, and H. Hermansky, "Spectral entropy based feature for robust ASR," in Proc. ICASSP, May 2004, pp. 193-196.
[3] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, July, Oct. 1948.
[4] P. Renevey and A. Drygajlo, "Entropy based voice activity detection in very noisy conditions," in Proc. Eurospeech, USA, Sept. 2001, pp. 1887-1890.
[5] H. Misra, S. Ikbal, S. Sivadas, and H. Bourlard, "Multi-resolution spectral entropy feature for robust ASR," in Proc. ICASSP, March 2005, pp. 253-256.
[6] A. M. Toh, R. Togneri, and S. Nordholm, "Investigation of robust features for speech recognition in hostile environments," in Proc. APCC, 2005.
