
Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task


Todor Ganchev, Nikos Fakotakis, George Kokkinakis

Wire Communications Laboratory,


University of Patras, 26500 Rion-Patras, Greece
[email protected]

Abstract

Making no claim of being exhaustive, we review the most popular MFCC (Mel Frequency Cepstral Coefficients) implementations. These differ mainly in the particular approximation of the nonlinear pitch perception of the human auditory system, the filter bank design, and the compression of the filter bank output. We then perform a comparative evaluation of the presented implementations on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.

1. Introduction

The quest for better speech parameterization has led to various speech features, which were reported to provide an advantage in specific conditions and applications. Moreover, for some speech features, such as the well-known and widely used MFCC, multiple implementations were developed. These implementations differ mainly in the number of filters, the shape of the filters, the way the filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. In addition, the frequency range of interest, the selection of the actual subset, and the number of MFCC coefficients employed in the classification can also differ.

Although a number of studies [1-3] compare various implementations of the MFCC on the speech recognition task, to the authors' present knowledge no such study has been performed on the task of speaker recognition. Since the speech and speaker recognition tasks exploit different aspects of the speech signal, we deem it worthwhile to carry out such a study. Therefore, employing a text-independent speaker verification system, we perform a comparative evaluation of the following implementations:

• MFCC FB-20: introduced in 1980 by Davis and Mermelstein [4]; Davis and Mermelstein assume a sampling frequency of 10 kHz and a speech bandwidth of [0, 4600] Hz.
• MFCC FB-24 HTK: from the Cambridge HMM Toolkit (HTK) described in Young, 1995 [5]; Young uses a filter bank of 24 filters for a speech bandwidth of [0, 8000] Hz (sampling rate ≥ 16 kHz).
• MFCC FB-40: from the Auditory Toolbox for MATLAB [6] written by Slaney in 1998; Slaney assumes a sampling rate of 16 kHz and a speech bandwidth of [133, 6854] Hz.
• HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris, 2004 [3]; Skowronski and Harris assume a sampling rate of 12.5 kHz and a speech bandwidth of [0, 6250] Hz.

The abbreviation FB-nn (Filter Bank), which we append to the designation MFCC (HFCC), gives the number of filters in the filter bank as described by the corresponding authors. Since these implementations assume different sampling rates (and different bandwidths of the speech signal), they are not directly comparable. To resolve this discrepancy, we keep the filter spacing and filter bandwidth as proposed in the original descriptions of these implementations and reduce the number of filters to fit a sampling frequency of 8 kHz. A sampling frequency of 8 kHz is common to all telephone-driven services and is therefore the default for contemporary real-world speaker recognition corpora (for instance Switchboard, NIST SRE data [9], etc.).

2. Implementations of the MFCC parameters

Following the introduction of the MFCC [4], numerous variations and improvements of the original idea were proposed. One of the main reasons for this diversity of implementations is the desire of researchers to follow the progress made in psychoacoustics over the years. For instance, consider the various approximations of the nonlinear pitch perception of the human auditory system. An early approximation, referred to as the Koenig scale, is exactly linear below 1000 Hz and logarithmic above 1000 Hz. It provides a computationally inexpensive representation of the Mel scale, which however is not very precise and deviates significantly from the original scale for frequencies both lower and higher than 1000 Hz. A more precise approximation, suggested by Fant, is:

$$\hat{f}_{mel} = k_{const} \cdot \log_n\left(1 + \frac{f_{lin}}{F_b}\right), \tag{1}$$

where $F_b = 1000$. A specific form of (1), presented in [7]:

$$\hat{f}_{mel} = \frac{1000}{\log_n 2} \cdot \log_n\left(1 + \frac{f_{lin}}{1000}\right) \tag{2}$$

was found to provide a closer approximation of the Mel scale (only for the frequency range [0, 5] kHz) than the Koenig scale. In addition, the formulation (2) is particularly interesting since the values of $\hat{f}_{mel}$ remain unaffected by the choice of the base $n$ of the logarithm. Other approximations of the Mel scale that were derived from (1) make use of the natural or the decimal logarithm, which leads to a different choice of the constant $k_{const}$. The following two representations:

$$\hat{f}_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f_{lin}}{700}\right) \tag{3}$$

$$\hat{f}_{mel} = 1127 \cdot \ln\left(1 + \frac{f_{lin}}{700}\right) \tag{4}$$

are widely used in the various implementations of the MFCC. The formulae (3) and (4), when compared to (2), provide a closer approximation of the Mel scale for frequencies below 1000 Hz, at the price of higher inaccuracy for frequencies above 1000 Hz.
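The Mel-scale approximations (2), (3) and (4) can be compared numerically with a few lines of Python. The following is only a minimal sketch; the function names are illustrative and do not come from any of the reviewed toolkits, and `mel_ln_inverse` is simply the algebraic inverse of (4).

```python
import math

def mel_fant(f_hz, base=2.0):
    """Eq. (2): (1000 / log_n 2) * log_n(1 + f / 1000) for a logarithm of base n."""
    return 1000.0 / math.log(2.0, base) * math.log(1.0 + f_hz / 1000.0, base)

def mel_log10(f_hz):
    """Eq. (3): decimal-logarithm form."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_ln(f_hz):
    """Eq. (4): natural-logarithm form."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_ln_inverse(f_mel):
    """Algebraic inverse of Eq. (4): mels back to Hz."""
    return 700.0 * (math.exp(f_mel / 1127.0) - 1.0)

if __name__ == "__main__":
    # Print the three approximations side by side for a few frequencies.
    for f in (100.0, 500.0, 1000.0, 2000.0, 4000.0):
        print(f"{f:7.1f} Hz  (2): {mel_fant(f):8.2f}  "
              f"(3): {mel_log10(f):8.2f}  (4): {mel_ln(f):8.2f}")
```

Calling `mel_fant` with different values of `base` illustrates the remark that (2) does not depend on the base of the logarithm, while (3) and (4) agree with each other closely because 1127 · ln(10) ≈ 2595.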
2.1. The original MFCC FB-20

In the paradigm introduced by Davis and Mermelstein in 1980 [4], the novel MFCC were designed as a set of parameters decorrelated by the discrete cosine transform, computed through a transformation of the logarithmically compressed filter-output energies. These energies were derived through a perceptually spaced bank of twenty equal height triangular filters applied to the Discrete Fourier Transform (DFT) of the speech signal. In brief, given the $N$-point DFT of the discrete input signal $x(n)$,

$$X(k) = \sum_{n=0}^{N-1} x(n) \cdot \exp\left(\frac{-j 2 \pi n k}{N}\right), \quad k = 0, 1, \ldots, N-1, \tag{5}$$

a filter bank with $M$ equal height triangular filters is constructed. Each of these $M$ equal height filters is defined as:

$$H_i(k) = \begin{cases} 0 & \text{for } k < f_{b_{i-1}} \\ \dfrac{k - f_{b_{i-1}}}{f_{b_i} - f_{b_{i-1}}} & \text{for } f_{b_{i-1}} \le k \le f_{b_i} \\ \dfrac{f_{b_{i+1}} - k}{f_{b_{i+1}} - f_{b_i}} & \text{for } f_{b_i} \le k \le f_{b_{i+1}} \\ 0 & \text{for } k > f_{b_{i+1}} \end{cases}, \quad i = 1, 2, \ldots, M \tag{6}$$

where $i$ stands for the $i$-th filter, $f_{b_i}$ are the boundary points of the filters, and $k = 1, 2, \ldots, N$ corresponds to the $k$-th coefficient of the $N$-point DFT. The boundary points $f_{b_i}$ are expressed in terms of position, which depends on the sampling frequency $F_s$ and the number of points $N$ in the DFT:

$$f_{b_i} = \left(\frac{N}{F_s}\right) \cdot \hat{f}_{mel}^{-1}\left(\hat{f}_{mel}(f_{low}) + i \cdot \frac{\hat{f}_{mel}(f_{high}) - \hat{f}_{mel}(f_{low})}{M+1}\right). \tag{7}$$

Here, the function $\hat{f}_{mel}(\cdot)$ denotes the transformation (4), $f_{low}$ and $f_{high}$ are respectively the low and high boundary frequencies of the entire filter bank, $M$ is the number of filters, and $\hat{f}_{mel}^{-1}$ is the inverse of the transformation (4), formulated as:

$$\hat{f}_{mel}^{-1} = f_{lin} = 700 \cdot \left[\exp\left(\frac{\hat{f}_{mel}}{1127}\right) - 1\right]. \tag{8}$$

Here, and in everything that follows, the sampling frequency $F_s$ and the frequencies $f_{low}$, $f_{high}$, and $f_{lin}$ are in Hz, and $\hat{f}_{mel}$ is in mels. Equation (7) guarantees that the boundary points of the filters are uniformly spaced on the Mel scale. The endpoints of each triangular filter are determined by the centre frequencies of its adjacent filters. Therefore, the bandwidth of the filters is not an independent variable.

The filter bank of Davis and Mermelstein comprises twenty equal height filters, which cover the frequency range [0, 4600] Hz. Unlike (7), the centre frequencies of the first ten filters are linearly spaced between 100 Hz and 1000 Hz, and the next ten have centre frequencies logarithmically spaced between 1000 Hz and 4000 Hz. The choice of centre frequency $f_{c_i}$ for the $i$-th filter can be approximated [3] as:

$$f_{c_i} = \begin{cases} 100 \cdot i, & i = 1, \ldots, 10 \\ f_{c_{10}} \cdot 2^{0.2 (i - 10)}, & i = 11, \ldots, 20 \end{cases} \tag{9}$$

where the centre frequency $f_{c_i}$ is in Hz.

Once the filter bank is constructed, the MFCC parameters are computed [4] as:

$$C_j = \sum_{i=1}^{M} X_i \cdot \cos\left(j \cdot \left(i - \frac{1}{2}\right) \cdot \frac{\pi}{M}\right), \quad \text{with } j = 1, 2, \ldots, J, \tag{10}$$

where $M$ is the number of filters in the filter bank, $J$ is the number of cepstral coefficients which are computed (usually $J < M$), and $X_i$ is formulated as the "log-energy output of the $i$-th filter" [4]. Here, the "log-energy output of the $i$-th filter" is understood as:

$$X_i = \log_{10}\left(\sum_{k=0}^{N-1} |X(k)| \cdot H_i(k)\right), \quad i = 1, 2, \ldots, M. \tag{11}$$

The log-energy output $X_i$ of each filter is derived through the magnitude spectrum of (5) and the filter bank (6). Note that since $X_i$ is derived through the magnitude spectrum, and not through the power spectrum, it does not comply with Parseval's definition of energy as a sum of squared terms. Nevertheless, this definition of energy is used in most MFCC implementations.

2.2. The HTK MFCC FB-24

Another widely used implementation of the MFCC was provided in the framework of the Cambridge Hidden Markov Models (HMM) Toolkit [5], known as HTK. The designation HTK MFCC FB-24 reflects the number of filters $M = 24$ recommended by Young for a speech bandwidth of 8 kHz.

The HTK MFCC FB-24 makes use of the definition (3) of the Mel frequency. In this implementation, the limits of the frequency range are the parameters that define the basis for the filter bank design. Specifically, the lower and the higher boundaries of the frequency range of the entire filter bank, $\hat{f}_{low}$ and $\hat{f}_{high}$ respectively, determine the unit interval $\Delta\hat{f}$:

$$\Delta\hat{f} = \frac{\hat{f}_{high} - \hat{f}_{low}}{M + 1}, \tag{12}$$

which serves as the step in the definition of the centre frequencies of the individual filters. The centre frequency $\hat{f}_{c_i}$ of the $i$-th filter is given by:

$$\hat{f}_{c_i} = \hat{f}_{low} + i \cdot \Delta\hat{f}, \quad i = 1, \ldots, M - 1, \tag{13}$$

where $M$ is the total number of filters in the filter bank. The conversion of the centre frequencies of the filters to linear frequency (Hz) is given by:

$$f_{c_i} = 700 \cdot \left(10^{\hat{f}_{c_i} / 2595} - 1\right). \tag{14}$$

In HTK, as in the original MFCC FB-20 [4], a filter bank of equal height filters is used. The shape of the individual triangular filters is defined by (6).

The HTK MFCC FB-24 parameters are computed as follows. The DFT $X(k)$ (5), computed for the discrete input signal $x(n)$, is used for computing the magnitude spectrum $|X(k)|$, which acts as input for the filter bank $H_i(k)$ (6). Next, the filter bank output is logarithmically compressed:

$$X_i = \ln\left(\sum_{k=0}^{N-1} |X(k)| \cdot H_i(k)\right), \tag{15}$$

and then decorrelated by the DCT (10) to provide the HTK MFCC FB-24 parameters.
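The processing chain of Section 2.2 (DFT magnitude, Mel-spaced equal height triangular filters, logarithmic compression, DCT) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not HTK's actual code: all function names are hypothetical, the direct DFT is O(N²) and serves readability only, the M + 2 boundary points are taken at i = 0, ..., M + 1 steps of (12) and scaled to DFT bin positions by N / Fs as in (7), and the small floor inside the logarithm is an added numerical safeguard. Pre-emphasis, windowing and liftering are deliberately left out.

```python
import cmath
import math

def hz_to_mel(f_hz):
    """Eq. (3)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Eq. (14), the inverse of Eq. (3)."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def boundary_bins(f_low, f_high, num_filters, n_fft, fs):
    """M + 2 boundary points, uniform on the Mel scale, as fractional DFT bins."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (num_filters + 1)        # Eq. (12)
    return [n_fft / fs * mel_to_hz(mel_low + i * step)     # Eqs. (13), (14), scaled by N / Fs
            for i in range(num_filters + 2)]

def triangle(k, left, centre, right):
    """Equal height triangular weight of Eq. (6) at DFT bin k."""
    if k < left or k > right:
        return 0.0
    if k <= centre:
        return (k - left) / (centre - left)
    return (right - k) / (right - centre)

def magnitude_spectrum(frame):
    """|X(k)| from the direct DFT of Eq. (5); O(N^2), for illustration only."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * t * k / n)
                    for t, x in enumerate(frame)))
            for k in range(n)]

def mfcc_sketch(frame, fs, num_filters=20, num_ceps=12, f_low=0.0, f_high=4000.0):
    """Cepstral coefficients C_1..C_J for one frame of samples."""
    n = len(frame)
    mag = magnitude_spectrum(frame)
    fb = boundary_bins(f_low, f_high, num_filters, n, fs)
    log_out = []
    for i in range(1, num_filters + 1):
        acc = sum(mag[k] * triangle(k, fb[i - 1], fb[i], fb[i + 1]) for k in range(n))
        log_out.append(math.log(max(acc, 1e-12)))          # Eq. (15); the floor is an assumption
    return [sum(x * math.cos(j * (i - 0.5) * math.pi / num_filters)
                for i, x in enumerate(log_out, start=1))
            for j in range(1, num_ceps + 1)]                # Eq. (10)

if __name__ == "__main__":
    fs = 8000.0
    frame = [math.sin(2.0 * math.pi * 1000.0 * t / fs) for t in range(256)]
    print(mfcc_sketch(frame, fs))
```

Replacing `math.log` by `math.log10` in `mfcc_sketch` gives the compression of (11) instead of (15); the rest of the chain is unchanged.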
2.3. The MFCC FB-40

The MFCC FB-40 speech features were described in Slaney's Auditory Toolbox [6]. Assuming a sampling frequency of 16 kHz, Slaney implemented a filter bank of 40 equal area filters, which cover the frequency range [133, 6854] Hz. The centre frequencies of the first 13 filters are linearly spaced in the range [200, 1000] Hz with a step of 66.67 Hz, and those of the next 27 are logarithmically spaced in the range [1071, 6400] Hz with a step logStep = 1.0711703, computed as:

$$logStep = \exp\left(\frac{\ln\left(f_{c_{40}} / 1000\right)}{numLogFilt}\right). \tag{16}$$

Here $f_{c_{40}} = 6400$ Hz is the centre frequency of the last of the logarithmically spaced filters, and $numLogFilt = 27$ is the number of logarithmically spaced filters. Each of these equal area triangular filters is defined as:

$$H_i(k) = \begin{cases} 0 & \text{for } k < f_{b_{i-1}} \\ \dfrac{2 \left(k - f_{b_{i-1}}\right)}{\left(f_{b_i} - f_{b_{i-1}}\right)\left(f_{b_{i+1}} - f_{b_{i-1}}\right)} & \text{for } f_{b_{i-1}} \le k \le f_{b_i} \\ \dfrac{2 \left(f_{b_{i+1}} - k\right)}{\left(f_{b_{i+1}} - f_{b_i}\right)\left(f_{b_{i+1}} - f_{b_{i-1}}\right)} & \text{for } f_{b_i} \le k \le f_{b_{i+1}} \\ 0 & \text{for } k > f_{b_{i+1}} \end{cases}, \tag{17}$$

where $i = 1, 2, \ldots, M$ stands for the $i$-th filter, $f_{b_i}$ are the $M + 2$ boundary points that specify the $M$ filters, and $k = 1, 2, \ldots, N$ corresponds to the $k$-th coefficient of the $N$-point DFT. The boundary points $f_{b_i}$ are expressed in terms of position, as specified above. The key to equalizing the area below the filters (17) lies in the term:

$$\frac{2}{f_{b_{i+1}} - f_{b_{i-1}}}. \tag{18}$$

Due to the term (18), the filter bank (17) is normalized in such a way that the sum of coefficients for every filter equals one. Thus, the $i$-th filter satisfies:

$$\sum_{k=1}^{N} H_i(k) = 1, \quad \text{for } i = 1, 2, \ldots, M. \tag{19}$$

Next, the equal area filter bank (17) is employed in the computation of the log-energy output (11). Finally, the DCT (10) provides the MFCC FB-40 parameters.

2.4. The HFCC-E FB-29

The Human Factor Cepstral Coefficients (HFCC), introduced in 2004 by Skowronski and Harris [3], provide the most recent update of the MFCC filter bank. Assuming a sampling frequency of 12.5 kHz, Skowronski and Harris proposed the HFCC-E filter bank composed of 29 Mel-warped equal height filters, which cover the frequency range [0, 6250] Hz. The most significant difference between the HFCC and the earlier MFCC is that in HFCC-E the filter bandwidth is decoupled from the filter spacing. Specifically, the filter bandwidth in the HFCC-E is derived from the equivalent rectangular bandwidth (ERB) introduced by Moore and Glasberg [8]:

$$ERB = 6.23 \cdot 10^{-6} \cdot f_c^2 + 93.39 \cdot 10^{-3} \cdot f_c + 28.52, \tag{20}$$

where $f_c$ is the centre frequency of the individual filters in Hz. The filter bandwidth (20) is further scaled by a constant, which Skowronski and Harris labeled the E-factor.

In brief, the HFCC filter bank design [3] consists of the following steps. First, the low $f_{low}$ and high $f_{high}$ boundaries of the entire filter bank and the number $M$ of filters are chosen. The centre frequencies $f_{c_1}$ and $f_{c_M}$ of the first and the last filter, respectively, are computed as:

$$f_{c_i} = \frac{1}{2} \cdot \left(-\bar{b} + \sqrt{\bar{b}^2 - 4 \cdot \bar{c}}\right), \tag{21}$$

where the index $i$ is either 1 or $M$, and $\bar{b}$, $\bar{c}$, defined as:

$$\bar{b} = \frac{b - \hat{b}}{a - \hat{a}} \quad \text{and} \quad \bar{c} = \frac{c - \hat{c}}{a - \hat{a}}, \tag{22}$$

receive different values for the two cases. The values $a$, $b$, $c$ are those from (20): $6.23 \cdot 10^{-6}$, $93.39 \cdot 10^{-3}$, and 28.52, respectively. For the first filter, the values of the coefficients $\hat{a}$, $\hat{b}$, $\hat{c}$ are computed as:

$$\hat{a} = \frac{1}{2} \cdot \frac{1}{700 + f_{low}}, \quad \hat{b} = \frac{700}{700 + f_{low}}, \quad \hat{c} = -\frac{f_{low}}{2} \cdot \left(1 + \frac{700}{700 + f_{low}}\right). \tag{23}$$

For the last filter these are:

$$\hat{a} = -\frac{1}{2} \cdot \frac{1}{700 + f_{high}}, \quad \hat{b} = -\frac{700}{700 + f_{high}}, \quad \hat{c} = \frac{f_{high}}{2} \cdot \left(1 + \frac{700}{700 + f_{high}}\right). \tag{24}$$

Once the centre frequencies of the first and the last filter are computed, the centre frequencies of the filters situated between them are easily calculated, since they are equidistant on the Mel scale. The step $\Delta\hat{f}$ between the centre frequencies of adjacent filters is computed as:

$$\Delta\hat{f} = \frac{\hat{f}_{c_M} - \hat{f}_{c_1}}{M - 1}, \tag{25}$$

where all the frequencies are in mels. The conversions $f_{c_1} \rightarrow \hat{f}_{c_1}$ and $f_{c_M} \rightarrow \hat{f}_{c_M}$ are given by (3). Having $\Delta\hat{f}$, the centre frequencies $\hat{f}_{c_i}$ are computed as:

$$\hat{f}_{c_i} = \hat{f}_{c_1} + (i - 1) \cdot \Delta\hat{f}, \quad \text{for } i = 2, \ldots, M - 1. \tag{26}$$

Next, through (14), the reverse transformation $\hat{f}_{c_i} \rightarrow f_{c_i}$ is performed, and through (20) the $ERB_i$ for each $f_{c_i}$ is computed. Finally, the low and high frequencies $f_{low_i}$ and $f_{high_i}$, respectively, of the $i$-th filter are derived through:

$$f_{low_i} = -(700 + ERB_i) + \sqrt{(700 + ERB_i)^2 + f_{c_i} \left(f_{c_i} + 1400\right)}, \tag{27}$$

$$f_{high_i} = f_{low_i} + 2 \cdot ERB_i. \tag{28}$$

With all parameters computed through (21) to (28), the design of the HFCC-E filter bank is completed.

Finally, as in the MFCC FB-20 of Davis and Mermelstein, the log-energy filter bank outputs are computed (11), and then (10) is applied to decorrelate the HFCC-E FB-29 parameters.

3. Experiments and results

The MFCC implementations outlined in Section 2 were evaluated on the 2001 NIST SRE database by means of the PNN-based text-independent speaker verification system [10]. A common protocol was followed in all experiments according to the rules described in the 2001 NIST SRE Plan [9]. In brief, approximately 40 seconds of voiced speech were detected (in two-minute recordings) for training the target models. The common reference model was created by exploiting the male
training speech available in the 2002 NIST SRE database. Approximately one hour and forty minutes of voiced speech was available for that purpose. After training, the user models were tested by carrying out all male trials as defined in the complete one-speaker detection task. Each experiment comprised 850 target and 8500 impostor trials with a duration from 0 to 60 seconds of speech.

To accommodate the sampling rate of 8 kHz, we excluded from all filter banks the filters which spread beyond the 4 kHz border. Thus, in the experiments with the MFCC FB-20 of Davis and Mermelstein we used 19 filters: ten with linearly spaced centre frequencies and nine with logarithmically spaced ones. Following the instructions in [5], we used a filter bank of 20 filters for computing the HTK MFCC FB-24 features. In the experiment with Slaney's MFCC FB-40, we kept the first 32 filters, which cover the frequency range [133, 3954] Hz. Finally, in the experiment with the HFCC-E FB-29 (using E-factor E=1) we tested various numbers of filters (19, 24, 29) to cover the frequency range [0, 4000] Hz. In all experiments, the full number of cepstral coefficients, except the first one, was employed. Cepstral mean subtraction and dynamic range normalization were used for all speech features.

Table 1 presents the experimental results. As expected, there is no significant difference among the results for the MFCC FB-24 HTK, Slaney's MFCC FB-40, and the HFCC-E FB-29 (with 29 filters in the range [0, 4000] Hz). Next, the MFCC FB-20 of Davis and Mermelstein performed slightly worse, and finally, the HFCC-E FB-29 features with 24 and 19 filters provided the highest Equal Error Rate (EER). Assuming a filter bank of 24 filters for the frequency range [0, 4000] Hz, Skowronski and Harris [3] suggested 29 filters for the frequency range [0, 6250] Hz. However, the speaker verification results demonstrated that 29 filters (in the frequency range [0, 4000] Hz) provide a lower EER than 24 or 19 filters. We deem that the reason for this lies (at least in part) in the unsuitable overlap between the first few filters in the HFCC-E filter bank, especially when the number of filters is low. This results in poor frequency resolution at low frequencies. In addition, examining the results for MFCC FB-40 and HFCC-E FB-29, it seems that more filters in the filter bank provide better speaker differentiation. The only exception here is the result of HTK MFCC FB-24; apparently, other factors influence the speaker verification performance as well.

To study the importance of the E-factor we experimented with various values. In experiments with a filter bank of 24 filters for the frequency range [0, 4000] Hz, it was observed that E=1 provides the lowest EER. Table 2 presents results for the best HFCC-E FB-29, with 29 filters in the frequency range [0, 4000] Hz. For this filter bank it was found that E=0.5 provides the lowest EER. Deviating from E=0.5 in either direction increases the EER. We deem the reason to be that for lower values of the E-factor, the filters with the lowest centre frequencies barely overlap, and thus the filter bank resolution for these frequencies is low; threshold phenomena were observed. For higher values of E, the filters are very broad and thus smooth out some details in the spectrum which are important for speaker differentiation. In addition, in the HFCC-E scheme the filters with the highest centre frequencies overlap widely. Each filter overlaps not only with its immediate neighbours but also with more distant ones. This was reported to be useful for speech recognition [3], but does not favour the speaker recognition task. The MFCC FB-40 and MFCC FB-24 HTK were found to provide the lowest decision cost.

Table 1. The Equal Error Rate (EER) and normalized optimal Decision Cost Function (DCFopt) for various MFCC implementations

Speech Features        # filters for [0, 4000] Hz    DCFopt    EER [%]
MFCC FB-20 D&M         19                            0.554     14.00
MFCC FB-24 HTK         20                            0.538     13.76
MFCC FB-40 Slaney      32                            0.541     13.65
HFCC-E FB-29 S&H       19                            0.638     15.41
HFCC-E FB-29 S&H       24                            0.640     14.71
HFCC-E FB-29 S&H       29                            0.592     13.65

Table 2. The Equal Error Rate (EER) and normalized optimal Decision Cost Function (DCFopt) for HFCC-E FB-29 with different E-factors

Speech Features        E-factor    DCFopt    EER [%]
HFCC-E FB-29           E=0.25      0.604     14.00
HFCC-E FB-29           E=0.35      0.589     13.77
HFCC-E FB-29           E=0.50      0.567     13.06
HFCC-E FB-29           E=1.00      0.592     13.65
HFCC-E FB-29           E=1.50      0.585     14.12
HFCC-E FB-29           E=2.00      0.609     14.45

4. Conclusions

A comparative evaluation of various MFCC implementations was performed. As expected, the speaker verification performance did not vary greatly when different approximations of the nonlinear pitch perception of the human auditory system were used. However, some observations suggest that, regardless of the specific filter bank design, a larger number of filters favours the speaker detection performance. Besides the number of filters in the filter bank, the overlap among neighbouring filters also proved to be a sensitive parameter. Increasing or decreasing the overlap beyond a given range increases the error rates.

5. References

[1] Zheng, F., Zhang, G., Song, Z., "Comparison of different implementations of MFCC", J. Computer Science & Technology, 16(6):582-589, Sept. 2001.
[2] Shannon, B.J., Paliwal, K.K., "A comparative study of filter bank spacing for speech recognition", Proc. of Microelectronic engineering research conference, Brisbane, Australia, Nov. 2003.
[3] Skowronski, M.D., Harris, J.G., "Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition", Journal of the Acoustical Society of America, 116(3):1774-1780, Sept. 2004.
[4] Davis, S.B., Mermelstein, P., "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. on Acoustic, Speech and Signal Processing, 28(4):357-366, 1980.
[5] Young, S.J., Odell, J., Ollason, D., Valtchev, V., Woodland, P., "The HTK Book. Version 2.1", Department of Engineering, Cambridge University, UK, 1995.
[6] Slaney, M., "Auditory Toolbox. Version 2", Technical Report #1998-010, Interval Research Corporation, 1998.
[7] Fant, G., Speech Sounds and Features. The MIT Press, Cambridge, MA, USA, 1973.
[8] Moore, B.C.J., Glasberg, B.R., "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns", Journal of the Acoustical Society of America, 74(3):750-753, 1983.
[9] "The NIST Year 2001 Speaker Recognition Evaluation Plan", The NIST of USA, 2001. Available: http://www.nist.gov/speech/tests/spk/2001/doc/2001-spkrec-evalplan-v05.9.pdf.
[10] Ganchev, T., Fakotakis, N., Kokkinakis, G., "Text-Independent Speaker Verification Based on Probabilistic Neural Networks", Proc. of Acoustics, Patras, Greece, 2002, 159-166.
