MFCCs in Speech Recognition PDF
Abstract
Making no claim of being exhaustive, we review the most popular MFCC (Mel Frequency Cepstral Coefficients) implementations. These differ mainly in the particular approximation of the nonlinear pitch perception of humans, the filter bank design, and the compression of the filter bank output. Then, a comparative evaluation of the presented implementations is performed on the task of text-independent speaker verification, by means of the well-known 2001 NIST SRE (speaker recognition evaluation) one-speaker detection database.

1. Introduction
The quest for better speech parameterization led to various speech features, which were reported to provide an advantage in specific conditions and applications. Moreover, for some speech features, such as the well-known and widely used MFCC, multiple implementations were developed. These implementations differ mainly in the number of filters, the shape of the filters, the way the filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. In addition, the frequency range of interest, the selection of the actual subset, and the number of MFCC coefficients employed in the classification can also differ.
Although there are a number of studies [1-3] that compare various implementations of the MFCC on the speech recognition task, to the best of the authors' knowledge no such study has been performed on the task of speaker recognition. Since the speech and speaker recognition tasks exploit different aspects of the speech signal, we deem it worthwhile to carry out such a study. Therefore, employing a text-independent speaker verification system, we perform a comparative evaluation of the following implementations:
• MFCC FB-20 – introduced in 1980 by Davis and Mermelstein [4]; Davis and Mermelstein assume a sampling frequency of 10 kHz and speech bandwidth [0, 4600] Hz.
• MFCC FB-24 HTK – from the Cambridge HMM Toolkit (HTK) described in Young, 1995 [5]; Young uses a filter bank of 24 filters for speech bandwidth [0, 8000] Hz (sampling rate ≥ 16 kHz).
• MFCC FB-40 – from the Auditory Toolbox for MATLAB [6] written by Slaney in 1998; Slaney assumes a sampling rate of 16 kHz and speech bandwidth [133, 6854] Hz.
• HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris, 2004 [3]; Skowronski and Harris assume a sampling rate of 12.5 kHz and speech bandwidth [0, 6250] Hz.
The abbreviation FB-nn (Filter Bank), which we append to the designation MFCC (HFCC), gives the number of filters in the filter bank as described by the corresponding authors. Since these implementations assume different sampling rates (and different bandwidths of the speech signal), they are not directly comparable. To resolve that discrepancy, keeping the filter spacing and filter bandwidth as proposed in the original description of these implementations, we reduce the number of filters to adapt each filter bank to a sampling frequency of 8 kHz. A sampling frequency of 8 kHz is common for all telephone-driven services, and thus it is the default for contemporary real-world speaker recognition corpora (for instance: Switchboard, NIST SRE data [9], etc.).

2. Implementations of the MFCC parameters
Following the introduction of the MFCC [4], numerous variations and improvements of the original idea were proposed. One of the main reasons for such diversity of implementations is the desire of researchers to follow the progress made in the area of psychoacoustics over the years. For instance, consider the various approximations of the nonlinear pitch perception of the human auditory system. An early approximation, referred to as the Koenig scale, is exactly linear below 1000 Hz and logarithmic above 1000 Hz. It provides a computationally inexpensive representation of the Mel scale, which however is not very precise and deviates significantly from the original scale for frequencies both lower and higher than 1000 Hz. A more precise approximation, suggested by Fant, is:

$$ \hat{f}_{mel} = k_{const} \cdot \log_n\left(1 + \frac{f_{lin}}{F_b}\right), \tag{1} $$

where $F_b = 1000$. A specific form of (1), presented in [7]:

$$ \hat{f}_{mel} = \frac{1000}{\log_n 2} \cdot \log_n\left(1 + \frac{f_{lin}}{1000}\right) \tag{2} $$

was found to provide a closer approximation of the Mel scale (only for the frequency range of [0, 5] kHz) than the Koenig scale. In addition, the formulation (2) is particularly interesting since the values of $\hat{f}_{mel}$ remain unaffected by the choice of the base $n$ of the logarithm. Other approximations of the Mel scale that were derived from (1) make use of the natural or decimal logarithm, which leads to a different choice of the constant $k_{const}$. The following two representations:

$$ \hat{f}_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f_{lin}}{700}\right) \tag{3} $$

$$ \hat{f}_{mel} = 1127 \cdot \ln\left(1 + \frac{f_{lin}}{700}\right) \tag{4} $$

are widely used in the various implementations of the MFCC. The formulae (3) and (4), when compared to (2), provide a closer approximation of the Mel scale for frequencies below 1000 Hz, at the price of higher inaccuracy for frequencies higher than 1000 Hz.
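As a rough numerical illustration of formulae (2), (3) and (4), the following Python sketch evaluates the three approximations side by side. The function names are hypothetical and chosen only for this example; they do not come from any of the reviewed toolkits.

```python
import math


def mel_fant(f_hz, base=10.0):
    """Formula (2): (1000 / log_n 2) * log_n(1 + f / 1000), for any base n."""
    return 1000.0 / math.log(2.0, base) * math.log(1.0 + f_hz / 1000.0, base)


def mel_log10(f_hz):
    """Formula (3): decimal-logarithm form."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_ln(f_hz):
    """Formula (4): natural-logarithm form."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


if __name__ == "__main__":
    # Tabulate the three approximations at a few illustrative frequencies.
    for f in (100.0, 500.0, 1000.0, 2000.0, 4000.0):
        print(f, mel_fant(f), mel_log10(f), mel_ln(f))
```

Since 2595 / ln 10 is very close to 1127, formulae (3) and (4) describe practically the same curve, whereas (2) maps 1000 Hz to exactly 1000 mels for any base of the logarithm.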
2.1. The original MFCC FB-20
In the paradigm introduced by Davis and Mermelstein in 1980 [4], the novel MFCC were designed as a set of parameters decorrelated by the discrete cosine transform, computed through a transformation of the logarithmically compressed filter-output energies. These energies were derived through a perceptually spaced bank of twenty equal height triangular filters that are applied to the Discrete Fourier Transform (DFT) of the speech signal. In brief, given the $N$-point DFT of the discrete input signal $x(n)$,

$$ X(k) = \sum_{n=0}^{N-1} x(n) \cdot \exp\left(\frac{-j 2 \pi n k}{N}\right), \quad k = 0, 1, \ldots, N-1, \tag{5} $$

a filter bank with $M$ equal height triangular filters is constructed. Each of these $M$ equal height filters is defined as:

$$
H_i(k) =
\begin{cases}
0 & \text{for } k < f_{b_{i-1}} \\
\dfrac{k - f_{b_{i-1}}}{f_{b_i} - f_{b_{i-1}}} & \text{for } f_{b_{i-1}} \le k \le f_{b_i} \\
\dfrac{f_{b_{i+1}} - k}{f_{b_{i+1}} - f_{b_i}} & \text{for } f_{b_i} \le k \le f_{b_{i+1}} \\
0 & \text{for } k > f_{b_{i+1}}
\end{cases}
, \quad i = 1, 2, \ldots, M \tag{6}
$$

where $i$ stands for the $i$-th filter, $f_{b_i}$ are the boundary points of the filters, and $k = 1, 2, \ldots, N$ corresponds to the $k$-th coefficient of the $N$-point DFT. The boundary points $f_{b_i}$ are expressed in terms of position, which depends on the sampling frequency $F_s$ and the number of points $N$ in the DFT:

$$ f_{b_i} = \left(\frac{N}{F_s}\right) \cdot \hat{f}_{mel}^{-1}\left( \hat{f}_{mel}(f_{low}) + i \cdot \frac{\hat{f}_{mel}(f_{high}) - \hat{f}_{mel}(f_{low})}{M+1} \right). \tag{7} $$

Here, the function $\hat{f}_{mel}(\cdot)$ denotes the transformation (4), $f_{low}$ and $f_{high}$ are respectively the low and high boundary frequencies for the entire filter bank, $M$ is the number of filters, and $\hat{f}_{mel}^{-1}$ is the inverse of the transformation (4), formulated as:

$$ \hat{f}_{mel}^{-1} = f_{lin} = 700 \cdot \left[ \exp\left(\frac{\hat{f}_{mel}}{1127}\right) - 1 \right]. \tag{8} $$

Here, and in everything that follows, the sampling frequency $F_s$ and the frequencies $f_{low}$, $f_{high}$, and $f_{lin}$ are in Hz, and $\hat{f}_{mel}$ is in mels. Equation (7) guarantees that the boundary points of the filters are uniformly spaced on the Mel scale. The endpoints of each of the triangular filters are determined by the centre frequencies of its adjacent filters. Therefore, the bandwidth of the filters is not an independent variable.
The filter bank of Davis and Mermelstein comprises twenty equal height filters, which cover the frequency range [0, 4600] Hz. Unlike (7), the centre frequencies of the first ten filters are linearly spaced between 100 Hz and 1000 Hz, and the next ten have centre frequencies logarithmically spaced between 1000 Hz and 4000 Hz. The choice of centre frequency $f_{c_i}$ for the $i$-th filter can be approximated [3] as:

$$
f_{c_i} =
\begin{cases}
100 \cdot i, & i = 1, \ldots, 10 \\
f_{c_{10}} \cdot 2^{0.2 (i - 10)}, & i = 11, \ldots, 20
\end{cases}
\tag{9}
$$

where the centre frequency $f_{c_i}$ is in Hz.
Once the filter bank has been constructed, the MFCC parameters are computed [4] as:

$$ C_j = \sum_{i=1}^{M} X_i \cdot \cos\left( j \cdot \left(i - \tfrac{1}{2}\right) \cdot \frac{\pi}{M} \right), \quad \text{with } j = 1, 2, \ldots, J, \tag{10} $$

where $M$ is the number of filters in the filter bank, $J$ is the number of cepstral coefficients that are computed (usually $J < M$), and $X_i$ is formulated as the "log-energy output of the $i$-th filter" [4]. Here, the "log-energy output of the $i$-th filter" is understood as:

$$ X_i = \log_{10}\left( \sum_{k=0}^{N-1} |X(k)| \cdot H_i(k) \right), \quad i = 1, 2, \ldots, M. \tag{11} $$

The log-energy output $X_i$ of each filter is derived through the magnitude spectrum of (5) and the filter bank (6). Note that since $X_i$ is derived through the magnitude spectrum, and not through the power spectrum, it does not comply with Parseval's definition of energy as a sum of squared terms. Nevertheless, this definition of energy is used in most of the MFCC implementations.

2.2. The HTK MFCC FB-24
Another widely used implementation of the MFCC was provided in the framework of the Cambridge Hidden Markov Models (HMM) Toolkit [5], known as HTK. The designation HTK MFCC FB-24 reflects the number of filters $M = 24$ recommended by Young for a speech bandwidth of 8 kHz.
The HTK MFCC FB-24 makes use of the definition (3) of the Mel frequency. In this implementation, the limits of the frequency range are the parameters that define the basis for the filter bank design. Specifically, the lower and the higher boundaries of the frequency range of the entire filter bank, $\hat{f}_{low}$ and $\hat{f}_{high}$ respectively, determine the computation of the unit interval $\Delta\hat{f}$:

$$ \Delta\hat{f} = \frac{\hat{f}_{high} - \hat{f}_{low}}{M + 1}, \tag{12} $$

which serves as the step in the definition of the centre frequencies of the individual filters. The centre frequency $\hat{f}_{c_i}$ of the $i$-th filter is given by:

$$ \hat{f}_{c_i} = \hat{f}_{low} + i \cdot \Delta\hat{f}, \quad i = 1, \ldots, M - 1, \tag{13} $$

where $M$ is the total number of filters in the filter bank. The conversion of the centre frequencies of the filters to linear frequency (Hz) is given by:

$$ f_{c_i} = 700 \cdot \left( 10^{\hat{f}_{c_i} / 2595} - 1 \right). \tag{14} $$

In HTK, similarly to the filter bank of the original MFCC FB-20 [4], a filter bank of equal height filters is used. The shape of the individual triangular filters is defined by (6).
The HTK MFCC FB-24 parameters are computed as follows. The DFT $X(k)$ (5), computed for the discrete input signal $x(n)$, is used for computing the magnitude spectrum $|X(k)|$, which acts as input for the filter bank $H_i(k)$ (6). Next, the filter bank output is logarithmically compressed:

$$ X_i = \ln\left( \sum_{k=0}^{N-1} |X(k)| \cdot H_i(k) \right), \tag{15} $$

and then decorrelated by the DCT (10) to provide the HTK MFCC FB-24 parameters.
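The processing chain described for the HTK-style parameters (Mel-spaced boundary points, equal height triangles (6), logarithmic compression (15), DCT (10)) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the HTK code itself: the function names are hypothetical, the DFT is the naive form of (5), the grid index runs over 0, ..., M + 1 so that the two outer points serve as edges of the first and last triangles, and a small floor is applied before the logarithm to avoid log(0).

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)        # formula (3)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)         # formula (14)


def boundary_bins(n_filters, n_fft, fs, f_low, f_high):
    """M + 2 points, equidistant in mels, expressed as DFT-bin positions."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)          # formula (12)
    return [n_fft / fs * mel_to_hz(mel_low + i * step)     # grid of (13), then (14)
            for i in range(n_filters + 2)]


def triangle(k, left, centre, right):
    """Equal height triangular weight of formula (6) at DFT bin k."""
    if k < left or k > right:
        return 0.0
    if k <= centre:
        return (k - left) / (centre - left)
    return (right - k) / (right - centre)


def magnitude_spectrum(x):
    """|X(k)| from the N-point DFT of formula (5); naive O(N^2) evaluation."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * t * k / n)
                    for t in range(n)))
            for k in range(n)]


def mfcc_htk_like(x, fs, n_filters=24, n_ceps=12, f_low=0.0, f_high=None):
    """Cepstral coefficients C_1..C_J for one frame x (no windowing, no liftering)."""
    f_high = fs / 2.0 if f_high is None else f_high
    n = len(x)
    mag = magnitude_spectrum(x)
    fb = boundary_bins(n_filters, n, fs, f_low, f_high)
    log_out = []
    for i in range(1, n_filters + 1):
        energy = sum(mag[k] * triangle(k, fb[i - 1], fb[i], fb[i + 1])
                     for k in range(n))
        # Formula (15); the 1e-12 floor is an assumption of this sketch.
        log_out.append(math.log(max(energy, 1e-12)))
    return [sum(log_out[i - 1] * math.cos(j * (i - 0.5) * math.pi / n_filters)
                for i in range(1, n_filters + 1))
            for j in range(1, n_ceps + 1)]                 # formula (10)
```

Replacing `math.log` with `math.log10` in the compression step gives the variant of formula (11). Because the cosine terms of (10) sum to zero over i for j = 1, ..., M − 1, scaling the input frame by a constant gain leaves these coefficients unchanged, as long as the floor is not reached.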
2.3. The MFCC FB-40
The MFCC FB-40 speech features were described in Slaney's Auditory Toolbox [6]. Assuming a sampling frequency of 16 kHz, Slaney implemented a filter bank of 40 equal area filters, which cover the frequency range [133, 6854] Hz. The centre frequencies of the first 13 of them are linearly spaced in the range [200, 1000] Hz with a step of 66.67 Hz, and those of the next 27 are logarithmically spaced in the range [1071, 6400] Hz with a step logStep = 1.0711703, computed as:

$$ logStep = \exp\left( \frac{\ln\left( f_{c_{40}} / 1000 \right)}{numLogFilt} \right). \tag{16} $$

Here $f_{c_{40}} = 6400$ Hz is the centre frequency of the last of the logarithmically spaced filters, and $numLogFilt = 27$ is the number of logarithmically spaced filters. Each of these equal area triangular filters is defined as:

$$
H_i(k) =
\begin{cases}
0 & \text{for } k < f_{b_{i-1}} \\
\dfrac{2 \left( k - f_{b_{i-1}} \right)}{\left( f_{b_i} - f_{b_{i-1}} \right)\left( f_{b_{i+1}} - f_{b_{i-1}} \right)} & \text{for } f_{b_{i-1}} \le k \le f_{b_i} \\
\dfrac{2 \left( f_{b_{i+1}} - k \right)}{\left( f_{b_{i+1}} - f_{b_i} \right)\left( f_{b_{i+1}} - f_{b_{i-1}} \right)} & \text{for } f_{b_i} \le k \le f_{b_{i+1}} \\
0 & \text{for } k > f_{b_{i+1}}
\end{cases}
, \tag{17}
$$

where $i = 1, 2, \ldots, M$ stands for the $i$-th filter, $f_{b_i}$ are the $M + 2$ boundary points that specify the $M$ filters, and $k = 1, 2, \ldots, N$ corresponds to the $k$-th coefficient of the $N$-point DFT. The boundary points $f_{b_i}$ are expressed in terms of position, as specified above. The key to equalization of the area below the filters (17) lies in the term:

$$ \frac{2}{f_{b_{i+1}} - f_{b_{i-1}}}. \tag{18} $$

Due to the term (18), the filter bank (17) is normalized in such a way that the sum of coefficients for every filter equals one.

In brief, the HFCC filter bank design [3] consists of the following steps. First, the low $f_{low}$ and high $f_{high}$ boundaries of the entire filter bank and the number $M$ of filters are chosen. The centre frequencies $f_{c_1}$ and $f_{c_M}$ of the first and the last of the filters, respectively, are computed as:

$$ f_{c_i} = \frac{1}{2} \cdot \left( -\bar{b} + \sqrt{\bar{b}^2 - 4 \cdot \bar{c}} \right), \tag{21} $$

where the index $i$ is either 1 or $M$, and $\bar{b}$, $\bar{c}$, defined as:

$$ \bar{b} = \frac{b - \hat{b}}{a - \hat{a}} \quad \text{and} \quad \bar{c} = \frac{c - \hat{c}}{a - \hat{a}} \tag{22} $$

receive different values for the two cases. The values $a$, $b$, $c$ are those from (20): $6.23 \cdot 10^{-6}$, $93.39 \cdot 10^{-3}$, and $28.52$, respectively. For the first filter, the values of the coefficients $\hat{a}$, $\hat{b}$, $\hat{c}$ are computed as:

$$ \hat{a} = \frac{1}{2} \cdot \frac{1}{700 + f_{low}}, \quad \hat{b} = \frac{700}{700 + f_{low}}, \quad \hat{c} = -\frac{f_{low}}{2} \cdot \left( 1 + \frac{700}{700 + f_{low}} \right). \tag{23} $$

For the last filter these are:

$$ \hat{a} = -\frac{1}{2} \cdot \frac{1}{700 + f_{high}}, \quad \hat{b} = -\frac{700}{700 + f_{high}}, \quad \hat{c} = \frac{f_{high}}{2} \cdot \left( 1 + \frac{700}{700 + f_{high}} \right). \tag{24} $$

Once the centre frequencies of the first and the last filter are computed, the centre frequencies of the filters situated between them are easily calculated, since they are equidistant on the Mel scale. The step $\Delta\hat{f}$ between the centre frequencies of adjacent filters is computed as:

$$ \Delta\hat{f} = \frac{\hat{f}_{c_M} - \hat{f}_{c_1}}{M - 1} \tag{25} $$

where all the frequencies are in mels. The conversions $f_{c_1} \rightarrow \hat{f}_{c_1}$ and $f_{c_M} \rightarrow \hat{f}_{c_M}$ are given by (3). Having $\Delta\hat{f}$, the centre frequencies $\hat{f}_{c_i}$ are computed as:

$$ \hat{f}_{c_i} = \hat{f}_{c_1} + (i - 1) \cdot \Delta\hat{f}, \quad \text{for } i = 2, \ldots, M - 1. \tag{26} $$
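The computation of the HFCC centre frequencies through (21) to (26) can be illustrated with the following Python sketch. It is a minimal example under stated assumptions: the function names are hypothetical, the constants a, b, c are the values quoted in the text, the overbarred quantities of (22) appear as `b_bar` and `c_bar`, and the band limits and number of filters in the usage note are arbitrary illustrative choices.

```python
import math

A, B, C = 6.23e-6, 93.39e-3, 28.52    # a, b, c as quoted in the text


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)     # formula (3)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)      # formula (14)


def _positive_root(a_hat, b_hat, c_hat):
    b_bar = (B - b_hat) / (A - a_hat)                   # formula (22)
    c_bar = (C - c_hat) / (A - a_hat)
    return 0.5 * (-b_bar + math.sqrt(b_bar ** 2 - 4.0 * c_bar))   # formula (21)


def first_centre(f_low):
    """Centre frequency (Hz) of the first filter, coefficients from (23)."""
    d = 700.0 + f_low
    return _positive_root(0.5 / d, 700.0 / d,
                          -0.5 * f_low * (1.0 + 700.0 / d))


def last_centre(f_high):
    """Centre frequency (Hz) of the last filter, coefficients from (24)."""
    d = 700.0 + f_high
    return _positive_root(-0.5 / d, -700.0 / d,
                          0.5 * f_high * (1.0 + 700.0 / d))


def hfcc_centres(n_filters, f_low, f_high):
    """All M centre frequencies in Hz, equidistant on the Mel scale."""
    mel_first = hz_to_mel(first_centre(f_low))
    mel_last = hz_to_mel(last_centre(f_high))
    step = (mel_last - mel_first) / (n_filters - 1)     # formula (25)
    return [mel_to_hz(mel_first + (i - 1) * step)       # formula (26), then (14)
            for i in range(1, n_filters + 1)]
```

For instance, `hfcc_centres(23, 0.0, 4000.0)` returns 23 increasing centre frequencies whose first and last entries, about 31 Hz and about 3.54 kHz, coincide with `first_centre(0.0)` and `last_centre(4000.0)`.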
Thus, the $i$-th filter satisfies: Next, through (14), the reverse transformation $\hat{f}_{c_i} \rightarrow f_{c_i}$ is