Automatic Speech Recognition 2
JIA Pei
Email: [email protected]
First Edition
Abstract
ASR has been researched and applied as a human-computer interface (HCI) for over thirty
years [6]. Normally, in order to apply speech recognition, an entire system
needs to be developed from scratch. Fortunately, several well-known open source
speech recognition systems are available on the Internet, such as HTK developed
by Cambridge University [38, 39] (note that HTK has license restrictions),
Julius maintained by Kyoto University [16], the ISIP speech recognition
environment by Mississippi State University [7, 23], and CMU Sphinx [18, 30].
Modern general-purpose speech recognition systems, including all four of the
above open source ASR systems, are generally based on Mel Frequency Cepstral
Coefficients (MFCCs) for audio signal representation and Hidden Markov
Models (HMMs) for modeling the stochastic speech process.
The HMM was first described in a series of statistical papers by Baum [3]
and other authors in the 1960s. Speech recognition research based
on HMMs started in the mid-1970s; a representative ASR system of that time is
Dragon [1, 2]. After that, variants of the HMM were explored, such as discrete
HMMs [19], semicontinuous HMMs [13–15] and continuous HMMs [24]. In
1989, Rabiner reviewed HMM theory and summarized several problems in
speech recognition that could be modeled by HMMs [25]. All of the above HMM-related
research is summarized in the online tutorial “Ten years of HMMs” [5].
In fact, from the viewpoint of pattern recognition and machine learning,
the HMM, as well as the Kalman filter, another famous model from control theory,
can be viewed as an example of a Dynamic Bayesian Network (DBN) [10]. Murphy
reviewed this point in his PhD thesis [21] and stated that HMMs have limited
“expressive power”, whereas DBNs generalize HMMs by allowing the state space
to be represented in factored form rather than as a single discrete random
variable. Therefore, DBNs, also known as directed graphical models [4], should
be a more general tool for speech recognition. Indeed, as early as 1995,
Zweig had already applied DBNs to speech recognition in his PhD thesis [40].
The overall architecture of the CMU Sphinx4 speech recognition system is
shown in figure 1.
The following section elaborates how an entire ASR system is generally
constructed.
Figure 1: Sphinx4 ASR Framework [30]
• The frequency range of sound that humans can hear is between 20 and
20,000 Hz. According to the Nyquist theorem, in order to analyze any
audible sound (a 1D signal), a sample rate of at least 2 × 20,000 Hz is
necessary. The most common sample rate for speech/music is Sr = 44,100 Hz,
which comfortably satisfies the Nyquist criterion. Such a high sample rate
is not required for speech stored in audio files; in fact, in the above
speech recognition OSSs, Sr is usually chosen as 8,000 Hz, 16,000 Hz or
32,000 Hz.
• According to [37], “the voiced speech of a typical adult male will have a fun-
damental frequency of from 85 to 155 Hz, and that of a typical adult
female from 165 to 255 Hz”. Therefore, although we can hear sounds
below 85 Hz, we cannot produce them with our voice. In order to
analyze the lowest-frequency harmonic component of a voice that can be
produced by a human, the window size should be at least Sr / 85 samples long.
• The window size should be narrow enough that the speech articulators
do not significantly change in that window.
• In order to ease the later Fourier transform, it is better to make the window
size a power of two, i.e. $2^N$, $N = 1, 2, \cdots$
• Suitable window sizes to discretize the speech signal are therefore 128, 256,
512, 1,024 and 2,048 samples.
Given this speech discretization, it is easy to understand why adjacent
frames (windows) partially overlap. A Hamming window applied to the speech
signal attenuates the signal at both window edges, so the signal near the
edges is not well analyzed within that frame. Therefore, we recenter the
sample window in the next frame so that it partially overlaps the current
one, allowing the entire speech signal to be analyzed without loss.
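A minimal sketch of this framing and windowing step is given below, assuming a frame length of 256 samples with 50% overlap; both values are illustrative examples, since the text only requires a power-of-two length of at least Sr / 85.

import numpy as np

def frame_signal(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_len and hop (= frame_len // 2, i.e. 50% overlap) are example values.
    Assumes len(signal) >= frame_len.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * hop
        frames[i] = signal[start:start + frame_len] * window
    return frames

With Sr = 16,000 Hz, a frame length of 256 samples corresponds to a 16 ms window and satisfies 256 ≥ 16,000 / 85 ≈ 188.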
0.1.2 Frontend
Several canonical feature extraction methods have been proposed as the speech
recognition frontend, including MFCC, LPC, RASTA-PLP [11, 12]
and TECC [8]. Numerous variants of these audio processing methods have
appeared, e.g. removing preemphasis, adding liftering, or weighting the filter
banks in ways other than the triangular scheme. Here, we only explain MFCC in
detail.
Preprocessing
We first denote the raw wave data as $s_n$, $n = 0, 1, \cdots$. The audio signal
preprocessing is summarized in several steps, among them the pre-emphasis
sketched below and the framing and Hamming windowing described above.
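As a minimal sketch of the pre-emphasis step, assuming the conventional first-order filter $s'_n = s_n - \alpha\, s_{n-1}$; the coefficient $\alpha = 0.97$ is a common default and an assumption here, not a value specified in this document.

import numpy as np

def preemphasize(s, alpha=0.97):
    """Pre-emphasis filter s'_n = s_n - alpha * s_{n-1}.

    alpha = 0.97 is a conventional value (assumption, not from the text)."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - alpha * s[:-1])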
MFCC
After preprocessing, MFCCs are calculated in the following substeps.
Figure 2: Hamming Window Function [32]
where
$$
b_u(f_k) =
\begin{cases}
0 & \text{if } f_k < p_{u-1} \text{ or } f_k > p_{u+1} \\[4pt]
\dfrac{f_k - p_{u-1}}{p_u - p_{u-1}} & \text{if } p_{u-1} \le f_k < p_u \\[4pt]
\dfrac{p_{u+1} - f_k}{p_{u+1} - p_u} & \text{if } p_u \le f_k \le p_{u+1}
\end{cases}
\qquad u = 1, 2, \cdots, M \qquad (7)
$$
If the concerned frequency range is limited to between 0 and 8,000 Hz, and 19 fil-
ters are selected to compose the filter bank, namely $p_0 = 0$,
$p_{M+1} = 8{,}000$ and $M = 19$, the frequency edges of the triangular filter
banks are easily calculated as the vector p = (0.0, 94.0, 200.6,
321.6, 458.7, 614.3, 790.8, 991.0, 1218.1, 1475.6, 1767.8, 2099.2, 2475.1,
2901.4, 3385.0, 3933.6, 4555.8, 5261.5, 6062.0, 6970.0, 8000.0), as shown
in figure 5.

Figure 3: (a) Original Signal
It is reported in [20] that “for this speech/music classification problem,
the results are (statistically) significantly better if Mel-based cepstral
features rather than linear-based cepstral features are used.” Here, in
our application, we simply take this conclusion as given.
3. Take the logarithm of the M mel bins obtained above:
$$B'_u = \ln(B_u), \qquad u = 1, 2, \cdots, M \qquad (8)$$
4. Take the Discrete Cosine Transform (DCT) [17] of the list of mel bins,
as if it were a signal:
$$C(u) = \sum_{m=1}^{M} B'_m \cos\!\left(\frac{\pi (2m-1)\,u}{2M}\right), \qquad u = 1, 2, \cdots, M \qquad (9)$$
The MFCCs are the amplitudes of the resulting spectrum C(u), and they are
used as the extracted features for both training and recognition.
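The substeps above can be condensed into a rough sketch: the triangular weights follow Eq. (7), the logarithm Eq. (8), and the DCT Eq. (9). The function names, the small constant added before taking the logarithm, and the use of NumPy are illustrative assumptions; the edge vector p is assumed to be given in Hz, as in the example above.

import numpy as np

def triangular_weights(p, freqs):
    """Weights b_u(f_k) of Eq. (7) for M = len(p) - 2 triangular filters.

    p     -- edge frequencies p_0 .. p_{M+1} in Hz (e.g. the vector above)
    freqs -- DFT bin centre frequencies f_k in Hz
    """
    M = len(p) - 2
    B = np.zeros((M, len(freqs)))
    for u in range(1, M + 1):
        lo, mid, hi = p[u - 1], p[u], p[u + 1]
        rising = (freqs >= lo) & (freqs < mid)
        falling = (freqs >= mid) & (freqs <= hi)
        B[u - 1, rising] = (freqs[rising] - lo) / (mid - lo)
        B[u - 1, falling] = (hi - freqs[falling]) / (hi - mid)
    return B

def mfcc_frame(frame, sr, p):
    """MFCCs of one windowed frame, following Eqs. (7)-(9)."""
    spectrum = np.abs(np.fft.rfft(frame))               # |DFT| of the frame
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)     # bin frequencies f_k
    mel_bins = triangular_weights(p, freqs) @ spectrum  # B_u, via Eq. (7)
    log_bins = np.log(mel_bins + 1e-10)                 # B'_u, Eq. (8); epsilon avoids log(0)
    M = len(log_bins)
    m = np.arange(1, M + 1)
    # C(u) = sum_m B'_m cos(pi (2m - 1) u / (2M)),  Eq. (9)
    return np.array([np.sum(log_bins * np.cos(np.pi * (2 * m - 1) * u / (2 * M)))
                     for u in range(1, M + 1)])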
Figure 4: Mel-Hz Plot with Evenly Distributed Mel Slots
Figure 6: MFCC ((a) Absolute Values of DFT, (c) Mel, (e) DCT)
2. The frequency-to-mel-scale mapping is not one to one. Generally, MFCC is an
energy compaction method, which packs the energy of many frequency bins
into a small number of mel bins. During voice reconstruction, a
pseudo-inverse is therefore calculated so that the energy of the small
number of mel bins can be roughly redistributed over the larger number of
frequency bins.
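A minimal sketch of this redistribution step, assuming the same triangular filter-bank matrix as in the MFCC sketch above; the Moore-Penrose pseudo-inverse is one straightforward way to realize it, and the clipping of negative values is an added assumption.

import numpy as np

def mel_to_spectrum(mel_bins, B):
    """Roughly redistribute M mel-bin energies back to the frequency bins
    using the Moore-Penrose pseudo-inverse of the filter-bank matrix B
    (shape M x n_freq_bins). Negative values are clipped, since magnitudes
    cannot be negative."""
    approx = np.linalg.pinv(B) @ mel_bins
    return np.clip(approx, 0.0, None)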
Finally, the real and imaginary parts generated from $W_k$ are amplitude-
modulated (AM) by the sequence $\|S_k\|$.
Voice Reconstruction
Figure 8 shows the entire process of recovering a single audio frame
from the DCT parameters only.
As we can see, although (c) in figure 8 is quite similar to (b) in figure 6,
there are big differences between (d) in figure 8 and (d) in figure 3. That
means that, for a single audio frame, the signal cannot be reconstructed well.
However, comparing (g) or (h) in figure 8 with (a) in figure 3, we may conclude
that, viewed over the long term, the signal can be reconstructed reasonably
well, leaving aside the amplitude infidelity.

Figure 7: Used Filter Bank Weights
Two methods were finally used to join the recovered audio frames into a
whole signal. The first (refer to (g) in figure 8) ignores the overlapped
part of each audio frame; the second (refer to (h) in figure 8) averages
the overlapped parts of each pair of neighbouring frames. Experiments show
that the first method gives better results.
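The two joining strategies might be sketched as follows, assuming the frames were produced with a hop of half the frame length as in the framing sketch above; the function names and the exact handling of the last frame are illustrative choices.

import numpy as np

def join_drop_overlap(frames, hop):
    """Method 1: keep only the first `hop` samples of each frame, i.e.
    ignore the overlapped part (reported above to give better results)."""
    return np.concatenate([f[:hop] for f in frames])

def join_average_overlap(frames, hop):
    """Method 2: average the overlapped halves of neighbouring frames."""
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    count = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
        count[i * hop:i * hop + frame_len] += 1
    return out / np.maximum(count, 1)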
Figure 8: (a) Inverse DCT to Obtain Mel Log, (e) Inverse Hamming to Obtain Preemphasis Data
0.1.5 Language Model
Language models contain a list of words and their probabilities of occurrence in
a given sequence. Unlike acoustic models, which provide phoneme-level speech
structure, language models provide word-level language structure. Language
models typically fall into two categories: graph-driven models [28, 29] and N-
Gram models [27, 31]. Graph-driven models can be viewed as a first-order Markov
model, in which a word's probability depends only on the previous word,
while N-Gram models can be viewed as an (n − 1)-th order Markov model, in which
a word's probability is estimated from the previous n − 1 words.
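As a small illustration of the N-Gram idea, a bigram (N = 2) model estimated by simple relative frequencies is sketched below; any practical language model would additionally apply smoothing, which is omitted here, and the sentence markers are illustrative conventions.

from collections import defaultdict

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency from a list of
    tokenised sentences (no smoothing; purely illustrative)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

# Example: P("recognition" | "speech") from a toy two-sentence corpus
model = train_bigram([["speech", "recognition", "works"],
                      ["speech", "recognition", "is", "hard"]])
print(model["speech"]["recognition"])   # 1.0 in this toy corpus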
0.1.6 Decoder
Recognizing all the words of a sentence in context is obviously a much more
complicated task than recognizing isolated words. If we simply consider
recognizing isolated words, then the decoder can be seen as a search
engine, which essentially looks for the single word in the dictionary whose
pronunciation is most similar to what the user has uttered.
Speech recognition thus ultimately becomes a graph search problem: finding
the most likely phoneme sequence. Generally speaking, there are many
categories of search algorithms (refer to [36]): brute-force or
exhaustive search, heuristic search including A* search, breadth-first search,
depth-first search, etc. Since our acoustic model is based on HMMs, the Viterbi
search algorithm, which is specific to HMMs, will be adopted in our experiments.
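A compact sketch of the Viterbi algorithm over a discrete HMM is given below; the log-domain formulation and the parameter layout (initial distribution, transition matrix, emission matrix) are generic textbook choices rather than the internals of any of the toolkits mentioned above.

import numpy as np

def viterbi(obs, log_pi, log_A, log_E):
    """Most likely state sequence for an observation sequence `obs`.

    log_pi -- log initial state probabilities, shape (S,)
    log_A  -- log transition probabilities, shape (S, S)
    log_E  -- log emission probabilities, shape (S, n_symbols)
    """
    S, T = len(log_pi), len(obs)
    delta = np.full((T, S), -np.inf)    # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers to the best predecessor
    delta[0] = log_pi + log_E[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A            # (S, S): prev -> cur
        back[t] = np.argmax(scores, axis=0)               # best predecessor per state
        delta[t] = scores[back[t], np.arange(S)] + log_E[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                         # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]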
• WSJ – Wall Street Journal [9], which refers to a large speech data set
(or “corpus”) suitable for training the acoustic model. WSJ is
read by many adult male and female speakers of American English.
The dataset is available online from the LDC (Linguistic Data Consortium);
summation of Gaussians. A single Gaussian is defined by a mean and
a variance, or, in the case of a multidimensional Gaussian, by a mean
vector and a covariance matrix, or, under some simplifying assump-
tions, a variance vector. Here, a mixture of 8 Gaussian distributions is
used in this acoustic model;
• 16k – the training speech files are sampled at the rate of 16kHz;
• 40mel – 40 Mel-scale filters are used, suited to the 16 kHz audio. To convert
f hertz into m mels, $m = 1127.01048 \log_e(1 + f/700)$; and the inverse,
$f = 700\,(e^{m/1127.01048} - 1)$ (see the conversion sketch after this list);
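The two conversion formulas above can be transcribed directly; the filter_edges helper below is an illustrative way to obtain evenly spaced mel slots and map them back to Hz, which should roughly reproduce the edge vector p listed earlier (the exact constants used by a given toolkit may differ slightly).

import numpy as np

def hz_to_mel(f):
    """m = 1127.01048 * ln(1 + f / 700)."""
    return 1127.01048 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """f = 700 * (exp(m / 1127.01048) - 1)."""
    return 700.0 * (np.exp(m / 1127.01048) - 1.0)

def filter_edges(f_min, f_max, n_filters):
    """n_filters + 2 edge frequencies, evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)

# e.g. filter_edges(0, 8000, 19) roughly reproduces the vector p given earlier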
Bibliography
[10] Z. Ghahramani. Learning Dynamic Bayesian Networks. In Adaptive Processing
of Sequences and Data Structures, Lecture Notes in Artificial Intelligence,
pages 168–197. Springer-Verlag, Berlin, 1998.
[17] Syed Ali Khayam. The discrete cosine transform (dct): Theory and
application. Technical report, Department of Electrical & Computer
Engineering, Michigan State University, March 10th 2003.
[21] K. Murphy. Dynamic Bayesian Networks: Representation, Inference
and Learning. PhD thesis, UC Berkeley, Computer Science Division,
July 2002.
[23] Joe Picone and the Staff at ISIP. Fundamentals of Speech Recognition:
A Tutorial Based on a Public Domain C++ Toolkit. Institute for Signal
and Information Processing (ISIP), Mississippi State University, August
15 2002.
[28] J.-P. Vert and M. Kanehisa. Graph-driven features extraction from mi-
croarray data using diffusion kernels and kernel CCA, December 9–14, 2002.
[31] Wikipedia. N-gram — wikipedia, the free encyclopedia, 2007. [Online;
accessed 28-December-2007].
[38] S. Young. The HTK hidden Markov model toolkit: Design and philosophy.
Technical Report TR.153, Department of Engineering, Cambridge University,
1994.