Speech Recognition Paper
Abstract. This paper presents a pre-processing scheme for linear predictive coding (LPC) features that prepares reliable reference templates for the set of words to be recognized by an artificial neural network. The paper also proposes the use of a pitch feature, derived from the recorded speech data, as an additional input feature. The Dynamic Time Warping (DTW) algorithm is the backbone of a newly developed algorithm called the DTW fixing-frame algorithm (DTW-FF), which is designed to perform template matching for the input pre-processing. The purpose of the new algorithm is to align the input frames in the test set to the template frames in the reference set. This frame normalization is required because a neural network compares data of equal length, whereas utterances of the same word usually vary in length. By frame fixing, the input frames are adjusted to the same number of frames as the reference frames. Another task of the study is to extract pitch features using the harmonic filter algorithm. After the pitch is extracted and the LPC features are fixed to the desired number of frames, speech recognition using a neural network can be performed, and the results are very promising: recognition rates as high as 98% were achieved using a combination of the two features mentioned above. At the end of the paper, the convergence of the conjugate gradient descent (CGD), Quasi-Newton, and steepest gradient descent (SGD) search directions is compared, and the results show that CGD outperformed Quasi-Newton and SGD.
Keywords: Dynamic time warping, time normalization, neural network, speech recognition, conjugate gradient descent
1&2 Center for Biomedical Engineering, Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor, Malaysia
3 Department of Mathematics, Faculty of Science, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor, Malaysia
* Corresponding author: Email: [email protected]
1.0 INTRODUCTION
Since its birth more than 30 years ago, Dynamic Time Warping (DTW) has been one of the prime speech recognition methods. It matches an unknown input speech template to a pre-defined reference template, and it is considered the simplest speech recognition method compared with others such as the Hidden Markov Model (HMM) or the neural network (NN). DTW is popular among pattern-recognition methods because of its ability to find the shortest path between two time-series signals such as speech [1]. Through this matching technique, a test speech signal can be expanded or compressed according to a reference template [2, 3].
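As a concrete point of reference, the matching just described can be sketched as a standard DTW with horizontal, vertical, and diagonal steps. This is an illustrative Python sketch, not the authors' implementation:

```python
import numpy as np

def dtw(test, ref):
    """Classic DTW with symmetric steps (horizontal, vertical, diagonal).

    test: (I, p) array of test-frame feature vectors
    ref:  (J, p) array of reference-frame feature vectors
    Returns the accumulated global distance and the warping path.
    """
    I, J = len(test), len(ref)
    # local Euclidean distances between every pair of frames
    d = np.sqrt(((test[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2))
    D = np.full((I, J), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            best = min(D[i-1, j] if i > 0 else np.inf,                 # vertical
                       D[i, j-1] if j > 0 else np.inf,                 # horizontal
                       D[i-1, j-1] if i > 0 and j > 0 else np.inf)     # diagonal
            D[i, j] = d[i, j] + best
    # backtrack the warping path from (I-1, J-1) to (0, 0)
    path, i, j = [(I-1, J-1)], I-1, J-1
    while (i, j) != (0, 0):
        moves = [(i-1, j-1), (i-1, j), (i, j-1)]
        i, j = min((m for m in moves if m[0] >= 0 and m[1] >= 0),
                   key=lambda m: D[m])
        path.append((i, j))
    return D[-1, -1], path[::-1]
```

Horizontal runs in the returned path correspond to compression of the test signal, and vertical runs to expansion.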
We now live in an era of high technology and fast computing devices. Research in speech recognition using the back-propagation neural network (BPNN) therefore focuses on developing more accurate recognizers with lower network complexity and faster processing time, to accommodate current needs. Past research on NNs found that using a higher number of hidden neurons yields a higher recognition rate, but requires a longer processing time for the error to converge. To use an NN as a recognition tool, the number of frames must be fixed to the same length for the training and testing data. Thus, a method to overcome these problems, especially one with a compact input-feature representation, has to be developed. In that respect, time normalization is required to align the frames to a fixed length with respect to a reference that is picked from the samples based on their average length. Time normalization is a typical method for interpolating an input signal into a fixed-size input vector. Linear time alignment is the simplest method to overcome time variation, but it is a poor one because it does not account for important feature vectors when deleting or duplicating frames to shorten or lengthen the pattern vectors [2, 3]; nevertheless, it has been the basic method for the compression and expansion of speech pattern vectors [4, 5].
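Linear time alignment simply deletes or duplicates frames at linearly spaced positions, which is exactly why it can discard important feature vectors. A minimal sketch, assuming the frames are stored row-wise in an array:

```python
import numpy as np

def linear_time_align(frames, target_len):
    """Linear time alignment: stretch or compress a frame sequence to a
    fixed number of frames by duplicating or deleting frames at linearly
    spaced positions (the simple method criticised above -- no account
    is taken of which frames carry important information)."""
    frames = np.asarray(frames)
    idx = np.round(np.linspace(0, len(frames) - 1, target_len)).astype(int)
    return frames[idx]
```

Stretching a 4-frame sequence to 8 frames simply duplicates every frame once, regardless of content.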
Many works have combined NNs with the multilayer perceptron (MLP) architecture, HMMs, and DTW in one way or another [6]. Meanwhile, [7] used DTW and an MLP with a sequence of dynamic networks. They did not perform time alignment with DTW; instead, they used DTW to compute the global distance score and fed that score into their MLP. Other works involving DTW and NNs include [8, 9]; however, they also used the total distance of the warping path as the input to the MLP.
In this research, frame fixing is done based on the DTW method: input frames that warp to one reference frame with almost the same local distance are compressed (a horizontal warp), while an input frame is expanded across several reference frames in a vertical warp (the reference frames share the same feature vector as that single frame of the unknown input). This frame alignment is also known as the expansion and compression method, governed by the slope conditions described below. Three slope conditions, based on DTW type 1, have to be dealt with in this work during frame compression (denoted F−) and expansion (denoted F+). For compression, the frame retained is the one with the minimum local distance among the input frames warped to the same reference frame:

F− = min { d(i, j) }    (1)
The distance is calculated using the Euclidean distance measure. For a set of LPC coefficients with p feature vectors, indexed j = 1, 2, ..., p in the (x, y) coordinates, x represents the test-set axis while y represents the reference-set axis. The distance is calculated as

d(x, y) = sqrt( Σ_{j=1}^{p} (x_j − y_j)² )    (3)
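Equation (3), the Euclidean local distance between a test frame and a reference frame, translates directly into code (a small illustrative helper):

```python
import numpy as np

def local_distance(x, y):
    """Euclidean local distance between a test frame x and a reference
    frame y, each a vector of p LPC coefficients (Equation (3))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))
```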
The expansion and compression are done throughout the samples along the warping path, where the input frames are matched to the reference-template frames using the DTW-FF algorithm. After this procedure, the data are ready for neural-network recognition. The normalized data were tested against the typical DTW algorithm and gave the same global distance score. Further findings are discussed in the results and discussion section.
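One hypothetical reading of the DTW-FF rule described above can be sketched as follows: given a DTW warping path, each horizontal run (several input frames mapped to one reference frame) is compressed to the input frame with the smallest local distance, and each vertical run (one input frame mapped to several reference frames) repeats that input frame, so the output always has the reference length. This is a sketch of the idea, not the authors' code:

```python
import numpy as np

def dtw_frame_fix(test, ref, path):
    """Sketch of DTW-FF frame fixing.  path is a list of (i, j) pairs
    from a DTW alignment of test (I frames) against ref (J frames).
    Returns J fixed frames and their retained local distance scores."""
    J = len(ref)
    fixed = [None] * J          # one test frame chosen per reference frame
    scores = [np.inf] * J       # local distance retained for each fixed frame
    for i, j in path:
        d = np.sqrt(np.sum((test[i] - ref[j]) ** 2))
        if d < scores[j]:       # keep the minimum-distance frame for this j
            scores[j] = d
            fixed[j] = test[i]
    return np.array(fixed), np.array(scores)
```

Because several path points can share one j (compression) and one i can appear for several j (expansion), the output always has exactly J frames, and the per-frame scores are the DTW-FF feature fed to the network.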
[Figure: Pitch feature extraction (PSHF block) — the raw signal (.wav file, via SFS) passes through pitch extraction (raw F0 track), pitch optimization (F0opt), and harmonic decomposition into V(m) and U(m).]

[Figure 3: Flow of the back-end recognition experiment — LPC feature extraction → DTW-FF feature extraction → back-propagation neural network → recognition %.]
neural network
during the iterations. The mean-square-error method is used to compute the weight adjustments, and the method of steepest gradient descent is employed in the direction search for fast convergence of the algorithm.
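A steepest-descent weight update under a mean-square-error cost can be sketched for a single linear layer (an illustration only; the paper's BPNN is multilayer, and the learning rate here is arbitrary):

```python
import numpy as np

def sgd_step(W, X, T, lr=0.1):
    """One steepest-descent update of weights W for a linear layer,
    minimizing mean square error between outputs X @ W and targets T.
    Illustrative sketch of the update rule, not the paper's network."""
    Y = X @ W                   # forward pass
    E = Y - T                   # output error
    grad = X.T @ E / len(X)     # gradient of MSE w.r.t. W (up to a constant)
    return W - lr * grad        # step in the steepest-descent direction
```

Repeating this step drives the error down along the negative gradient, the baseline against which the Quasi-Newton and conjugate gradient searches are later compared.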
To summarize, Figure 3 shows the flow of the back-end recognition experiment carried out once the LPC coefficients are obtained. The data used for this preliminary study are 10th-order LPC coefficients, computed over 10 ms Hamming-windowed frames, for 6 subjects uttering the digits 0-9 in Malay, repeated 5 times in 5 different sessions. An average of 47 frames is selected for the reference in digit recognition, and this number is used against the unknown input during the frame-fixing process.
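The 10th-order LPC coefficients over Hamming-windowed frames can be obtained with the standard autocorrelation method and the Levinson-Durbin recursion; the paper does not detail its extraction procedure, so this is an assumed, textbook sketch:

```python
import numpy as np

def lpc(frame, order=10):
    """LPC coefficients via autocorrelation + Levinson-Durbin recursion,
    one common way to compute 10th-order LPC features (an assumed method;
    the paper's exact extraction settings may differ)."""
    frame = np.asarray(frame, float) * np.hamming(len(frame))
    # autocorrelation at lags 0..order
    r = np.array([frame[:len(frame)-k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                     # prediction error energy
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i-1:0:-1]) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]          # update lower-order coeffs
        a[i] = k
        err *= (1 - k * k)
    return a  # A(z) = 1 + a1*z^-1 + ... + a_order*z^-order
```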
(iii) Use DTW-FF to fix the frames; the source must have the same number of frames as the template.
(iv) Retain the DTW scores of the fixed frames.
(v) Load the test samples.
(vi) Set the learning rate and momentum rate; these must be the same as in the training setting.
(vii) Recall the weights from step (f) of the training phase.
(viii) Compare and obtain the recognition percentage.
After the frame-fixing process using the DTW-FF algorithm, the local distances of the fixed frames are collected and the data are ready for the next stage, recognition. The normalized data were tested and compared against the LPC coefficients using the typical DTW algorithm. The results showed no change in the recognition rate, which means there is no loss of information even though only the local distance scores of the fixed frames (the DTW-FF feature) are used as the input. Moreover, using the DTW-FF feature reduces the amount of input that must be presented to the NN.
In Figure 4, the local distances of the unknown input frames x(7), ..., x(9) are compared according to slope condition (i) because of the horizontal warping. Frame x(9) has the minimum local distance among the three frames, so these 3 frames are compressed to one frame occupying only y(7) (it appears as frame 9 on the y-axis). Frame x(19) of the input, in contrast, is expanded to 6 frames in accordance with slope condition (ii), because of a vertical warping between the utterances: the 6 consecutive reference-template frames y(18), ..., y(23) have the same feature vector as frame x(19), so x(19) occupies y(18), ..., y(23). This means that frame x(19) of the input has
[Figure 4: DTW frame fixing between an input template (X = 24 frames) and a reference template, for digit 1 by one speaker]
matched 6 feature vectors in a row of the reference template set. Since a diagonal movement is the fastest track toward the global distance and always gives the least local distance compared with horizontal or vertical movements, the normal DTW procedure is applied to it.
percentage between the two types of coefficients. This might be due to the same feature-vector pattern matching between template frames in both algorithms. Therefore, fixing the frames does not affect the recognition rate. This also supports the finding that recognition before and after DTW-FF is identical and that no loss of information occurred during the DTW-FF algorithm.
A statistical test, the paired T-test, was conducted on the data in Table 1. The results rejected the hypothesis that m_before = m_after. Since m_before < m_after, it can be concluded that the improvement from using DTW-FF coefficients over the typical DTW with BPNN is significant. Moreover, much of the network complexity and many of the connection-weight computations during the forward and backward passes have been eliminated.
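The paired T-test on the six subjects' before/after rates works as follows; the numbers below are invented placeholders purely to show the mechanics, not the values in Table 1:

```python
import numpy as np

def paired_t(before, after):
    """Paired (dependent-samples) t statistic for H0: mean_before == mean_after.
    Compare |t| against the critical value for n-1 degrees of freedom."""
    d = np.asarray(after, float) - np.asarray(before, float)
    n = len(d)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Hypothetical recognition rates (%) for six subjects -- illustration only.
before = [88.0, 90.0, 85.0, 91.0, 87.0, 89.0]
after = [95.0, 96.0, 93.0, 98.0, 94.0, 96.0]
t = paired_t(before, after)
# |t| well above the two-sided 5% critical value for 5 dof (2.571)
# would reject H0, i.e. the improvement is significant.
```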
Figure 5 Recognition rate for 5 hidden nodes before and after pitch feature is included
Figure 6 Recognition rate for 10 hidden nodes before and after pitch feature is included
The pitch feature is added after step (iv) in both the training and testing phases described in Section 4.0. The results are tabulated in Table 2; the improvement (before and after pitch addition) is shown as bar charts in Figures 5 and 6.
Figures 5 and 6 clearly show that the pitch feature improves recognition performance when added to the DTW-FF feature, even though pitch by itself cannot represent speech well enough for recognition. The networks learned sufficiently with 10 hidden nodes, rather than 20, to reach a high percentage before the pitch feature was added; that is, the network converged faster.
The 5-hidden-node comparison before and after adding the pitch feature only shows that an improvement is made even though the network has not yet learned sufficiently. The 10-hidden-node improvement, however, gives a good indication of the importance of pitch when combined with another feature such as the DTW-FF feature.
[Figure 7: Sum squared error (SSE) versus epochs (0-5000) for SGD, Quasi-Newton, and CGM]
Figure 8 The fluctuations during the search for optimal global minimum using
conjugate gradient algorithm
converges at a rate between those of the steepest gradient and Newton methods, but reaches the best global minimum among the three methods tested.
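For reference, the conjugate gradient direction update can be illustrated on a quadratic model (linear CG with the Fletcher-Reeves update; a textbook sketch rather than the exact training code). Each new search direction is made conjugate to the previous ones instead of simply following the negative gradient, which is why CG typically converges faster than steepest descent:

```python
import numpy as np

def cg_minimize(A, b, x0, tol=1e-10, max_iter=100):
    """Linear conjugate gradient: minimizes f(x) = 0.5 x^T A x - b^T x
    for a symmetric positive-definite A (equivalently, solves A x = b)."""
    x = x0.astype(float)
    r = b - A @ x              # residual = negative gradient
    d = r.copy()               # first direction = steepest descent
    for _ in range(max_iter):
        if r @ r < tol:
            break
        alpha = (r @ r) / (d @ A @ d)      # exact line search on a quadratic
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r)   # Fletcher-Reeves coefficient
        d = r_new + beta * d               # new conjugate direction
        r = r_new
    return x
```

In the nonlinear (neural network) setting, the exact line search above is replaced by a one-dimensional search such as the golden section search discussed next.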
Meanwhile, a zoomed-in view of Figure 7 over epochs 200-600 is shown in the inset box of Figure 8. The fluctuations, i.e. the rises and falls in the sum squared error, are due to the search for the optimal global minimum, which applies the golden section search. In this gradient search, the interval is subdivided into smaller sections so that the optimal global minimum can be located; this is why, in the golden section search, the sum of errors fluctuates between two points in the selected interval. The sum of errors continues to rise and fall until the optimal global minimum is obtained; at that point the sum of squared errors has its minimum value, reached when the difference between the points in the interval falls below the set tolerance.
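The golden section search described above can be sketched as follows; each probe of f corresponds to one evaluation of the network error along the search direction, which is what produces the visible rises and falls in the SSE curve:

```python
import math

def golden_section(f, a, b, tol=1e-6):
    """Golden section search: repeatedly subdivides [a, b] at the golden
    ratio until the interval is narrower than tol, keeping the sub-interval
    that must contain the minimum of a unimodal f."""
    invphi = (math.sqrt(5) - 1) / 2            # 1/phi ~ 0.618
    c = b - invphi * (b - a)                   # lower interior probe
    d = a + invphi * (b - a)                   # upper interior probe
    while abs(b - a) > tol:
        if f(c) < f(d):                        # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                                  # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2
```

For example, minimizing (x - 2)^2 on [0, 5] converges to x ≈ 2 once the interval shrinks below the tolerance.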
6.0 CONCLUSION
Frame alignment based on the DTW method for pre-processing LP coefficients has been used to produce a new form of compressed data called DTW-FF coefficients. These new coefficients are used as the input to the BPNN described in this paper. With the DTW-FF algorithm, frame matching is performed and its output, the local distance scores, is fed into the BPNN. The experiments proved that the DTW-FF algorithm can serve as front-end processing for BPNN speech recognition, even though DTW itself is a back-end recognition engine. This is an alternative way to resolve the problem of feeding data into a neural network.
The DTW-FF coefficients were compared with the LPC coefficients under the typical DTW algorithm to identify whether any loss of information had occurred. The experiments showed no change in the recognition rate, so we conclude that no information is lost during frame fixing. The frame alignment adopts DTW to normalize the spoken word length. The normalized templates are then used as the input to the BPNN for recognition, and they are shown to improve recognition performance on the tested samples.
In conclusion, the proposed DTW-FF algorithm provides a better way of representing input features to the NN, saving computation cost and network complexity while achieving a higher recognition rate than the typical DTW itself. An even higher recognition rate is achieved when the pitch feature is added to the DTW-FF feature.
Network performance has also been tested with other descent methods, Quasi-Newton and conjugate gradient, which use second-order information about the function to be optimized. These methods were compared against the traditional back-propagation neural network, which uses steepest gradient descent during the backward pass to update the connection weights. The observations show that the conjugate gradient algorithm reached the best global minimum.
REFERENCES
[1] Sakoe, H. and S. Chiba. 1978. Dynamic Programming Algorithm Optimization for Spoken Word
Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. ASSP-26(1): 43-49.
[2] Silverman, H. F. and D. P. Morgan. 1990. The Application of Dynamic Programming to Connected Speech
Recognition. IEEE ASSP Magazine. 7-25.
[3] Abdulla, W. H., D. Chow, and G. Sin. 2003. Cross-Words Reference Template for DTW-based Speech
Recognition System. IEEE Technology Conference (TENCON). Bangalore, India. 1: 1-4.
[4] Rabiner, L. and B. H. Juang. 1993. Fundamentals of Speech Recognition. Englewood Cliffs, New Jersey:
Prentice Hall.
[5] Creany, M. J. 1996. Isolated Word Recognition using Reduced Connectivity Neural Networks with Non-Linear Time Alignment Methods. Ph.D. Thesis. University of Newcastle-upon-Tyne.
[6] Kuhn, M. H., H. Tomaschewski, and H. Ney. 1981. Fast Nonlinear Time Alignment for Isolated Word
Recognition. Proceedings of ICASSP. 6: 736-740.
[7] Ahmadi, M., N. J. Bailey, and B. S. Hoyle. 1996. Phoneme Recognition using Speech Image (Spectrogram).
3rd International Conference on Signal Processing. 1: 675-677.
[8] Abdul Aziz, M. A. 2004. Speaker Recognition System Based on Cross Match Technique. Master Thesis.
Universiti Teknologi Malaysia.
[9] Wildermoth, B. R. 2000. Text-Independent Speaker Recognition Using Source Based Features. Master of Philosophy Thesis. Griffith University, Australia.
[10] Sudirman, R. and S. H. Salleh. 2005. NN Speech Recognition Utilizing Aligned DTW Local Distance
Scores. Proceeding of 9th International Conference on Mechatronics Technology. Kuala Lumpur.
[11] Sudirman, R., S. H. Salleh, and S. Salleh. 2006. Local DTW Coefficients and Pitch Feature for Back-Propagation NN Digits Recognition. Proceeding of IASTED International Conference on Networks and Communications. Thailand. 201-206.
[12] Sudirman, R., S. H. Salleh, and T. C. Ming. 2005. Pre-Processing of Input Features using LPC and Warping
Process. Proceeding of 1st International Conference on Computers, Communications, and Signal Processing.
Kuala Lumpur. 300-303.
[13] Jackson, P. J. B. and C. H. Shadle. 2001. Pitch-Scaled Estimation of Simultaneous Voiced and Turbulence
Noise Components in Speech. IEEE Transactions on Speech and Audio Processing. 9(7): 713-726.
[14] Muta, H., T. Baer, K. Wagatsuma, T. Muraoka, and H. Fukada. 1988. A Pitch Synchronous Analysis of
Hoarseness in Running Speech. Journal of Acoustical Society of America. 84(4): 1292-1301.