Voice Recognition Using Artificial Neural Networks
Keywords: voice recognition, artificial neural networks, Gaussian mixture model, cepstral coefficients.
1. Introduction
Voice recognition of speakers is the problem of converting the information content of a speaker's speech waveform into identifiable sets of features that carry the discriminative information necessary to recognize the speaker. The ability of a recognition system to recognize a speaker's voice depends essentially on adequate capture of the time-frequency and energy content of the speech waveform, and on how well the recognition model parameters are trained to produce the best discrimination and thus accurate recognition. With advances in technology, the use of voice signals for identification has found many applications in areas such as information access control, banking services, secured database access, remote telephone services, avionics, and automobile systems [1-3]. Although much has been accomplished, especially for isolated words, recognition based on continuous speech signals remains an area of considerable attention because continuous speech depicts the natural flow of words [4, 5]. Hence, recognition of speakers based on continuous speech signals may be useful for applications that require detection of speakers in natural conversation. Unlike isolated-word recognition, where the words of an utterance are separated by pauses, continuous speech has no such pauses, and the recognition task must therefore predict where each word in an utterance ends and the next begins in order to produce the correct utterance. Errors arising from the length of the utterances may therefore broaden the variance of the speaker's class distribution, leading to increased classification error and, ultimately, reduced recognition accuracy for continuous speech signals.
Many algorithms and approaches have been used over the years for the recognition of speech patterns [7-21], and the most commonly used is the hidden Markov model (HMM), which has been shown to have high recognition performance. Ramesh and Wilpon [9], for example, demonstrated recognition rates of 92-94.5% using HMM for isolated words (numbers), and in the work of Katagiri and Lee [10], recognition rates of 67-83% were reported for isolated-word recognition. Despite its widespread use in speech recognition technology, the standard HMM algorithm has been shown to exhibit poor discriminative learning due to its training algorithm; to make up for this deficiency, various hybrid solutions have been proposed to increase the discriminative
classification power [12]. The viability of machine learning approaches such as artificial neural networks (ANN) has also been explored as a useful means of assisting statistical speech recognition [13-17], owing to their discriminative and adaptive learning properties. The capability of the ANN has been demonstrated in many aspects, such as isolated-word recognition, phoneme classification, and probability estimation for speech recognizers [13, 14]. Much as the ANN offers highly discriminative training, especially for short-time speech signals such as isolated words, it has difficulty adequately modeling the temporal variations of long-time speech signals. Hybrid
solutions that draw on the strengths of the ANN and HMM frameworks have also been demonstrated in speech
recognition technology. In the work of Trentin et al. [20], a hybrid ANN/HMM architecture was used to achieve
speech recognition rates of 54.9-89.27% at corresponding SNRs of 5-20 dB for isolated utterances. Reynolds et al. have also employed the statistical Gaussian mixture model (GMM) framework, which can be regarded as a single-state HMM, to achieve speech recognition rates of 80.8-96.8% [21, 22] in a speaker-independent system using isolated
utterances. Though there is considerable research activity in continuous speech recognition, most of it concentrates on the correct detection or recognition of words, or of their positions in utterances, for example, detecting when a speaker has used a word that is outside the vocabulary of the continuous speech. The goal of this work, however, is to identify speakers using the waveform distributions of their continuous speech utterances. This may be particularly useful for applications that require detection of speakers in natural conversation environments, such as forensic and security activities.
In this work we combine the frameworks of ANN and the GMM to implement voice recognition of speakers.
The combined paradigm explores the discriminative and adaptive learning capabilities of the ANN and the
capabilities of the GMM to model underlying properties and offer high classification accuracy while still being
robust to corruptions in the speech signals for the recognition task. The performance of the recognition system is
evaluated using variable lengths of speech utterances that are known and unknown to the system.
Figure 2 Speech signal of the utterance “all graduating engineering students must be seated in the great hall of the university by 11:30am for
the chairman’s address” from (a) male speaker (b) female speaker.
To minimize spectral leakage from one frame to the next, each framed voice signal x[m, n] is passed through a Hamming window function, as described in equation (5), which tapers the edges of the signal in each frame smoothly to
zero. Figure 3 shows a plot of the clean speech signal component of the voice waveform of the male speaker
shown in Figure 2 (a) with all the background noise signals and the silence components removed.
y[m, p] = x[m, p] \cdot \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi p}{N - 1} \right) \right], \quad 0 \le p \le N - 1, \; 0 \le m \le N_S - 1. \qquad (5)
Figure 3 Processed utterance of a male speaker with suppressed noise and silence components.
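For illustration, a minimal NumPy sketch of the framing and windowing step of equation (5) is shown below. The 400-sample frames with a 160-sample shift (25 ms and 10 ms at the 16 kHz sampling rate) are assumed values for illustration; the paper does not state its frame parameters here.

```python
import numpy as np

def frame_and_window(x, frame_len=400, hop=160):
    """Split a speech signal into overlapping frames and apply the
    Hamming window of equation (5) to each frame.
    frame_len=400, hop=160 correspond to 25 ms frames with a 10 ms
    shift at 16 kHz -- illustrative values, not taken from the paper."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop : m * hop + frame_len]
                       for m in range(n_frames)])
    # w[p] = 0.54 - 0.46 * cos(2*pi*p / (N-1)), 0 <= p <= N-1
    p = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * p / (frame_len - 1))
    return frames * window      # y[m, p] = x[m, p] * w[p]
```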
Following suppression of all the irrelevant information in the voice signals, the clean speech is then converted
into streams of feature vector coefficients containing only the information about the spoken utterance that is
required for its recognition. The extraction of the feature vectors of the speech signal may be carried out using
either time domain analysis or spectral domain analysis. Whilst the time domain is more straightforward and
takes less calculation time, the spectral domain approach is very effective but requires more computational time
since it involves frequency transformation. We however used the well-established spectral analysis technique of
mel frequency cepstral coefficient (MFCC) algorithm [24, 25] for the feature vectors extraction due to its
robustness. Figure 4 shows the MFCC flow process for the feature vectors coefficients extraction of the speech
signal. The windowed speech signals are first converted to the frequency domain using the discrete Fourier transform (DFT). The magnitudes of the spectral distribution are then passed through a mel filter bank to obtain the mel-domain spectral distribution. The power spectral distribution of each frame at the output of the filter bank is then smoothed using a logarithm function, followed by a discrete cosine transform (DCT), which compacts the energy of the spectral data into the lower-order coefficients and converts the mel-domain spectral distribution into the time-domain mel coefficients C(i, j) as:
C(i, j) = \frac{1}{m} \sum_{k=0}^{m-1} \log E(i, k) \cos\!\left( \frac{\pi j (k + 0.5)}{m} \right), \quad 0 \le j \le m - 1, \; 0 \le i \le N_S - 1, \qquad (6)
where m is the number of filters in the mel filter bank and E(·) is the power spectrum. The mel filter bank is computed using the following parameters: sampling rate = 16 kHz, minimum frequency = 10 Hz, maximum frequency = 8 kHz, and number of filters m = 30. The melcepst function of the Voicebox Matlab toolbox was used for the feature extraction computation [26]. A 12-dimensional MFCC feature vector was extracted from each frame of the speech utterance. Since the speech feature space was quite large for the computations, it was reduced to a small set of representative feature vectors (a codebook) sufficient to adequately describe the extracted speech features of the utterance. The vector quantization (VQ) algorithm [27] was used for the feature space reduction.
Figure 4 Block diagram of the MFCC operation for the voice feature data
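The feature extraction in this work uses the melcepst function of the Voicebox Matlab toolbox [26]; as a rough Python equivalent, the sketch below extracts 12-dimensional MFCC vectors with the mel filter bank parameters given above and reduces the feature space to a VQ codebook with K-means. The codebook size of 64 is a hypothetical choice, not a value from the paper.

```python
import librosa
from scipy.cluster.vq import kmeans

def extract_codebook(signal, sr=16000, codebook_size=64):
    """Extract 12-dimensional MFCC vectors (30 mel filters, 10 Hz-8 kHz,
    matching the parameters in the text) and reduce the feature space
    to a small VQ codebook, in the spirit of the LBG algorithm of [27].
    codebook_size = 64 is an illustrative choice."""
    # One 12-dim MFCC vector per frame; transpose to (frames, 12)
    feats = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12,
                                 n_mels=30, fmin=10, fmax=8000).T
    # K-means clustering collapses the frames into codebook_size
    # representative codewords
    codebook, _ = kmeans(feats.astype(float), codebook_size)
    return codebook
```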
The ANN was organized as a three-layer system, and the number of neurons in each layer was varied through training of the network until the number that gave the best training was reached. A variety of utterances from five speakers were
used for testing the ANN architecture. The extracted feature vectors of each of the utterances were presented at
the input layer and the most probable speech and speaker identified at the output layer. Since the back
propagation method usually requires provision of a target value that is used during training, which for speech
recognition, is usually not available, we therefore set the target output to “1” for the correct speech signals
(utterance) and “0” for others. This somehow reinforced the correct output and weakened the wrong output. The
network was trained by adjusting the values of the connection weights between the network elements in
accordance with equation (7) [14, 18]:
w_{ij}(k+1) = w_{ij}(k) - \alpha \, \frac{\partial E}{\partial w_{ij}} + \beta \left[ w_{ij}(k) - w_{ij}(k-1) \right], \qquad (7)
where wij(.) is the weight matrix between the layers, E is the mean-square-error function for the training, β is the
momentum factor, and α is the learning rate, which represents the speed of the learning process. An important consideration in the ANN design is the choice of the learning rate, momentum factor, and number of epochs for the network training. These parameters were varied and their effects observed by repeating the test utterances for each variation. For example, when a smaller learning rate was used, the algorithm took significant time to converge because of the gradual learning process. Conversely, when a larger learning rate was used, the algorithm diverged because of the abruptness of the learning process. The learning rate was therefore tuned for stable learning
by updating the learning rate after each training phase in accordance with the expression below and an update
value φ (from 1% – 20%):
\alpha(k) = \alpha(k-1) \cdot \left( 1 \pm \varphi \right)^{1/k}. \qquad (8)
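A minimal sketch of the weight and learning-rate updates is given below, assuming the reconstructed forms of equations (7) and (8); since the exact form of equation (8) is uncertain from the source, the schedule shown should be read as an approximation.

```python
def update_weights(w, grad, w_prev, alpha, beta):
    """Weight update with momentum, equation (7):
    w(k+1) = w(k) - alpha * dE/dw + beta * [w(k) - w(k-1)]."""
    return w - alpha * grad + beta * (w - w_prev)

def update_learning_rate(alpha_prev, phi, k, accelerate=True):
    """Learning-rate schedule after training phase k, following the
    reconstructed equation (8): alpha(k) = alpha(k-1) * (1 +/- phi)^(1/k).
    phi is the update value (0.01 to 0.20); increase the rate while
    learning is stable, decrease it when training diverges.
    NOTE: the exact form of (8) is an assumption."""
    factor = 1.0 + phi if accelerate else 1.0 - phi
    return alpha_prev * factor ** (1.0 / k)
```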
Following a series of experiments on the variation of neurons in the layers and the different sets of utterances,
an ANN architecture with 20 neurons at the input layer and 30 neurons in the hidden layer was found suitable for the task. Table 1 presents a summary of the results of varying the hidden layer neurons, with the average
recognition rates and times for different utterances (1-word to 5-word utterances). This ANN architecture was
adopted for the voice recognition system.
Table 1 Results of data sets for the optimal ANN architecture design.
Test utterances (1-word to 5-word): "transform"; "aaron nichie"; "signal processing"; "voice recognition system"; "move volume control up and down"

Hidden layer neurons    Recognition rate (%)    Recognition time (s)
10                      40.91                   0.60
15                      45.45                   1.00
20                      59.00                   2.00
25                      72.73                   2.00
30                      77.27                   4.00
35                      77.27                   6.00
The GMM models the distribution of the speech feature vectors of an utterance as a weighted sum of Gaussian components,

p(x_n) = \sum_{k=1}^{M} \beta_k \, f(x_n; \mu_k, C_k), \qquad (9)

where M is the number of Gaussian components, x is the N-dimensional MFCC speech feature vector (x_1, x_2, \ldots, x_N), \beta_k (k = 1, \ldots, M) are the weights of the mixture components, and f(\cdot) is the multivariate Gaussian with mean vector \mu_k and covariance matrix C_k, where each Gaussian is given by:

f(x_n; \mu_k, C_k) = \frac{\exp\left[ -0.5 \, (x_n - \mu_k)^T C_k^{-1} (x_n - \mu_k) \right]}{(2\pi)^{N/2} \, |C_k|^{1/2}}, \quad 1 \le n \le N. \qquad (10)
To estimate the GMM parameters for a given set of N-dimensional MFCC feature vectors of an utterance, we first organized the data into a number of cluster centroids (e.g., 256) using the K-means clustering technique; the cluster centroids were further grouped into sets of 32, which were then passed to each component of the GMM. The expectation-maximization (EM) algorithm was then employed to obtain the maximum-likelihood (ML) parameter estimates. The iterative procedure first computes, in the expectation step, the responsibilities of the k-th Gaussian component at the current iteration using equation (11), followed by a maximization step in which the estimates for the next iteration are obtained from equation (12).
y(k, t) = \frac{\beta_k^i \, f(x_t; \mu_k^i, C_k^i)}{\sum_{j=1}^{M} \beta_j^i \, f(x_t; \mu_j^i, C_j^i)}, \qquad (11)

\mu_{k,j}^{i+1} = \frac{\sum_{t=1}^{T} y(k, t) \, x_{t,j}}{\sum_{t=1}^{T} y(k, t)}; \quad \beta_k^{i+1} = \frac{1}{T} \sum_{t=1}^{T} y(k, t); \quad C^{i+1}(k, j) = \frac{\sum_{t=1}^{T} y(k, t) \, x_{t,j}^2}{\sum_{t=1}^{T} y(k, t)} - \left( \mu_{k,j}^{i+1} \right)^2, \qquad (12)

where x_t (t = 1, \ldots, T) is the t-th feature vector of the utterance, x_{t,j} is its j-th component, and the superscript i denotes the iteration index.
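For illustration, a minimal NumPy sketch of one EM iteration under equations (11) and (12) is shown below, assuming diagonal covariance matrices (equation (12) updates C(k, j) per dimension j); the small variance floor is an added numerical safeguard, not part of the paper's formulation.

```python
import numpy as np

def em_step(X, beta, mu, var):
    """One EM iteration for a diagonal-covariance GMM, following the
    reconstructed equations (11) and (12).
    X: (T, D) MFCC feature vectors; beta: (M,) mixture weights;
    mu: (M, D) means; var: (M, D) diagonal variances."""
    T, D = X.shape
    M = beta.shape[0]
    # E-step, eq. (11): responsibilities y(k, t) of each component
    log_f = np.empty((M, T))
    for k in range(M):
        diff = X - mu[k]
        log_f[k] = (-0.5 * np.sum(diff ** 2 / var[k], axis=1)
                    - 0.5 * D * np.log(2.0 * np.pi)
                    - 0.5 * np.sum(np.log(var[k])))
    w = beta[:, None] * np.exp(log_f)          # beta_k * f(x_t; mu_k, C_k)
    y = w / w.sum(axis=0, keepdims=True)       # (M, T)
    # M-step, eq. (12): re-estimate weights, means, and variances
    Nk = y.sum(axis=1)                         # sum_t y(k, t)
    beta_new = Nk / T
    mu_new = (y @ X) / Nk[:, None]
    var_new = (y @ X ** 2) / Nk[:, None] - mu_new ** 2
    return beta_new, mu_new, np.maximum(var_new, 1e-6)  # variance floor
```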
To establish the number of Gaussian distributions needed for the GMM speaker model in the recognition task, we performed experiments using varying numbers of Gaussian distributions and MFCC feature vectors from the utterance "increase volume upwards" for two speakers. Table 2 shows the recognition rates obtained with varying numbers of Gaussians. Based on the experimental results, 20 Gaussians were found adequate for the GMM speaker models.
Table 2 Number of Gaussians and recognition rates
Following the computations, the GMM model that yields the highest LLGM value determines the recognized speaker, because that model gives the highest probability of producing the speech features of the spoken utterance. Once the speaker is identified, the features of the selected speaker are extracted from the database and submitted to the decision system for verification.
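A sketch of this selection step is shown below: the feature vectors of the test utterance are scored against each enrolled speaker's GMM, and the model with the highest average log-likelihood (LLGM) is selected. The layout of the models dictionary is hypothetical.

```python
import numpy as np

def avg_log_likelihood(X, beta, mu, var):
    """Average log-likelihood (the LLGM score) of feature vectors X
    under a diagonal-covariance GMM speaker model."""
    M, D = mu.shape
    ll = np.empty((M, X.shape[0]))
    for k in range(M):
        diff = X - mu[k]
        ll[k] = (-0.5 * np.sum(diff ** 2 / var[k], axis=1)
                 - 0.5 * D * np.log(2.0 * np.pi)
                 - 0.5 * np.sum(np.log(var[k])))
    # log sum_k beta_k f(x_t) per frame, then average over frames
    frame_ll = np.logaddexp.reduce(np.log(beta)[:, None] + ll, axis=0)
    return frame_ll.mean()

def identify_speaker(X, models):
    """Return the enrolled speaker whose GMM gives the highest LLGM.
    `models` maps speaker name -> (beta, mu, var); hypothetical layout."""
    return max(models, key=lambda s: avg_log_likelihood(X, *models[s]))
```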
The decision system computes the correlation coefficient between the outputs of the two frameworks as:

r = \frac{\sum_{k=1}^{N} \left( X_{GMM,k} - E[X_{GMM}] \right) \left( X_{ANN,k} - E[X_{ANN}] \right)}{\sqrt{\sum_{k=1}^{N} \left( X_{GMM,k} - E[X_{GMM}] \right)^2 \, \sum_{k=1}^{N} \left( X_{ANN,k} - E[X_{ANN}] \right)^2}}, \qquad (14)
where E[·] denotes the mean value. When there is a strong correlation between the outputs of the two frameworks at a significance level of less than 5%, the decision system considers the speaker recognized, and the name of the recognized speaker is extracted from the enrollment database and displayed on the graphical user interface, as in Figure 5 below.
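A minimal sketch of this decision rule is given below; the correlation threshold r_min is an assumed value, since the text specifies only the 5% significance level.

```python
from scipy.stats import pearsonr

def verify_speaker(x_gmm, x_ann, alpha=0.05, r_min=0.9):
    """Decision stage built on equation (14): accept the speaker when
    the GMM and ANN outputs are strongly correlated at a significance
    level below 5%.  r_min = 0.9 is an assumed threshold for a
    'strong' correlation; the text does not state its exact value."""
    r, p_value = pearsonr(x_gmm, x_ann)
    return (r >= r_min) and (p_value < alpha)
```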
Table 3 Test results for voice recognition of speakers using 5-word utterances
Speakers      1st Attempt        2nd Attempt        3rd Attempt        4th Attempt
Aaron Recognition Recognition Recognition Recognition
Yonny False Rejection Recognition Recognition Recognition
Adjoa False Rejection False Rejection Recognition False Acceptance
Barbara Recognition Recognition False Acceptance Recognition
Louis Mark Recognition Recognition Recognition Recognition
Naa Kai Recognition Recognition Recognition Recognition
Rockson Recognition Recognition Recognition Recognition
Sarpong False Acceptance False Acceptance False Rejection False Acceptance
Fiawoo Recognition Recognition Recognition Recognition
Nelson Recognition Recognition Recognition False Acceptance
Table 4 Correlation coefficient evaluation based on 5-word utterances
Speakers      1st Attempt        2nd Attempt        3rd Attempt        4th Attempt
Aaron 99.24% 98.22% 98.66% 98.77%
Yonny 32.56% 99.49% 89.19% 88.33%
Adjoa 24.23% 33.75% 89.20% 79.75%
A summary of the average speaker recognition rates for the 30 speakers used in testing the system with the 5-word and 20-word utterances is shown in Figure 6 below. Figure 6(a) shows that with the 30 testing speech samples, the system is able to adequately recognize the voices of the speakers at a success rate of 77%, with false acceptance and false rejection rates of 9% and 14%, respectively, when 5-word utterances similar to the trained data sets were used. The recognition accuracy could be improved further if more training data were used. Since the characteristics of the microphone play an important role in the quality of the recognition, it is also likely that the measured performance could be improved. The recognition rate of 77%, however, fell to 43% when the length of the utterances was increased to 20 words, as depicted in Figure 6(b). The reduction may be attributed to the increased complexity of the large-sized utterances, which affected the learning of relationships. In addition, the mismatch between the training and testing of large-sized speech patterns tends to be significant owing to increased variability, which affects the density distribution and subsequently the recognition rates. The results in Figure 6(c), on the other hand, show
recognition performance for the case of 20-word utterances that are unknown to the training system. The
speaker recognition rates rapidly declined to an average of 18% with high false rejection and false acceptance
rates of 55% and 27%, respectively. The extremely low recognition rates indicate that the power spectral density distributions of the testing speech data are not fully consistent with those of the training speech utterances. Although the recognition rate of 18% may be too low for application in speaker-independent recognition systems, it also suggests that the system may be adequate for identifying impostors in a speaker-dependent recognition system.
4. Conclusion
In this work we have discussed the results of a speaker recognition system based on continuous speech utterances and the combined frameworks of the ANN and the GMM. We have demonstrated the system through tests of speech utterances from 30 different speakers, each providing four different utterances; the results show the ability of the system to recognize speakers with a success rate of 77% for 5-word utterances and 43% for 20-word utterances when the utterances are known to the training system. For speech utterances unknown to the system, a recognition rate of only 18% was achieved for 20-word utterances. This low rate nevertheless makes it possible for the system to detect impostors when used as a speaker-dependent system, though much lower rates may be required for efficiency. The ability to adequately recognize speakers from their continuous speech waveforms may find useful applications in systems that require detection of speakers in natural conversation environments. In further studies we are expanding the database of speakers to run more tests with varying lengths of utterances and to extend the application to the recognition of speakers using local Ghanaian languages. We are also investigating unsupervised self-organizing ANN architectures to improve the autonomous classification and learning accuracy.
Acknowledgments
The authors wish to acknowledge the support and cooperation received from over 40 students of the University
of Ghana and the Faculty of Engineering Sciences for making available the voice recordings for the training and
testing of the voice recognition system.
References
[1] Rabiner, L. R., “Applications of speech recognition in the area of telecommunication”, IEEE Proc., 1997, pp. 501-510.
[2] Picone, J. W., “Signal modeling techniques in speech recognition,” IEEE Proc., Vol. 81, No. 9, 1993, pp. 1215-1247.
[3] Campbell, J. P. Jr., "Speaker recognition: A tutorial", IEEE Proc., Vol. 85, No. 9, 1997, pp. 1437-1462.
[4] Morgan, N., and Bourlard, H., "Continuous speech recognition using multilayer perceptrons with hidden Markov models",
International Conference on Acoustics, Speech and Signal Processing, Albuquerque, 1990, pp. 413-416.
[5] Bourlard, H., and Morgan, N., “Continuous speech recognition by connectionist statistical methods”, IEEE Trans on Neural Networks,
Vol. 4, No. 6, 1993, pp. 893-909.
[6] Nichie, A., “Voice recognition system using artificial neural network”, Bachelor Thesis, Computer Engineering Department,
University of Ghana, Legon, June 2012.
[7] Huang, X. D., Ariki, Y., and Jack, M., “Hidden Markov models for speech recognition”, Edinburgh University Press, Edinburgh, 1990.
[8] Rabiner, L. R., “A tutorial on hidden Markov models and selected applications in speech recognition”, IEEE Proc., Vol. 77, 1989, pp.
257-286.
[9] Ramesh, P., and Wilpon, J. G., “Modeling state durations in hidden Markov models for automatic speech recognition”, IEEE , Vol. 9,
1992, pp. 381-384.
[10] Katagiri, S., and Chin-Hui, L., “A new hybrid algorithm for speech recognition based on HMM segmentation and learning vector
quantization”, IEEE Trans on Speech and Audio Processing, Vol. 1, No. 4, 1993, pp. 421-430.
[11] Frikha, M., Ben Hamida, A., and Lahyani, M., “Hidden Markov models (HMM) isolated-word recognizer with optimization of
acoustical and modeling techniques”, Int. Journal of Physical Sciences, Vol. 6, No. 22, 2011, pp. 5064-5074.
[12] Johansen, F. T., “A comparison of hybrid HMM architectures using global discriminative training”, Proceedings of ICSLP,
Philadelphia, 1996, pp. 498-501.
[13] Bengio, Y. “Neural network for speech and sequence recognition”, Computer Press, London, 1996.
[14] Lippmann, R. P., "Review of neural networks for speech recognition", Neural Computation, Vol. 1, 1989, pp. 1-38.
[15] Renals, S., and Bourlard, H., “Enhanced phone posterior for improving speech recognition”, IEEE Trans on Speech, Audio, Language
Processing, Vol.18, No. 6, 2010, pp. 1094-1106.
[16] Yegnanarayana, B., and Kishore, S., “ANN: an alternative to GMM for pattern recognition”, Neural Networks, 2002, pp. 459-469.
[17] Biing-Hwang, J., Wu, C., and Chin-Hui, L., “Minimum classification methods for speech recognition”, IEEE Transactions on Speech
and Audio Processing, Vol.5, No. 3, 1997, pp. 257-265.
[18] Haykin, S. O., “Neural networks and learning machines”, 3rd Ed., Prentice Hall, 2008.
[19] Riis, S. K., and Krogh, A., “Hidden neural networks: A framework for HMM/NN hybrids”, International Conference on Acoustics,
Speech, and Signal Processing, Munich, 1997, pp. 3233-3236.
[20] Trentin, E., and Gori, M., “Robust combination of neural network and hidden Markov models for speech recognition”, IEEE
Transactions on Neural Network, Vol. 14, No. 6, 2003, pp. 1519-1531.
[21] Reynolds, D. A., Quatieri, T. F., and Dunn, R. B., "Speaker verification using adapted Gaussian mixture speaker models", Digital Signal Processing, Vol. 10, No. 1-3, 2000, pp. 19-41.
[22] Reynolds, D. A., and Rose, R. C., “Robust text-independent speaker identification using Gaussian mixture speaker models”, IEEE
Transactions on Speech and Audio Processing, Vol 3, No. 1, 1995, pp. 72-83.
[23] Brown, J. C., and Smaragdis, P., “Hidden Markov and Gaussian mixture models for automatic call classification”, Journal of Acoustic
Society of America, Vol 125, No 6., 2009, pp. 221-224.
[24] Furui, S., “Speaker dependent feature extraction, recognition and processing techniques”, Speech Communication, Vol 10, No. 5-6,
1991, pp. 505-520.
[25] Furui, S., “Cepstral analysis technique for automatic speaker verification”, IEEE Trans on Acoustics, Speech and Signal Processing,
Vol 29, No. 2, 1981, pp. 254-272.
[26] Brookes, M., “Voicebox: speech processing toolbox for Matlab” Imperial College,
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, (March 3, 2012)
[27] Linde, Y., Buzo, A., and Gray, R., “An algorithm for vector quantizer design”, IEEE Transactions on Communications, Vol. 28, 1980,
pp. 84-95.
[28] Xuan, G., Zhang, W., Chai, P., “EM Algorithms of Gaussian mixture model and Hidden Markov model”, IEEE, 2001, pp. 145-148.
[29] Dempster, A., Laird, N., Rubin, D., “Maximum Likelihood from incomplete data via the EM algorithm”, Journal of Royal Statistical
Society, Vol 39, No. 1, 1977, pp. 1-38.