Text Independent Amharic Language Dialect Recognition: A Hybrid Approach of VQ and GMM
Abstract
The Amharic language has four main dialects: Gojjam (Gojjamegna), Wollo (Wollogna),
Shewa (Shewagna) and Gonder (Gonderegna). In this paper, a hybrid approach of VQ
(vector quantization) and GMM (Gaussian mixture models) is used for classifying the
dialects of the Amharic language. For our data set, a total of 100 speakers for each group
of dialects are considered. Mel frequency cepstral coefficient (MFCC) feature vectors are
used to recognize the dialects of speakers. To see the effect of the number of these feature
vectors on the performance of the system, MFCC, ∆MFCC and ∆∆MFCC vectors are
used. When 25 speakers are considered from the dialect areas, 85.9% accuracy is
achieved. When the number of speakers is increased to 100, which is the maximum
number of dialect speakers for our experiment, 92.7% accuracy is achieved for the given
dialects.
1. Introduction
Speech is the most common and natural means of communication among humans. When
a language is used by people from different regions, it can be analyzed to see how words
are used with different expressions; even when speakers use a standard form of a word,
differences in the spectral properties of the sound produced can be observed [1]. A dialect
is a regional or social variety of a language distinguished by the speech patterns of a
region. Ethiopia has 83 different languages with up to 200 different dialects spoken [2].
The largest ethnic and linguistic groups are the Oromos, Amharas and Tigrayans. Like
other languages in the world, Amharic has many varieties, and its dialects are spoken
over the entire Amharic-speaking region, so it is important to be able to distinguish
them. There are four Amharic dialects: Gonder, Gojjam, Wollo and Shewa [3].
4. Signal Preprocessing
In dialect recognition, the first phase is preprocessing, which converts the analog speech
signal captured at recording time into digital form. The properties of a speech signal
change with time, so the signal is divided into a sequence of uncorrelated segments, or
frames, and the sequence is processed as if each frame had fixed properties. First, the
continuous dialect speech signal D(t) produced by the speaker and sensed by the
microphone is converted to the discrete domain. Second, the speech signal is segmented
into frames in order to obtain quasi-stationary units of speech. Finally, a pre-emphasis
filter is applied to each frame generated in the previous step. Once this procedure has
been performed, the speech frames are ready to enter the feature extraction subsystem.
Diagrammatically, it can be represented as follows.
Here D[n] is the signal after conversion to digital format, whereas Di[n] is the signal
after pre-emphasis is applied to it. Si[n], the final output of the pre-processing phase, is
the signal segmented into overlapping frames. Each of these steps is discussed next.
4.2. Pre-Emphasis
Due to the structure of the voice production system, damping occurs in high-frequency
regions. For that reason, the spectrum of voiced regions is compensated by a pre-emphasis
filter, which amplifies the high-frequency regions [8]. Widely used values of the
pre-emphasis coefficient range from 0.95 to 0.97, and the filter is given as,
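The first-order filter commonly used for this purpose is y[n] = x[n] - a·x[n-1], where a is the pre-emphasis coefficient. The following is a minimal sketch assuming that form; the function name `pre_emphasize` and the default a = 0.97 are illustrative choices and are not taken from the paper.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] (assumed first-order form).

    The first sample has no predecessor and is passed through unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size == 0:
        return signal
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```

A constant (low-frequency) input is strongly attenuated after the first sample, while rapid sample-to-sample changes pass through, which is the high-frequency boost described above.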
4.5. Windowing
The pre-emphasized signal is divided into short frame blocks, and a window is applied
to these frames. The frame length can vary but, based on empirical results, is often chosen
between 20 and 30 ms [13], with an overlap of 10 ms. This length depends on the specific
feature extraction method that is applied. The window function is preferably not
rectangular, as a rectangular window can lead to distortion due to its vertical frame
boundaries [8].
The output signal of the windowing block, x_w[n], can be calculated as
$$x_w[n] = x[n] \cdot w[n] \qquad (2)$$
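The framing and windowing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Hamming window is an assumption (the text only says the window should not be rectangular), and the function name `frame_and_window`, the 25 ms frame length and the 16 kHz sample rate used in the usage note are hypothetical values chosen within the ranges quoted above.

```python
import numpy as np

def frame_and_window(x, sample_rate, frame_ms=25.0, overlap_ms=10.0):
    """Split x into overlapping frames and multiply each by a window w[n].

    Assumption: w[n] is a Hamming window; the paper does not name the window.
    Returns an array of shape (n_frames, frame_len).
    """
    x = np.asarray(x, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    overlap = int(round(sample_rate * overlap_ms / 1000.0))
    hop = frame_len - overlap              # step between frame starts
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * window                 # x_w[n] = x[n] * w[n], per frame
```

For one second of audio at 16 kHz, this yields 400-sample frames that start every 240 samples, so consecutive frames share 160 samples (10 ms).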
used classification methods in speech recognition [15-16]. A speech signal does not
contain only linguistic information; it also carries information related to the speaker,
such as age, gender and emotional state [17].
5. Dialects Recognition
For dialect recognition, a group of S speakers S = {1, 2, ..., S} is represented by the
GMMs λ1, λ2, ..., λS. The objective is to find the dialect model which has the maximum a
posteriori probability for a given observation X:
$$\hat{S} = \arg\max_{1 \le k \le S} \Pr(\lambda_k \mid X) \qquad (6)$$
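The decision rule can be illustrated with a short sketch. It assumes equal prior probabilities for all models, so that maximizing the posterior reduces to maximizing the likelihood of the observed frames, and it assumes diagonal-covariance GMMs with independent frames. The function names and the `(weights, means, variances)` model layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of frames X (T x D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                              # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)     # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)     # (T, M)
    log_weighted = log_comp + np.log(weights)
    # Log-sum-exp over mixture components, done stably.
    peak = log_weighted.max(axis=1, keepdims=True)
    log_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(log_frame.sum())

def classify(X, models):
    """Return the key of the model with the highest log-likelihood for X.

    With equal priors (an assumption), this is the MAP decision.
    """
    scores = {name: gmm_log_likelihood(X, *params) for name, params in models.items()}
    return max(scores, key=scores.get)
```

Working in the log domain avoids numerical underflow when the likelihoods of many frames are multiplied together.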
coefficients are also known as MFCC, ∆MFCC and ∆∆MFCC. Below, the results
obtained from the experiments are explained.
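Before the results, it may help to see how ∆MFCC and ∆∆MFCC vectors are typically derived from the static coefficients. The following is a minimal sketch that assumes the common regression formula over a window of two frames with edge padding; neither choice is stated in the paper, and the names `delta` and `stack_features` are hypothetical. It shows how 13 static coefficients grow to 26 and then 39 dimensions.

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta of a (T x D) feature matrix.

    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2),
    with the first and last frames repeated at the edges (an assumption).
    """
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for n in range(1, N + 1):
        out += n * (padded[N + n: N + n + T] - padded[N - n: N - n + T])
    return out / denom

def stack_features(mfcc):
    """Concatenate static, delta and delta-delta features column-wise."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return np.hstack([mfcc, d1, d2])
```

Using only the first 26 columns of the stacked matrix corresponds to MFCC plus ∆MFCC, and all 39 columns add ∆∆MFCC.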
Here, we used the first 13 MFCC coefficients for both training and testing. As we can
see from the above table, the experiment was conducted for a varying number of dialect
speakers, the minimum being 25 and the maximum 100. In the case of VQ, performance
decreases as the number of dialect speakers increases. This is because, as the number of
speakers increases, the probability of having similar templates increases. When 13 MFCC
coefficients were considered with 25 speakers, 65.2% accuracy was achieved. After
experimenting with 13 MFCC coefficients, we conducted another experiment to see the
performance of the system when the number of coefficients is increased to 26, and
obtained some improvement over the first experiment: the percentage of correctly
classified dialect speakers tends to increase. We then examined the result when 39 MFCC
coefficients are used, and obtained 69.9% success for 25 individuals in the given dialects.
When the number of speakers was increased to 100, which is the maximum number of
dialect speakers for our experiment, we obtained 53.1% success using 39 MFCC
coefficients.
For GMM, as the number of speakers increases, the classifier's accuracy also increases.
In addition, as the number of speakers increases, the accompanying increase in similarity
helps the system make a correct decision when recognizing the dialect of a speaker. For
25 individuals, considering the first 13 MFCC coefficients, GMM achieved 61.4%
accuracy, and when 100 speakers with 39 MFCC coefficients were considered, 79.9%
accuracy was achieved.
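The sensitivity of VQ to similar templates can be seen from how a codebook-based classifier typically decides. The sketch below is an illustration only: it assumes one codebook per dialect and squared Euclidean distance, neither of which is specified in this section, and the function names are hypothetical.

```python
import numpy as np

def average_distortion(X, codebook):
    """Mean squared distance from each frame in X (T x D) to its nearest codeword."""
    X = np.asarray(X, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)   # (T, K)
    return float(d2.min(axis=1).mean())

def classify_vq(X, codebooks):
    """Return the key of the codebook with the smallest average distortion."""
    return min(codebooks, key=lambda name: average_distortion(X, codebooks[name]))
```

When two codebooks contain nearly identical codewords, their average distortions for the same utterance become nearly equal, so small variations in the input can flip the decision.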
The last experiment was conducted to see the performance of the hybrid approach
combining VQ and GMM. In the hybrid approach, as the number of speakers increases, the
identification accuracy also increases. In this experiment, when 25 speakers with 13
MFCC coefficients were considered, 80.2% success was achieved. Similarly, when the
number of individuals was increased to 100 with 39 MFCC coefficients, 92.7% accuracy
was achieved.
References
[1] S. Sinha, A. Jain and S. S. Agrawal, "Acoustic phonetic feature based dialect identification in Hindi
speech," International Journal on Smart Sensing and Intelligent Systems, vol. 8, no. 1, (2015) March.
[2] http://www.ethiopiantreasures.co.uk/pages/language.htm.
[3] http://www.languagecomparison.com/en/amharic-dialects/model-58-6.
[4] B. Gamback and L. Asker, "Experiences with Developing Language Processing Tools and Corpora for
Amharic".
[5] http://joshuaproject.net/languages/amh.
[6] http://www.davidpbrown.co.uk/help/top-100-languages-by-population.html.
[7] I. Y. Kelbesa, "An Intelligent Text Independent Speaker Identification using VQ-GMM model based
Multiple Classifier System," Università degli Studi di Brescia, (2014).
[8] S. Patra, "Robust Speaker Identification System," Super Computer Education and Research Centre,
Indian Institute of Science Bangalore 560 012, (2007).
[9] G. Saha, S. Chakroborty and S. Senapati, "A New Silence Removal and Endpoint Detection Algorithm
for Speech and Speaker Recognition Applications," Department of Electronics and
Electrical Communication Engineering, Indian Institute of Technology, Kharagpur-721 302,
India, (2014).
[10] R. Islam and F. Rahman, "Improvement of Text Dependent Speaker Identification System Using Neuro-
Genetic Hybrid Algorithm in Office Environmental Conditions," JCSI International Journal of Computer
Science Issues, vol. 1, (2009).
[11] S. M. Siniscalchi, F. Gennaro and S. Andolina, "Embedded Knowledge-based Speech Detectors for
Real-Time Recognition Tasks," Dipartimento di Ingegneria Informatica, Università di Palermo,
V.le delle Scienze (Edif. 6), 90128 Palermo, Italy.
[12] I. Y. Kelbesa, "An Intelligent Text Independent Speaker Identification using VQ-GMM model based
Multiple Classifier System," Università degli Studi di Brescia, (2014).
[13] L. P. Heck, "Automatic Speaker Recognition Recent Progress, Current Applications, and Future
Trends," MIT Lincoln Laboratory, (2000).
[14] E. Yücesoy and V. V. Nabiyev, "Gender Identification of a Speaker Using MFCC and GMM".
[15] M. H. Sedaaghi, "A Comparative Study of Gender and Age Classification in Speech Signals", Iranian
Journal of Electrical & Electronic Engineering, vol. 5, no. 1, (2009) March, pp. 1- 12.
[16] R. Djemili, H. Bourouba and M. C. A. Korba. "A speech signal based gender identification system using
four classifiers." Multimedia Computing and Systems (ICMCS), 2012 International Conference on.
IEEE, (2012).
[17] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood Cliffs (N.J.), Prentice
Hall Signal Processing Series, (1993).
[18] A. Rajsekha, "Real time speaker recognition using MFCC and VQ," Department of Electronics &
Communication Engineering National Institute of Technology Rourkela – 769008, (2008).
[19] S. Selvanidhyananthan and S. Kumara, "Language and Text-Independent Speaker Identification System
Using GMM," WSEAS Transactions on Signal Processing, vol. 9, no. 4, (2013) October.
Authors
Abrham Debasu Mengistu was born on February 04, 1985. He
received his B.Sc. degree in Computer Science from Bahir Dar
University and his M.Sc. in Computer Science from Bahir Dar
University, School of Computing and Electrical Engineering, BiT,
Ethiopia. He has published 6 research papers in international journals.
His main research interests are signal processing, image processing and
robotics. He is a life member of professional societies such as MSDIWC.