


INTERSPEECH 2005

A Speaker Independent Continuous Speech Recognizer for Amharic

Hussien Seid
Computer Science & Information Technology
Arba Minch University
PO Box 21, Arba Minch, Ethiopia
[email protected]

Björn Gambäck
Userware Laboratory
Swedish Institute of Computer Science AB
Box 1263, SE-164 29 Kista, Sweden
[email protected]

Abstract

The paper discusses an Amharic speaker independent continuous speech recognizer based on an HMM/ANN hybrid approach. The model was constructed at a context dependent phone part sub-word level with the help of the CSLU Toolkit. A promising result of 74.28% word and 39.70% sentence recognition rate was achieved. These are the best figures reported so far for speech recognition for the Amharic language.

1. Introduction

The general objective of the present research was to examine and demonstrate the performance of a hybrid HMM/ANN system for a speaker independent continuous Amharic speech recognizer. Amharic is the official language of communication for the federal government of Ethiopia and is today probably the second largest language in the country (after Oromo) and quite possibly one of the five largest on the African continent. It is estimated to be the mother tongue of more than 17 million people, with at least an additional 5 million second language speakers. Still, just as for many other African languages, Amharic has received preciously little attention from the speech processing research community; even though recent years have seen an increasing trend to investigate applying speech technology to languages other than English, most of the work is still done on very few, mainly European and East-Asian, languages.

The Ethiopian culture is ancient, and so are the written languages of the area, with Amharic using its very own script. This has caused some problems in the digital age: even though there are several computer fonts for Amharic, and an encoding of Amharic was incorporated into Unicode in 2000, the language still has no widely accepted computer representation. In recent years there has been an increasing awareness that Amharic speech and language processing resources must be created, as well as digital information access and storage.

The present paper is a step in that direction. It is laid out as follows: Section 2 introduces the HMM/ANN hybrid ASR paradigm. Section 3 discusses various aspects of Amharic and some previous efforts to apply speech technology to the language. Then Section 4 describes the actual experiments with constructing, evaluating, and testing an Amharic Automatic Speech Recognition system using the CSLU Toolkit [1].

2. HMM/ANN hybrids

Commonly, HMM-based speech recognizers have shown the best performance. On the positive side, this dominant paradigm is based on a rich mathematical framework which allows for powerful learning and decoding methods. In particular, HMMs are excellent at treating temporal aspects by providing good abstractions for sequences and a flexible topology for statistical phonology and syntax. However, HMMs have some drawbacks, especially for large vocabulary speaker independent continuous ASR. The main disadvantage is a relatively poor discrimination power. In addition, HMMs enforce some practical requirements for distributional assumptions (e.g., uncorrelated features within an acoustic vector) and typically make first order Markov model assumptions for phone or sub-phone states, while ignoring the correlation between acoustic vectors [2].

In effect, HMMs adopt a hierarchical scheme, modeling a sentence as a sequence of words, and each word as a sequence of sub-word units. An HMM can be defined as a stochastic finite state automaton, usually with a left-to-right topology when used for speech. Each probability is approximated using maximum likelihood techniques. Still, these techniques have been observed to give poor discrimination, since they maximize the likelihood of each individual node independently of the others. Neural network classifiers, on the other hand, have shown good discrimination power, typically require fewer assumptions, and can easily be integrated in non-adaptive architectures. This is the point behind changing the pure HMM approach to the hybrid HMM/ANN model, by using an ANN to augment the ASR system [3]. The HMM is used as the main structure of the system, to cope with the temporal alignment properties of the Viterbi algorithm, while the ANN is used in a specific subsystem of the recognizer to address static classification tasks. This has shown performance improvements over pure HMMs: Fritsch & Finke [4] describe a tree-structured hierarchical HMM/ANN system which outperformed an HMM on Switchboard.

In an HMM/ANN model, a neural network of multi-layered perceptrons is given an input vector of acoustic observation values, o_t, and computes a vector of output values which approximate a-posteriori state probabilities. Commonly, nine frames are given as the input of the network: four consecutive frames before, four frames after, and one frame at time t, in order to provide the ANN with more contextual data. The network then has one output for each phone, with the sum of all output units restricted to one. This makes it possible to calculate the a-posteriori probability of a state q_j conditioned on the acoustic input: p(q_j | o_t). Generally, an ASR system has a front end in which the natural speech wave is digitized and parameterized for the recognizer. The recognizer has a neural net which is trained on these digitized and parameterized data. After training, the neural net produces estimates of the observation probabilities for the HMM states. The HMM uses these probabilities and the language model to compute the probability of a sequence of symbols given the observation sequence. Finally, the recognizer uses decoders to generate the recognized symbols as output.

September 4–8, Lisbon, Portugal



3. Amharic Speech Processing

Ethiopia is, with about 70 million inhabitants, the third most populous African country and harbours some 80 different languages. Three of these are dominant: Oromo, a Cushitic language, is spoken in the South and Central parts of the country and written using the Latin alphabet; Tigrinya is spoken in the North and in neighbouring Eritrea; and Amharic is spoken in most parts of the country, but predominantly in the Eastern, Western, and Central regions. Amharic and Tigrinya are Semitic languages and thus distantly related to Arabic and Hebrew.

3.1. The Amharic language

Following the Constitution of 1994, Ethiopia is divided into nine fairly independent regions, each with its own nationality language. However, Amharic is the language for country-wide communication and was also for a long period the principal language for literature and the medium of instruction in primary and secondary schools of the country (while higher education is carried out in English). Amharic speakers are mainly Orthodox Christians, with Amharic and Tigrinya drawing common roots to the ecclesiastic Ge'ez still used by the Coptic church — both languages are written horizontally and left-to-right using the Ge'ez script. Written Ge'ez can be traced back to at least the 4th century A.D. The first versions of the language included consonants only, while the characters in later versions represent consonant-vowel (CV) phoneme pairs.

Amharic words use consonantal roots, with vowel variation expressing differences in interpretation. In modern written Amharic, each syllable pattern comes in seven different forms (called orders), reflecting the seven vowel sounds. The first order is the basic form; the other orders are derived from it by more or less regular modifications indicating the different vowels. There are 33 basic forms, giving 7 * 33 syllable patterns (syllographs), or fidEls. Two of the base forms represent vowels in isolation, but the rest are for consonants (or semi-vowels classed as consonants) and thus correspond to CV pairs, with the first order being the base symbol with no explicit vowel indicator (though a vowel is pronounced: C+/9/). The writing system also includes four (incomplete, five-character) orders of labialised velars and 24 additional labialised consonants. In total, there are 275 fidEls. See, e.g., [5] for an introduction to the Ethiopian writing system.

The Amharic writing system uses multitudes of ways to denote compound words, and there is no agreed-upon spelling standard for compounds. As a result of this — and of the size of the country, leading to vast dialectal dispersion — lexical variation and homophony are very common. In addition, not all the letters of the Amharic script are strictly necessary for the pronunciation patterns of the spoken language; some were simply inherited from Ge'ez without having any semantic or phonetic distinction in modern Amharic. There are many cases where numerous symbols are used to denote a single phoneme, as well as words that have extremely different orthographic forms and slightly distinct phonetics, but the same meaning. Most labialised consonants are, for example, basically redundant, and there are actually only 39 context-independent phonemes (monophones): of the 275 symbols of the script, only about 233 remain if the redundant ones are removed.

In contrast to the character redundancy, there is no mechanism in the Amharic writing system to mark gemination of consonants. The words /w5n5/ (swimming) and /w5nn5/ (main, core) are written identically, but get two completely different meanings by geminating the consonant /n/. This requires different reference models in the database for the multiple forms of the sound, depending on the gemination. (Another problem is an ambiguity with the 6th order characters: whether they are vowelled or not. However, this is not relevant to this work.)

3.2. Previous work

This study aims at investigating and testing the possibility of developing speaker independent continuous Amharic speech recognition systems using a hybrid of HMM and ANN. Speech and language technology for the languages of Ethiopia is still very much uncharted territory; however, on the language processing side some initial work has been carried out, mainly on Amharic word formation and information access. See [6] or [7] for short overviews of the efforts made so far to develop language processing tools for Amharic.

Research conducted on speech technology for Ethiopian languages has been even more limited. Laine [8] made a valuable effort to develop an Amharic text-to-speech synthesis system, and Tesfay [9] did similar work for Tigrinya.¹ Solomon [10] built speaker dependent and speaker independent HMM-based isolated consonant-vowel syllable recognition systems for Amharic. He proposed that CV-syllables would be the best candidates for the basic recognition units for Amharic.

Solomon's work was extended by Kinfe [11], who used the HTK Toolkit to build HMM word recognizers at three different sub-word levels: phoneme, tied-state triphone, and CV-syllable. Kinfe collected a 170 word vocabulary from 20 speakers. He considered a subset of the Amharic syllables, concentrating on the combination of 20 phonemes with the seven vowels, or in total 140 CV-units. Kinfe's training and test sets both consisted of 50 discrete words. Contrary to Solomon's predictions, the performance of the syllable-level recognition was very bad (for unclear reasons), and Kinfe abandoned it in favour of the phoneme- and triphone-based recognizers. For the latter two he reports an isolated word recognition accuracy of 83.1% resp. 78.0% on speaker dependent models, while the speaker independent models gave 75.5% for phoneme-based models and 77.9% isolated word accuracy for tied-state triphone models.

Molalgne [12] tried to compare HMM-based small vocabulary speaker-specific continuous speech recognizers built using three different toolkits: CSLU, HTK, and the MSSTATE Toolkit from Mississippi State, but failed in setting up CSLU, so that only two toolkits were actually tested. He collected a corpus of 50 sentences with ten words (the digits) from a single speaker. While HTK was clearly faster than MSSTATE, the speaker dependent recognition performance of the two systems was comparable, with 82.5% resp. 79.0% word accuracy and 72.5% resp. 67.5% sentence accuracy for HTK resp. MSSTATE.

Martha [13] worked on a small vocabulary isolated word recognizer for a command and control interface to Microsoft Word, while Zegaye [14] continued the work on speaker independent continuous Amharic ASR. He used a pure HMM-based approach and reached 76.2% word accuracy and 26.1% sentence level accuracy. However, there is still a lot of work to be done towards achieving a full-fledged automatic Amharic speech recognition system. The intention of the present research was to use an HMM/ANN hybrid model approach as an alternative for better performance. For this we utilized an implementation of such a model in the CSLU Toolkit.

¹ In the text we follow the practice of referring to Ethiopians by their given names. However, the reference list follows European standard and also gives surnames (i.e., the father's given name for an Ethiopian).


4. An Amharic SR system

The aim of this research is to design a prototype speech recognizer for the Amharic language. The recognizer uses phonemes as base units, is designed to recognize continuous speech, and is speaker independent. In contrast to the pure HMM-based work done by Zegaye [14], the system implements the HMM/ANN hybrid model approach. The development process was performed using the CSLU Toolkit installed on the Microsoft Windows 2000 platform. Various preprocessing programs and script editors were used to handle vocabulary files.

4.1. The CSLU Toolkit

The CSLU Toolkit [1] was designed not only for speech recognition, but also for research and educational purposes in the area of speech and human-computer interaction. It is developed and maintained by the Center for Spoken Language Understanding, a research centre at the Oregon Graduate Institute of Science and Technology, Portland, and the Center for Spoken Language Research at the University of Colorado. The toolkit, which is available free of charge for educational, research, personal, and evaluation purposes under a license agreement, supports core technologies for speech recognition and speech synthesis, plus a graphical rapid application development environment for building spoken dialogue systems.

The toolkit supports the development of HMM- or HMM/ANN hybrid-based speech recognition systems. For this purpose it has many modules or tools interacting with each other in an environment called CSLU-HMM. The toolkit needs a consistent organization and naming of directories and files which has to be strictly followed. This is tedious work, but also clearly doable (still, this might have been the reason why Molalgne decided that it was not possible to use the CSLU Toolkit [12]).

4.2. Speech data

Apart from the specifics of the language itself, the main problem with doing speech recognition for an under-resourced language like Amharic is the lack of previously available data: no standard speech corpus has been developed for Amharic. However, we were able to use a corpus of 50 speakers recorded at 16 kHz sampling rate by Solomon [10]. 100 different sentences of read speech were recorded for each speaker.

The corpus was prepared and processed using SpeechView, a part of the CSLU Toolkit providing a graphical interface for preparing speech data. The tool is used to record, display, save, and edit speech signals in their wave format. It also provides spectrograms and other speech wave related data such as pitch and energy contours, neural net outputs, and phonetic labels. With the help of the SpeechView tool, one can easily collect and prepare speech data for training a recognizer. The process of annotating the speech waveform, which is the most tedious and difficult part of developing a speech recognition system, can be done at different transcription levels.

Ten spoken sentences each from ten female speakers were annotated at the phoneme level for the training corpus, and time-aligned word level transcriptions were generated automatically. Two more speakers were annotated for evaluation purposes. Long silences at the beginning and end of the wave files were trimmed off, and the boundaries of word-level transcriptions were adjusted accordingly.

A vocabulary file was created based on the pronunciation of each word in the data set and the parts of the phones. This gave a vocabulary of 778 words represented by 34 phones, which in turn were split into 57 phone parts: four phones were defined to consist of three parts each, 15 phones of two parts, and 15 of one part only. (The phone symbols, rendered in Ethiopic script in the original, are not reproduced here; each phone group was ordered internally according to frequency.)

4.3. Experiments

Thereafter a recognizer was created, the frame vectors were generated automatically in the toolkit, and the recognizer was trained on the phone part files. The ANN of the recognizer contained an output layer with the phone parts, while the input layer was a 180 node grid representing 20 features from each of nine time frames (t ± 4 * 10 ms).

The recognizer was evaluated on two sentences each from ten speakers who were all found in the training data (in total 20 sentences and 236 words). The results are shown in Table 1.

Itr  Subst  Insert  Delete  Word Acc  Snt Corr
15   13.62   4.89    5.83    75.66     42.31
16   13.62   5.83    5.83    74.72     42.31
17   13.62   4.89    6.83    74.67     41.72
18   14.61   4.89    5.83    74.67     42.31
19   15.56   3.89    4.89    75.66     41.72
20   11.67   5.79    4.89    77.65     42.90
21   11.67   5.83    4.89    77.61     42.90
22   14.61   5.83    5.83    73.73     41.13
23   13.62   4.89    4.89    76.61     42.90
24   13.62   2.93    5.79    77.66     42.90
25   14.61   2.93    4.89    77.57     42.31
26   14.61   4.89    4.89    75.62     42.31
27   15.56   3.89    4.89    75.66     42.31
28   12.66   3.89    4.89    78.56     44.07
29   12.66   5.83    4.89    76.62     42.31
30   12.66   4.89    4.89    77.56     42.90

Table 1: Recognition accuracy on known speakers.
Best result: 78.56% word and 44.07% sentence level accuracy.

Itr  Subst  Insert  Delete  Word Acc  Snt Corr
15   16.34   5.87    7.00    70.79     35.27
16   16.34   7.00    7.00    69.65     35.17
17   16.34   5.87    8.20    69.59     33.79
18   17.53   5.87    7.00    69.60     34.27
19   18.68   4.66    5.87    70.80     33.79
20   14.00   6.93    5.87    73.20     36.75
21   14.00   7.00    5.87    73.13     35.35
22   17.53   7.00    7.00    68.46     33.62
23   16.34   5.87    5.87    71.92     37.75
24   16.34   3.52    6.95    73.19     34.75
25   17.53   3.52    5.87    73.08     34.27
26   17.53   5.87    5.87    70.73     34.27
27   18.68   4.66    5.87    70.80     34.27
28   15.19   4.66    5.87    74.28     39.70
29   15.19   7.00    5.87    71.94     35.27
30   15.19   5.87    5.87    73.07     35.64

Table 2: Recognition accuracy on unknown speakers.
Best result: 74.28% word and 39.70% sentence level accuracy.


For each iteration, the columns in Table 1 give the percentage of substitutions, insertions, and deletions, as well as the word accuracy and the percentage of correct sentences. The best results (78.56% word level accuracy and 44.07% sentence correctness) were obtained after 28 iterations.

When the same recognizer was tested on another ten speakers who were not included in the training data, with two sentences each (218 words in total), the recognition rate degraded. As can be seen in Table 2, the best results were again obtained after the 28th iteration. The word accuracy was reduced by 4.28%, while the sentence level recognition rate was reduced by 4.37%, giving a 21.44% word level error rate and a 55.93% sentence level error rate.

Accordingly, the HMM/ANN hybrid recognizer gave a 2.36% decrease in word error rate and an 18.01% decrease in sentence error rate compared to Zegaye's purely HMM-based recognizer [14], which had 23.80% word and 73.94% sentence error rates. The relative error reduction compared to Zegaye's work is thus 9.92% at the word level and 24.36% at the sentence level.

5. Conclusions

The paper reported experiences with using the CSLU Toolkit to build a hybrid HMM/ANN speaker independent continuous speech recognizer for Amharic, the main language of Ethiopia. An annotated corpus was created from previously recorded speech data. Ten sentences each from twelve speakers were marked up at the phoneme level and a vocabulary of 778 words was created.

For speakers found in the training data, the best results obtained were 78.6% word and 44.1% sentence level accuracy. When tested on data from ten previously unseen speakers, the recognizer had a 74.3% word accuracy and 39.7% sentence accuracy; a relative error reduction of 24.4% compared to previous work on Amharic using pure HMM-based methods.

The CSLU Toolkit proved to be a good vehicle for developing hybrid HMM/ANN-based recognizers, and the experiments indicate that a better recognizer can be developed with further optimization efforts. However, the implementation of the toolkit on Windows needs some revisions: there were problems with fully downloading the Toolkit Installer, and after installation the integration with Windows required considerable effort.

6. Acknowledgements

This research was carried out at the Department of Information Science, Addis Ababa University, and could not have come into being without the help of Solomon Berhanu, who provided the corpus. Thanks to Zegaye Seifu and Kinfe Tadesse for constructive comments and to Marek F. and Clemente Fragoso Eduardo for help with fixing CSLU Toolkit implementation problems. The work was funded by the Faculty of Informatics at Addis Ababa University and the ICT support programme of SAREC, the Department for Research Cooperation at Sida, the Swedish International Development Cooperation Agency.

7. References

[1] J.-P. Hosom, R. Cole, M. Fanty, J. Schalkwyk, Y. Yan, and W. Wei, "Training neural networks for speech recognition," Webpage, Feb. 1999. [Online]. Available: speech.bme.ogi.edu/tutordemos/nnet training/tutorial.html

[2] H. Bourlard and N. Morgan, "Hybrid HMM/ANN systems for speech recognition: Overview and new research directions," in Adaptive Processing of Sequences and Data Structures, C. Giles and M. Gori, Eds. Springer-Verlag, 1997, pp. 389–417.

[3] F. Beaufays, H. Bourlard, H. Franco, and N. Morgan, "Neural networks in automatic speech recognition," in The Handbook of Brain Theory and Neural Networks, 2nd ed., M. Arbib, Ed. MIT Press, 2002, pp. 1076–1080.

[4] J. Fritsch and M. Finke, "ACID/HNN: Clustering hierarchies of neural networks for context-dependent connectionist acoustic modeling," in Proc. International Conference on Acoustics, Speech and Signal Processing. Seattle, Washington: IEEE, Apr. 1998, pp. 505–508.

[5] T. Bloor, "The Ethiopic writing system: a profile," Journal of the Simplified Spelling Society, vol. 19, pp. 30–36, 1995.

[6] Atelach Alemu, L. Asker, and Mesfin Getachew, "Natural language processing for Amharic: Overview and suggestions for a way forward," in Proc. 10th Conference 'Traitement Automatique des Langues Naturelles', vol. 2, Batz-sur-Mer, France, June 2003, pp. 173–182.

[7] Samuel Eyassu and B. Gambäck, "Classifying Amharic news text using Self-Organizing Maps," in Proc. 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, June 2005, Workshop on Computational Approaches to Semitic Languages.

[8] Laine Berhane, "Text-to-speech synthesis of the Amharic language," MSc Thesis, Faculty of Technology, Addis Ababa University, Ethiopia, 1998.

[9] Tesfay Yihdego, "Diphone based text-to-speech synthesis system for Tigrigna," MSc Thesis, Faculty of Informatics, Addis Ababa University, Ethiopia, 2004.

[10] Solomon Berhanu, "Isolated Amharic consonant-vowel syllable recognition: An experiment using the Hidden Markov Model," MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia, 2001.

[11] Kinfe Tadesse, "Sub-word based Amharic speech recognizer: An experiment using Hidden Markov Model (HMM)," MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia, June 2002.

[12] Molalgne Girmaw, "An automatic speech recognition system for Amharic," MSc Thesis, Dept. of Signals, Sensors and Systems, Royal Institute of Technology, Stockholm, Sweden, Apr. 2004.

[13] Martha Yifiru, "Automatic Amharic speech recognition system to command and control computers," MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia, 2003.

[14] Zegaye Seifu, "HMM based large vocabulary, speaker independent, continuous Amharic speech recognizer," MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia, 2003.


