Automatic Speech Segmentation in Syllable Centric Speech Recognition System
DOI 10.1007/s10772-015-9320-6
Received: 13 August 2015 / Accepted: 12 November 2015 / Published online: 21 November 2015
© Springer Science+Business Media New York 2015
Abstract Speech recognition is the process of understanding human or natural language speech by a computer. A syllable centric speech recognition system in this respect identifies the syllable boundaries in the input speech and converts them into the respective written scripts or text units. Appropriate segmentation of the acoustic speech signal into syllabic units is an important task in the development of a highly accurate speech recognition system. This paper presents an automatic syllable based segmentation technique for segmenting continuous speech signals in Indian languages at syllable boundaries. To analyze the performance of the proposed technique, a set of experiments is carried out on different speech samples in three Indian languages, Hindi, Bengali and Odia, and the results are compared with the existing group delay based segmentation technique along with the manual segmentation technique. The results of all our experiments show the effectiveness of the proposed technique in segmenting the syllable units from the original speech samples compared to the existing techniques.

Keywords Speech recognition · Speech segmentation · Syllable · Indian languages · Vowel onset point · Vowel offset point · Zero crossing rate

Soumya Priyadarsini Panda (corresponding author), [email protected]
Ajit Kumar Nayak, [email protected]
1 Department of CSE, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan University, Bhubaneswar, Odisha, India
2 Department of CS&IT, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan University, Bhubaneswar, Odisha, India

1 Introduction

In the last few years there has been a major change in technology, and the strong presence of IT companies in the market has led to the development of a number of sophisticated information processing devices. This increases the demand for human computer interaction via natural languages to enhance ease of accessibility and user friendliness. Research in the area of speech and language processing enables machines to speak and understand natural languages much like humans, leading to the development of different essential and luxury products that enhance the quality of life (Mao et al. 2014; Panda et al. 2015). There have been sufficient successes today to suggest that these technologies will continue to be a major area of research and development in creating intelligent systems now and far into the future.

Speech recognition is the ability of a machine to understand and carry out spoken commands given by a human. An automatic speech recognition (ASR) system in this respect converts the spoken speech segments in a language into the respective text units (Kitaoka et al. 2014). Speech recognition systems make use of various speech and language technologies and are increasingly being used to facilitate and enhance human communication, particularly through their use in human computer interfaces such as internet search engines and mobile communications (He et al. 2014). The possible applications of speech recognition include voice dialing (Gałka et al. 2014), data entry, applications designed for the elderly or communicatively impaired as assistive technology, games, etc. (Lippmann 1997; Wang and Sim 2014).
Recognizing human generated speech by a computer and converting it into the respective text units is an inherently complex activity, as it requires a set of cognitive and linguistic skills (Koolagudi and Rao 2012; McLoughlin 2014). Several ASR systems have been developed for different languages (Kelly et al. 2013), and the performance of such a system may be measured by the overall accuracy obtained in correct word identification, which in turn depends on appropriate identification and segmentation of each pronounceable unit and its correct transcription in the form of text. Speech segmentation in ASR systems is the process of identifying the boundaries between words, syllables or phonemes in spoken natural language. Transcription of a continuous speech signal into a sequence of words is a difficult task, as continuous speech does not have any natural pauses between words. However, most of the existing segmentation methods use manual segmentation techniques. A brief review of the existing work on speech segmentation and transcription is presented in the next section.

The conventional method of building a large vocabulary speech recognizer for any language uses a top-down approach to speech recognition, i.e. these recognizers first hypothesize the sentence, then the words that make up the sentence, and ultimately the sub-word units that make up the words, requiring a large speech corpus with sentence or phoneme level transcription of the speech utterances (Sakai and Doshita 1963). In addition, such a system requires a dictionary with the transcription of the words at the phonemic/sub-word unit level and extensive language models to perform large vocabulary continuous speech recognition. The recognizer only recognizes the words that exist in the dictionary, so adapting to a new language requires building a dictionary and extensive language models for the new language on top of the recognizer available for a particular language. Building such a corpus is a labor-intensive and time consuming process. In a country like India, with 22 official and a number of unofficial languages, designing ASR systems requires building huge text and speech databases and transcriptions, which is a difficult and time consuming task. Our objective therefore focuses on designing an automatic speech segmentation technique for the Indian languages.

There is comparatively little research documented on automatic speech segmentation in the Indian languages. As compared to the stress-timed languages like English, Japanese and French, the syllable-timed Indian languages are highly phonetic in nature, i.e. there is no variation between the written scripts and their pronunciations. Also, most of the Indian languages have similar pronunciation rules, making sub-word unit identification simpler using the same technique. In contrast, sub-word unit identification is quite complicated for the stress-timed languages due to the presence of stress patterns and varied pronunciation rules. As Indian languages are syllable-centered, the focus of our work is to obtain a vocabulary independent syllable-level transcription of the spoken utterance.

A syllable centric ASR system identifies the syllable boundaries in the continuous speech signal and segments the speech into syllable units for conversion into the respective written scripts or text (Wang 2000; Lin et al. 1996). One of the major reasons for considering the syllable as a basic unit for ASR systems is its better representational and durational stability relative to the phoneme. A syllable is typically made up of a syllable nucleus, a vowel (V), with optional initial and final margins, the consonants (C) (Origlia et al. 2014). A syllable may thus encompass both CV (consonant-vowel) and VC (vowel-consonant) transitions, including most co-articulatory and other phonological effects within its boundaries, which makes syllable boundary identification easier (Li et al. 2013). While stress-timed languages like English suffer from issues like the co-articulation effect (phoneme-to-phoneme transitions) between phonemes due to the stress used in pronunciation, making phone boundary identification difficult, the boundary identification process is much simpler for the syllable-timed languages. For syllable based speech segmentation, the syllable end points need to be obtained. The proposed technique uses a vowel offset point identification technique to segment the speech units at syllable boundaries. As the onset of a vowel is an important event marking the transition from the consonant part to the vowel part, it helps in identifying the anchor points for important acoustic events like aspiration, burst and transitions, which play an important role in identifying the different speech segments. To analyze the performance of the model, a number of test samples in the three Indian languages Hindi, Bengali and Odia are considered, and the model is compared with the existing group delay based and manual segmentation techniques. A subjective evaluation test is also performed to analyze the performance of the proposed technique for appropriate speech segmentation compared to the existing techniques.
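The detection procedure itself is developed in Sect. 3. To make the idea concrete at this point, the sketch below (in Python) shows one common way such anchor points can be located from short-time energy and zero crossing rate, two of the cues named in the keywords: vowel-like regions have high energy and a low zero crossing rate, and each point where a vowel-like run ends is taken as a candidate syllable end point. This is a minimal sketch of the general idea under those assumptions, not the authors' algorithm; the frame sizes, thresholds and helper names (detect_vowel_regions, syllable_boundaries) are illustrative.

import numpy as np

def detect_vowel_regions(signal, fs, frame_ms=20, hop_ms=10,
                         energy_ratio=0.3, zcr_ratio=0.8):
    """Flag frames that look vowel-like: short-time energy above a
    fraction of the utterance mean and zero crossing rate below a
    fraction of the utterance mean (vowels are voiced and low-ZCR)."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    x = np.asarray(signal, dtype=np.float64)
    energies, zcrs = [], []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        energies.append(np.sum(w * w))
        # Each sign change between adjacent samples is one zero crossing.
        zcrs.append(np.mean(np.abs(np.diff(np.sign(w)))) / 2.0)
    energies, zcrs = np.array(energies), np.array(zcrs)
    return (energies > energy_ratio * energies.mean()) & \
           (zcrs < zcr_ratio * zcrs.mean())

def syllable_boundaries(vowel_mask, hop_ms=10):
    """Take each vowel offset (a vowel-like run switching off) as a
    candidate syllable end point, returned in seconds."""
    ends = []
    for i in range(1, len(vowel_mask)):
        if vowel_mask[i - 1] and not vowel_mask[i]:
            ends.append(i * hop_ms / 1000.0)
    return ends

In practice the offsets found this way would still be refined using the onset evidence described above; the point of the sketch is only the division of labor between locating vowel regions and cutting at their offsets.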
The remainder of the paper is organized into the following sections. Section 2 presents a brief overview of the existing methods for speech segmentation. Section 3 discusses the proposed technique for automatic speech segmentation at syllable boundaries. Section 4 presents the result analysis of the proposed model, showing the efficiency of the technique in producing segmented syllable units from a set of continuous speech samples in the three Indian languages Hindi, Bengali and Odia. Section 5 concludes the discussion with a summary of the presented work and the directions in which further work may be carried out.
Fig. 11 Wave pattern of the input speech "Hindi me sabd" in the Hindi language (top) and the segmented output syllable units "hi", "ndi", "me", "sa", "bd" (bottom)

Fig. 12 Duration (in sec) of V type units (a, aa, i, u, ae, o) obtained by manual analysis and by the segmentation algorithm in male voice; bars compare the segmented, group-delay and predicted durations
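As an illustration of how a segmentation like the one in Fig. 11 might be produced programmatically, the fragment below applies the hypothetical helpers sketched in Sect. 1 to a recording; the file name and the assumption of a mono waveform are ours, not the paper's.

from scipy.io import wavfile

# Hypothetical example: locate candidate syllable end points in a
# mono recording of the utterance "Hindi me sabd".
fs, x = wavfile.read("hindi_me_sabd.wav")  # file name is illustrative
mask = detect_vowel_regions(x, fs)
for t in syllable_boundaries(mask):
    print(f"candidate syllable end point at {t:.2f} s")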
Fig. 13 Duration (in sec) of V type units obtained by manual analysis and by the segmentation algorithm in female voice

Fig. 14 Duration (in sec) of CV type units (ka, kaa, ki, ku, kae, ko) obtained by manual analysis and by the segmentation algorithm in male voice; bars compare the segmented, group-delay and predicted durations

Fig. 15 Duration (in sec) of CV type units obtained by manual analysis and by the segmentation algorithm in female voice

Fig. 17 Duration (in sec) of CCV type units obtained by manual analysis and by the segmentation algorithm in female voice

Fig. 18 Percentage of error in duration values for the proposed technique (V, CV and CCV units over syllable samples S1–S6; error scale 0–20 %)

Fig. 19 Percentage of error in duration values for the group-delay based technique (V, CV and CCV units over syllable samples S1–S6; error scale 0–36 %)
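The error measure plotted in Figs. 18 and 19 is the percentage deviation of the automatically obtained unit durations from the manually measured reference durations. The paper's exact formulation is given with the result analysis; the snippet below assumes the standard relative-error reading of these plots.

def duration_error_percent(d_auto, d_manual):
    """Percentage deviation of an automatically segmented unit's
    duration from its manually measured reference duration."""
    return abs(d_auto - d_manual) / d_manual * 100.0

# A unit measured at 0.30 s manually and 0.27 s automatically
# deviates by 10 %:
print(duration_error_percent(0.27, 0.30))  # 10.0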
References

Lin, C. H., Wu, C. H., Ting, P. Y., & Wang, H. M. (1996). Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units. Speech Communication, 18(2), 175–190.
Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech Communication, 22(1), 1–15.
Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16(8), 2203–2213.
McLoughlin, I. V. (2014). Super-audible voice activity detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(9), 1424–1433.
Musfir, M., Krishnan, K. R., & Murthy, H. (2014). Analysis of fricatives, stop consonants and nasals in the automatic segmentation of speech using the group delay algorithm. In Twentieth National Conference on Communications (NCC) (pp. 1–6).
Obin, N., Lamare, F., & Roebel, A. (2013). Syll-O-Matic: An adaptive time-frequency representation for the automatic segmentation of speech into syllables. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6699–6703).
Origlia, A., Cutugno, F., & Galatà, V. (2014). Continuous emotion recognition with phonetic syllables. Speech Communication, 57, 155–169.
Panda, S. P., & Nayak, A. K. (2015). An efficient model for text-to-speech synthesis in Indian languages. International Journal of Speech Technology, 18(3), 305–315.
Panda, S. P., Nayak, A. K., & Patnaik, S. (2015). Text-to-speech synthesis with an Indian language perspective. International Journal of Grid and Utility Computing, 6(3–4), 170–178.
Prasad, V. K., Nagarajan, T., & Murthy, H. A. (2004). Automatic segmentation of continuous speech using minimum phase group delay functions. Speech Communication, 42(3), 429–446.
Prasanna, S., Reddy, B. V. S., & Krishnamoorthy, P. (2009). Vowel onset point detection using source, spectral peaks, and modulation spectrum energies. IEEE Transactions on Audio, Speech, and Language Processing, 17(4), 556–565.
Sakai, T., & Doshita, S. (1963). The automatic speech recognition system for conversational sound. IEEE Transactions on Electronic Computers, 6, 835–846.
Shastri, L., Chang, S., & Greenberg, S. (1999). Syllable detection and segmentation using temporal flow neural networks. In International Congress of Phonetic Sciences (pp. 1721–1724).
Sirigos, J., Fakotakis, N., & Kokkinakis, G. (2002). A hybrid syllable recognition system based on vowel spotting. Speech Communication, 38(3), 427–440.
Sreenivas, T. V., & Niederjohn, R. J. (1992). Zero-crossing based spectral analysis and SVD spectral analysis for formant frequency estimation in noise. IEEE Transactions on Signal Processing, 40(2), 282–293.
Wang, H. M. (2000). Experiments in syllable-based retrieval of broadcast news speech in Mandarin Chinese. Speech Communication, 32(1), 49–60.
Wang, G., & Sim, K. C. (2014). Regression-based context-dependent modeling of deep neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(11), 1660–1669.
Zhao, X., & Shaughnessy, D. O. (2008). A new hybrid approach for automatic speech signal segmentation using silence signal detection, energy convex hull, and spectral variation. In Canadian Conference on Electrical and Computer Engineering (pp. 145–148).
Ziolko, B., Manandhar, S., Wilson, R. C., & Ziolko, M. (2006). Wavelet method of speech segmentation. In 14th European Signal Processing Conference (pp. 1–5).