
Int J Speech Technol (2016) 19:9–18
DOI 10.1007/s10772-015-9320-6

Automatic speech segmentation in syllable centric speech recognition system

Soumya Priyadarsini Panda¹ · Ajit Kumar Nayak²

Received: 13 August 2015 / Accepted: 12 November 2015 / Published online: 21 November 2015
© Springer Science+Business Media New York 2015
Abstract  Speech recognition is the process of understanding human or natural language speech by a computer. A syllable centric speech recognition system in this context identifies the syllable boundaries in the input speech and converts it into the respective written scripts or text units. Appropriate segmentation of the acoustic speech signal into syllabic units is an important task in the development of a highly accurate speech recognition system. This paper presents an automatic syllable based segmentation technique for segmenting continuous speech signals in Indian languages at syllable boundaries. To analyze the performance of the proposed technique, a set of experiments is carried out on different speech samples in three Indian languages (Hindi, Bengali and Odia), and the technique is compared with the existing group delay based segmentation technique and with manual segmentation. The results of all our experiments show the effectiveness of the proposed technique in segmenting syllable units from the original speech samples compared to the existing techniques.

Keywords  Speech recognition · Speech segmentation · Syllable · Indian languages · Vowel onset point · Vowel offset point · Zero crossing rate

1 Introduction

In the last few years there has been a major change in technology, and the strong presence of IT companies in the market has led to the development of a number of sophisticated information processing devices. This increases the demand for human computer interaction via natural languages, to enhance ease of access and user friendliness. Research in the area of speech and language processing enables machines to speak and understand natural languages much as humans do, leading to the development of different essential and luxury products that enhance the quality of life (Mao et al. 2014; Panda et al. 2015). There have been sufficient successes to date to suggest that these technologies will continue to be a major area of research and development in creating intelligent systems now and far into the future. Speech recognition is the ability of a machine to understand and carry out spoken commands by a human. An automatic speech recognition (ASR) system in this context converts the spoken speech segments in a language into the respective text units (Kitaoka et al. 2014). Speech recognition systems make use of various speech and language technologies and are increasingly being used to facilitate and enhance human communication, particularly through their use in human computer interfaces such as internet search engines and mobile communications (He et al. 2014). Possible applications of speech recognition include voice dialing (Gałka et al. 2014), data entry, assistive technology for the elderly or communicatively impaired, games, etc. (Lippmann 1997; Wang and Sim 2014).

Corresponding author: Soumya Priyadarsini Panda, [email protected]
Ajit Kumar Nayak, [email protected]
1 Department of CSE, Institute of Technical Education and Research, Siksha 'O' Anusandhan University, Bhubaneswar, Odisha, India
2 Department of CS&IT, Institute of Technical Education and Research, Siksha 'O' Anusandhan University, Bhubaneswar, Odisha, India

Recognizing human generated speech with a computer and converting it into the respective text units is an inherently complex activity, as it requires a set of cognitive and linguistic skills (Koolagudi and Rao 2012; McLoughlin 2014). Several ASR systems have been developed for different languages (Kelly et al. 2013), and the performance of a system may be measured by the overall accuracy obtained in correct word identification, which depends on appropriate identification and segmentation of each pronounceable unit and its correct transcription in the form of text. Speech segmentation in ASR systems is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural language. Transcription of a continuous speech signal into a sequence of words is a difficult task, as continuous speech does not have natural pauses between words. However, most of the existing segmentation methods rely on manual segmentation. A brief review of the existing work on speech segmentation and transcription is presented in the next section.

The conventional method of building a large vocabulary speech recognizer for any language uses a top-down approach to speech recognition: these recognizers first hypothesize the sentence, then the words that make up the sentence, and ultimately the sub-word units that make up the words, requiring a large speech corpus with sentence or phoneme level transcription of the speech utterances (Sakai and Doshita 1963). In addition, it also requires a dictionary with the transcription of the words at the phonemic/sub-word unit level and extensive language models to perform large vocabulary continuous speech recognition. The recognizer only recognizes the words that exist in the dictionary. Moreover, adapting to a new language requires building a dictionary and extensive language models for the new language on top of a recognizer available for an existing language. Building such a corpus is a labor-intensive and time consuming process. In a country like India, with 22 official and a number of unofficial languages, designing ASR systems requires building huge text and speech databases and transcriptions, which is a difficult and time consuming task. Our objective therefore focuses on designing an automatic speech segmentation technique for the Indian languages.

Less research has been documented on automatic speech segmentation in the Indian languages. As compared to stress-timed languages like English, Japanese and French, the syllable-timed Indian languages are highly phonetic in nature, i.e. there is no variation between the written scripts and their pronunciations. Also, most of the Indian languages have similar pronunciation rules, making sub-word unit identification simpler using the same technique, whereas sub-word unit identification is quite complicated for the stress-timed languages due to the presence of stress patterns and varied pronunciation rules. As Indian languages are syllable-centered, the focus of our work is to obtain a vocabulary independent syllable-level transcription of the spoken utterance.

A syllable centric ASR system identifies the syllable boundaries in the continuous speech signal and segments the speech into syllable units for conversion into the respective written scripts or text (Wang 2000; Lin et al. 1996). One of the major reasons for considering the syllable as a basic unit for ASR systems is its better representational and durational stability relative to the phoneme. A syllable is typically made up of a syllable nucleus, a vowel (V), with optional initial and final margins, consonants (C) (Origlia et al. 2014). Syllables thus may encompass both CV (consonant–vowel) and VC (vowel–consonant) transitions, including most co-articulatory and other phonological effects within their boundaries, which makes syllable boundary identification easier (Li et al. 2013). While stress-timed languages like English suffer from issues like the co-articulation effect (phoneme-to-phoneme transitions) caused by the stress used in pronunciation, making phone boundary identification difficult, the boundary identification process is considerably simpler for syllable-timed languages. For syllable based speech segmentation, the syllable end points need to be obtained. The proposed technique uses a vowel offset point identification technique to segment the speech units at syllable boundaries. As the onset of a vowel is an important event marking the transition from the consonant part to the vowel part, it helps in identifying the anchor points for important acoustic events like aspiration, burst, transitions, etc., which play an important role in identifying different speech segments. To analyze the performance of the model, a number of test samples in the three Indian languages Hindi, Bengali and Odia are considered, and the model is compared with the existing group delay based and manual segmentation techniques. A subjective evaluation test is also performed to analyze the performance of the proposed technique for appropriate speech segmentation compared to the existing techniques.

The remainder of the paper is organized as follows. Section 2 gives a brief overview of the existing methods for speech segmentation. Section 3 discusses the proposed technique for automatic speech segmentation at syllable boundaries. Section 4 discusses the result analysis of the proposed model, showing the efficiency of the technique in producing segmented syllable units from a set of continuous speech samples in the three Indian languages Hindi, Bengali and Odia. Section 5 concludes with a summary of the presented work and the directions in which it may be carried further.

2 Related work

Developing automatic speech recognition systems capable of obtaining high accuracy in different Indian languages is a difficult and ongoing process. A number of techniques are available for stress-timed languages like English, French and Japanese; however, less research has been documented for syllable-timed languages like the Indian languages. Speech segmentation is an important problem in ASR systems, as segmentation of the continuous speech signal into the smallest pronounceable units helps the ASR system identify the units properly. For extracting the syllable boundary information from continuous speech signals, a temporal flow model (TFM) network has been discussed in Shastri et al. (1999), where the time varying properties of the speech are captured by the TFM. The TFM is a neural network architecture that supports arbitrary connectivity across different layers, provides for feed-forward as well as recurrent links, and allows variable propagation delays along links. However, the approach requires analysis of the speech signals with respect to various aspects for training the model.

Different studies have investigated the speech segmentation problem of ASR systems in different cultures, most of which have focused on an ANN based approach (Sirigos et al. 2002) that needs training of the model with a speech corpus carrying end point information in the respective languages. Fewer studies have investigated syllable end point identification dynamically. Prasad et al. (2004) presented a new algorithm for automatic segmentation of speech signals into syllable-like units based on a short-term energy function. A syllable level segmentation technique is proposed in Ziolko et al. (2006) for the Japanese language based on a common syllable model, where the segment boundaries of the units are detected by finding the optimal HMM sequence. Zhao and Shaughnessy (2008) propose a hybrid automatic segmentation method that utilizes silence detection, convex hull energy analysis, and spectral variation analysis for segmentation of syllable units in Mandarin speech. In Obin et al. (2013) a novel paradigm for syllable-based segmentation is proposed that uses a time–frequency representation and the fusion of intensity and voicing measures across frequency regions for selecting the information pertinent to the segmentation.

In Musfir et al. (2014) a modified group delay based approach is presented for reducing error proportions in syllable based segmentation, particularly for fricative, nasal and unvoiced stop type units. For segmentation of the speech signal, the ratio of energy in the high frequency bands to that in the low frequency bands is used; however, the issues arising for semivowels are not addressed. In Besacier et al. (2014) a survey of automatic speech recognition for under-resourced languages is presented, which highlights the lack of automatic unsupervised techniques for syllable based speech segmentation. Many models have been developed over the decades and some progress has been achieved in speech segmentation; nevertheless, the quality in terms of accuracy still shows gaps, particularly regarding adaptation to the Indian languages.

3 Automatic speech segmentation technique

Segmenting a continuous speech signal into syllable units is not a single phase conversion; instead it is carried out through different phases. The phases of automatic speech segmentation at syllable boundaries are shown in Fig. 1, and the details of the phases are discussed next.

Fig. 1  Phases of speech segmentation: input speech sample → time domain representation → vowel onset identification → syllable end point identification → speech segmentation → segmented speech

3.1 Time domain representation

The input to the model is .wav files recorded at 8000 Hz, mono, which may contain speech samples for words or sentences in the considered Indian languages (Hindi, Bengali and Odia). For obtaining the syllable end points for automatic speech segmentation, the speech samples in the .wav files first need to be represented in the time domain. The silent gaps (if present) in the input speech sample are removed before further processing. The time varying spectral characteristics of the speech signal are represented graphically as wave patterns. The correct number of points to represent the wave file is obtained from the sampling frequency (f_s = 1/t_s), i.e. the number of samples per second in the wav file. Figure 2 shows the time domain representation of the vowel "aa", and Fig. 3 shows the wave pattern of the word "bhaa-shaa" in Odia language (English meaning: language) on the time axis, with two syllable units "bhaa" and "shaa".


Fig. 2  Time domain representation of the vowel "aa"

Fig. 3  Time domain representation of the word "bhaa-shaa"

Fig. 4  Consonant and vowel regions in an utterance "kee" (CV)

3.2 Vowel onset identification

For identifying the syllable end points, the events of vowel onset and offset in speech production are used. While the occurrence of a vowel in a speech signal is marked by the vowel onset point, the end of the vowel section is called the offset point. Significant changes may be noticed in the energies of the excitation source, the spectral peaks, and the modulation spectrum of the speech signal at the VOP instants. Also, the speech signals for the vowels are produced with a higher energy than the consonants, and this variation may easily be detected by analyzing the spectral peaks in the speech signal. Therefore, the use of vowel onset and offset events helps in obtaining the syllable boundaries in the speech signals.

The syllable units in the Indian languages may be of the forms V, CV, CCV, CCVC and CVCC, where C and V represent consonant and vowel, respectively. In Indian languages, more than 90 % of the syllables are of CV type (Prasanna et al. 2009). Our model focuses on segmentation of three forms of syllables, V, CV and CCV, from continuous speech samples in the three considered languages. Segmenting a syllable into vowel and consonant regions can be performed by determining the onset and offset points of the vowels: in a CV unit, the speech segment before the vowel onset point is the consonant region, and the segment after the vowel onset point is the vowel region. The variation in the consonant and vowel regions and the offset and onset points are shown in Fig. 4.

The pronunciation of Indian language consonants contains an inherent vowel "a" within it. The syllable units must end with the vowel "a" or some other vowel. Therefore the speech sample is segmented into a number of segments equal to the number of vowels present in the input sample. The vowel sections may easily be identified in the wave representation, as the vowel sounds are produced with higher energy than the consonants: the high energy portions of the wave patterns show the vowels, the low energy sections show the consonant sections, and zero axis values show the silent gaps.

The vowel onset points (VOPs) play an important role not only in identifying the number of segments but also in the syllable end point identification in the next phase. For obtaining the VOPs, the spectral peaks method (Prasanna et al. 2009) is used, as the instant at which the onset of a vowel takes place must have high energy. The peak point in the energy spectrum shows the instant of the VOP on the time axis. To verify the accuracy of the time instants obtained for the VOPs, the VOPs are first manually labeled through a wave analysis process, as discussed in Prasanna et al. (2009). The instant of the VOP obtained by the manual wave analysis process for the syllable unit "ka" is shown in Fig. 5a.

For obtaining the VOPs dynamically using the spectral peaks method, the speech signal is processed in blocks of 20 ms with a 10 ms shift, i.e. the time domain speech signal is processed in overlapping blocks. The Discrete Fourier Transform (DFT) is used to calculate the frequency spectrum of each block, to examine the information encoded in the frequency, phase, and amplitude. A 256-point DFT is computed on each block of the speech signal, and the sum of the ten largest peaks is computed from the first 128 points.
123
Int J Speech Technol (2016) 19:9–18 13

Fig. 5  VOP instance identification: (a) original waveform of the "ka" CV type unit with the manually labeled VOP instant, (b) sum of the 10 largest peak points in each block, (c) enhanced values using the FOD, showing the VOP evidence

The sum of the ten peak points in each block, plotted as a function of time, is shown in Fig. 5b. The change at the VOP present in the spectral peaks energy is further enhanced by computing its slope using the First Order Difference (FOD). These enhanced, normalized values constitute the VOP evidence. The VOP evidence plot obtained by the spectral peaks method for the speech signal "/ka/" is presented in Fig. 5c.
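A compact sketch of this evidence computation is given below, using the block and DFT parameters stated above (20 ms blocks, 10 ms shift, 256-point DFT, sum of the ten largest magnitudes in the first 128 bins, slope enhancement by the FOD). The peak-picking threshold and the use of local maxima to convert the evidence into VOP instants are our assumptions for illustration; the paper relies on Prasanna et al. (2009) for the method itself.

```python
import numpy as np

def vop_evidence(x, fs=8000, block_ms=20, shift_ms=10):
    """Spectral peaks VOP evidence: per block, sum the ten largest
    magnitudes in the first 128 bins of a 256-point DFT, then enhance
    the changes with a First Order Difference (FOD)."""
    blk, hop = fs * block_ms // 1000, fs * shift_ms // 1000  # 160, 80 samples
    sums = []
    for start in range(0, len(x) - blk + 1, hop):
        spec = np.abs(np.fft.fft(x[start:start + blk], n=256))[:128]
        sums.append(np.sort(spec)[-10:].sum())   # ten largest spectral peaks
    sums = np.asarray(sums)
    fod = np.clip(np.diff(sums, prepend=sums[0]), 0, None)   # rising slope only
    return fod / max(fod.max(), 1e-12)            # normalized VOP evidence

def pick_vops(evidence, hop_s=0.010, thresh=0.4):
    """Assumed peak picking: local maxima of the evidence above a
    threshold are taken as the VOP instants (in seconds)."""
    return [i * hop_s for i in range(1, len(evidence) - 1)
            if evidence[i] > thresh
            and evidence[i] >= evidence[i - 1]
            and evidence[i] >= evidence[i + 1]]
```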
3.3 Syllable end point identification

The instant at which a vowel section ends is known as the vowel offset point (VOF), and offset identification leads to the identification of the syllable end points, as Indian language syllables end with a vowel. To identify the syllable end points, the zero crossing rate (Kay and Sudhaker 1986; Sreenivas and Niederjohn 1992) is computed on the speech signal between two identified VOPs, or after the last VOP instant. The zero-crossing rate (ZCR) is the rate of sign-changes along a signal (Lau and Chan 1985), i.e., the rate at which the signal changes from positive to negative or back. The zero crossing count is an indicator of the frequency at which the energy is concentrated in the signal spectrum. Voiced speech is produced by excitation of the vocal tract by the periodic flow of air at the glottis and usually shows a low zero-crossing count, whereas unvoiced speech is produced by a constriction of the vocal tract narrow enough to cause turbulent airflow, which results in noise and shows a high zero-crossing count.

A zero-crossing is a point where the sign of a mathematical function changes (e.g. from positive to negative), represented by a crossing of the axis (zero value) in the graph of the function. The speech segments between two VOPs are processed to obtain the ZCR values, and the region of higher ZCR after a VOP instant marks the offset point of the vowel unit and the starting point of the next syllable, as shown in Fig. 6.

Fig. 6  Onset and offset point identification on the speech sample "/bhaa-shaa/", with high ZCR regions showing the consonant sections

The rate at which zero crossings occur is a simple measure of the frequency content of a signal. The ZCR may be defined, as given in Lau and Chan (1985), as

Z_n = \sum_{m=-\infty}^{\infty} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m)    (1)

where

\operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases}

and w(n) is the windowing function with a window size of N samples, as given in Eq. (2):

w(n) = \begin{cases} 1/(2N), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}    (2)
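Equation (1) with the window of Eq. (2) reduces to counting the sign changes in the N samples up to index n and scaling by 1/(2N). A minimal sketch of this computation, assuming a window of N = 80 samples (10 ms at 8 kHz), is:

```python
import numpy as np

def zcr(x, n, N=80):
    """Z_n of Eq. (1) with the rectangular window of Eq. (2):
    w(k) = 1/(2N) for 0 <= k <= N-1, so the sum runs over
    m = n-N+1 .. n."""
    s = np.where(np.asarray(x) >= 0, 1, -1)   # sgn[x(m)] as defined above
    lo = max(n - N + 1, 1)                    # stay inside the signal
    changes = sum(abs(int(s[m]) - int(s[m - 1])) for m in range(lo, n + 1))
    return changes / (2 * N)
```

Evaluating zcr over the samples between two VOPs and locating where its value rises marks the vowel offset point, as described above.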
3.4 Speech segmentation

The proposed algorithm for syllable based speech segmentation is given in Algorithm 1. The input to the algorithm is .wav files containing the pronunciation of different words in the considered languages, and the algorithm produces segmented syllable units in separate .wav files as output. The wave data in the input speech sample (word) are copied into an array and represented in the time domain, where t is the total time duration of the input speech, as discussed in Sect. 3.1. The time instants of the VOPs are obtained using the spectral peaks method (Sect. 3.2). For obtaining the syllable end points, the ZCR values are computed and the VOFs are identified as discussed in Sect. 3.3. The input speech sample is then segmented into p samples based on the identified VOPs and VOFs using the function f(t), as presented in step 5 of the proposed algorithm, where S1 represents the wave data between 0 and VOF1, S2 represents the wave data between VOF_i and VOF_{i+1} (1 < i < n), and S3 is the wave data between VOF_n and the end.
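Step 5 amounts to slicing the signal array at the identified VOF instants. Since Algorithm 1 itself is not reproduced here, the following is only a sketch of that slicing step under the stated S1/S2/S3 definition; the output file naming is an assumption.

```python
import numpy as np
from scipy.io import wavfile

def segment_by_vofs(x, vofs, fs=8000, stem="syllable"):
    """Slice the signal at the vowel offset points (VOFs, in seconds):
    S1 = x[0:VOF1], intermediate segments span VOF_i..VOF_{i+1}, and the
    last segment runs from VOF_n to the end of the signal."""
    bounds = [0] + [int(v * fs) for v in sorted(vofs)] + [len(x)]
    paths = []
    for i in range(len(bounds) - 1):
        seg = x[bounds[i]:bounds[i + 1]]
        if len(seg) == 0:                 # a VOF at the very end: nothing left
            continue
        path = f"{stem}_{i + 1}.wav"      # one .wav file per syllable unit
        wavfile.write(path, fs, (seg * 32767).astype(np.int16))
        paths.append(path)
    return paths
```

For the "bhaa-shaa" example that follows, a single interior VOF yields two files holding "bhaa" and "shaa".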


For example, for the wave pattern presented in Fig. 6 and discussed above, the number of VOPs identified is two; therefore the input speech segment is segmented into two syllable units. The syllable boundaries are identified by the VOFs. For the first segment, f(t) = S1, i.e. the data values from time instant 0 to the VOF1 time instant are segmented as the first syllable segment; for the second VOF, f(t) = S3, i.e. the wave data between VOF1 and the end are segmented as the second syllable. Therefore, the two segmented speech units for the input word sample "bhaa-shaa" are "bhaa" and "shaa", as shown in Fig. 7.

Fig. 7  Syllable based segmentation by the proposed algorithm for the speech sample "bhaashaa" in Odia language

Fig. 8  Wave pattern of the input speech "sa-bda" in Odia language (top) and the segmented output syllable units "/sa/" and "/bda/" (bottom)

4 Result analysis

For analyzing the performance of the segmentation algorithm in Indian languages, a set of random test samples (words) in Hindi, Bengali and Odia covering the three syllable forms (V, CV, CCV) is considered, and the algorithm is run to segment the speech at syllable boundaries. The output of the speech segmentation algorithm for the input speech sample "sabda" in Odia language is shown in Fig. 8, where the speech unit is segmented into two syllable units, "sa" (CV) and "bda" (CCV). Figure 9 shows the output of the segmentation algorithm for the word "hindi" in Hindi language, where the input speech is segmented into two syllable units, "hi" (CV) and "ndi" (CCV), and Fig. 10 shows the speech segmentation process on the Bengali word "baanglaa" with three syllable units, "baa", "ng" and "laa". To show the efficiency of the technique in performing syllable based speech segmentation on sentences, a set of experiments is also performed on different sentences in the considered languages. An example sentence is presented in Fig. 11 with the wave patterns of the segmented syllable units. For all the experiments performed, the same set of words and sentences is tested on the existing group delay based technique (Musfir et al. 2014). Even though the group delay based approach produces poor quality results for fricatives, stop consonants and nasal sounds, a few results showing the durations of the identified syllable units are presented in the next section, along with those of the proposed and manual segmentation techniques.


Fig. 9  Wave pattern of the input speech "hindi" in Hindi language (top) and the segmented output syllable units "/hi/" and "/ndi/" (bottom)

Fig. 10  Wave pattern of the input speech "baanglaa" in Bengali language (top) and the segmented output syllable units "/baa/", "/ng/" and "/laa/" (bottom)

Fig. 11  Wave pattern of the input speech "Hindi me sabd" in Hindi language (top) and the segmented output syllable units "hi", "ndi", "me", "sa", "bd" (bottom)

4.1 Variation in segmented syllable durations

Experiments are performed on sixty different units from each of the considered syllable types, out of which six units from each category are selected to show the effectiveness of the technique in producing appropriate syllable segments. Speech samples in both male and female voices, recorded at 8000 Hz, mono, .wav format, are considered for all the experiments. The duration data for these six different units from the three considered syllable forms are presented in this section in male and female voice. For obtaining the predicted duration values of all the considered syllable units, a manual labeling approach is used. The same sets of words are segmented at syllable boundaries by the proposed algorithm as well as by the group delay based technique. All the experiments on syllable duration identification are performed on the same set of speech data in male and female voice. The differences between the manually predicted and the segmented speech durations for the V type units are presented in Figs. 12 and 13 for male and female voice respectively. Figures 14 and 15 show the syllable duration values obtained for the CV type units, and Figs. 16 and 17 show the duration values for the CCV type units. In all the experiments performed, it may be observed from the results that the proposed segmentation technique achieves results close to the predicted syllable durations, whereas the group delay based technique shows a high degree of variation in syllable durations compared to the predicted durations.

Fig. 12  Duration of V type units obtained by manual analysis and segmentation algorithm in male voice

The error percentage between the predicted syllable duration (actual duration) and the segmented duration is computed by the formula given in Eq. (3), where E is the percentage of error, DS is the duration of the segmented speech and DA is the actual duration of the syllable obtained by manual segmentation. The average % error for the three forms of syllable units (V, CV and CCV) over the six considered sample units is obtained separately for the proposed and the existing segmentation techniques. The proportions of error in recovering the exact syllable duration for the proposed and the existing group delay based techniques are presented in Figs. 18 and 19 respectively.


Fig. 13  Duration of V type units obtained by manual analysis and segmentation algorithm in female voice

Fig. 14  Duration of CV type units obtained by manual analysis and segmentation algorithm in male voice

Fig. 15  Duration of CV type units obtained by manual analysis and segmentation algorithm in female voice

Fig. 16  Duration of CCV type units obtained by manual analysis and segmentation algorithm in male voice

Fig. 17  Duration of CCV type units obtained by manual analysis and segmentation algorithm in female voice

Fig. 18  Percentage of error in duration values for the proposed technique

Fig. 19  Percentage of error in duration values for the group-delay based technique

On average, 5, 4 and 2 % error occurred for the V, CV and CCV type units respectively with the proposed technique, whereas the existing group delay based technique shows 14, 16 and 10 % error respectively for the same syllable types. In other words, while the proposed segmentation algorithm gives 96 % accurate segmentation results (4 % error on average over all types of units), the existing group delay based technique achieves 86 % accuracy (14 % error on average over all types of units) for the considered speech samples.

E = \frac{D_S - D_A}{D_A} \times 100    (3)
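Equation (3) is a signed percentage deviation; a one-line implementation with a purely illustrative check (the numbers are hypothetical, not taken from the reported experiments) is:

```python
def duration_error_percent(d_segmented, d_actual):
    """Eq. (3): E = (D_S - D_A) / D_A * 100, the percentage deviation of
    the segmented duration from the manually labeled (actual) one."""
    return (d_segmented - d_actual) / d_actual * 100.0

# Hypothetical values: a 0.21 s segment against a 0.20 s manual label
# deviates by about +5 %.
print(round(duration_error_percent(0.21, 0.20), 2))   # 5.0
```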
4.2 Subjective evaluation

To analyze the quality of the segmented speech, a subjective evaluation is performed, in which a listening test is conducted with five different listeners to evaluate the quality of the speech segmented by the proposed and the group delay based techniques against the manually segmented speech. In these tests, the MOS (mean opinion score) (Panda and Nayak 2015) for a set of words in Hindi, Bengali and Odia with the three syllable forms is considered. The speech units for the selected words, segmented manually and segmented dynamically by the algorithms, are played one by one.


Fig. 20  Average MOS score for 10 samples by 5 listeners for Hindi language speech units

Fig. 21  Average MOS score for 10 samples by 5 listeners for Bengali language speech units

Fig. 22  Average MOS score for 10 samples by 5 listeners for Odia language speech units

The listeners rate the speech quality on a five point scale (1: very low, 2: low, 3: average, 4: high, 5: very high) on the basis of their perception of the understandability and clarity of the speech. All the tests were performed with a headphone set, and the only instruction given to the listeners is to compare the three speech segments for each sample word. Figures 20, 21 and 22 show the average MOS test results for ten words per listener for our experiments on Hindi, Bengali and Odia language speech respectively. The results of the proposed segmentation technique are comparable with those of the manually segmented speech, while the group delay based technique shows relatively poor results due to its improper segmentations.

5 Conclusions

In this work, an automatic speech segmentation algorithm is proposed for dynamically segmenting Indian language speech at syllable boundaries. The use of the VOP identification technique together with the ZCR technique segments the continuous speech signals dynamically at syllable boundaries, overcoming the time requirements and database limitations of manual segmentation. Experiments were performed to analyze the overall performance of the model in segmenting different types of syllable units in the three Indian languages Hindi, Bengali and Odia. The presented results show a much smaller proportion of error in the segmented syllable durations, measured against the actual durations of the speech samples, than the existing technique. Also, this method saves much of the time needed by the manual labeling process for speech segmentation, giving the segmented output in a few milliseconds. Even though the proposed technique works well for the three considered syllable forms (V, CV, and CCV) in Hindi, Bengali and Odia speech, the algorithm may occasionally produce unnatural results when multiple vowel regions are fused together (e.g. the words "aa-i" and "u-ee" in Odia language, of type V–V). Therefore, the proposed algorithm may be further enhanced to handle V–V pairs appropriately. This algorithm may also be incorporated into Indian language speech recognition systems for segmenting continuous speech signals into syllable units for further processing, avoiding the need for a large manually labeled database of syllable duration information.


References

Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85–100.
Gałka, J., Masior, M., & Salasa, M. (2014). Voice authentication embedded solution for secured access control. IEEE Transactions on Consumer Electronics, 60(4), 653–661.
He, Y., Han, J., Zheng, T., & Sun, G. (2014). A new framework for robust speech recognition in complex channel environments. Digital Signal Processing, 32, 109–123.
Kay, S. M., & Sudhaker, R. (1986). A zero crossing-based spectrum analyzer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1), 96–104.
Kelly, F., Drygajlo, A., & Harte, N. (2013). Speaker verification in score-ageing-quality classification space. Computer Speech & Language, 27(5), 1068–1084.
Kitaoka, N., Enami, D., & Nakagawa, S. (2014). Effect of acoustic and linguistic contexts on human and machine speech recognition. Computer Speech & Language, 28(3), 769–787.
Koolagudi, S. G., & Rao, K. S. (2012). Emotion recognition from speech using source, system, and prosodic features. International Journal of Speech Technology, 15(2), 265–289.
Lau, Y. K., & Chan, C. K. (1985). Speech recognition based on zero crossing rate and energy. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(1), 320–323.
Li, M., Han, K. J., & Narayanan, S. (2013). Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech & Language, 27(1), 151–167.
Lin, C. H., Wu, C. H., Ting, P. Y., & Wang, H. M. (1996). Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units. Speech Communication, 18(2), 175–190.
Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech Communication, 22(1), 1–15.
Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16(8), 2203–2213.
McLoughlin, I. V. (2014). Super-audible voice activity detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(9), 1424–1433.
Musfir, M., Krishnan, K. R., & Murthy, H. (2014). Analysis of fricatives, stop consonants and nasals in the automatic segmentation of speech using the group delay algorithm. In Twentieth National Conference on Communications (NCC) (pp. 1–6).
Obin, N., Lamare, F., & Roebel, A. (2013). Syll-O-Matic: An adaptive time-frequency representation for the automatic segmentation of speech into syllables. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6699–6703).
Origlia, A., Cutugno, F., & Galatà, V. (2014). Continuous emotion recognition with phonetic syllables. Speech Communication, 57, 155–169.
Panda, S. P., & Nayak, A. K. (2015). An efficient model for text-to-speech synthesis in Indian languages. International Journal of Speech Technology, 18(3), 305–315.
Panda, S. P., Nayak, A. K., & Patnaik, S. (2015). Text-to-speech synthesis with an Indian language perspective. International Journal of Grid and Utility Computing, 6(3–4), 170–178.
Prasad, V. K., Nagarajan, T., & Murthy, H. A. (2004). Automatic segmentation of continuous speech using minimum phase group delay functions. Speech Communication, 42(3), 429–446.
Prasanna, S., Reddy, B. V. S., & Krishnamoorthy, P. (2009). Vowel onset point detection using source, spectral peaks, and modulation spectrum energies. IEEE Transactions on Audio, Speech, and Language Processing, 17(4), 556–565.
Sakai, T., & Doshita, S. (1963). The automatic speech recognition system for conversational sound. IEEE Transactions on Electronic Computers, 6, 835–846.
Shastri, L., Chang, S., & Greenberg, S. (1999). Syllable detection and segmentation using temporal flow neural networks. In International Congress of Phonetic Sciences (pp. 1721–1724).
Sirigos, J., Fakotakis, N., & Kokkinakis, G. (2002). A hybrid syllable recognition system based on vowel spotting. Speech Communication, 38(3), 427–440.
Sreenivas, T. V., & Niederjohn, R. J. (1992). Zero-crossing based spectral analysis and SVD spectral analysis for formant frequency estimation in noise. IEEE Transactions on Signal Processing, 40(2), 282–293.
Wang, H. M. (2000). Experiments in syllable-based retrieval of broadcast news speech in Mandarin Chinese. Speech Communication, 32(1), 49–60.
Wang, G., & Sim, K. C. (2014). Regression-based context-dependent modeling of deep neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(11), 1660–1669.
Zhao, X., & Shaughnessy, D. O. (2008). A new hybrid approach for automatic speech signal segmentation using silence signal detection, energy convex hull, and spectral variation. In Canadian Conference on Electrical and Computer Engineering (pp. 145–148).
Ziolko, B., Manandhar, S., Wilson, R. C., & Ziolko, M. (2006). Wavelet method of speech segmentation. In 14th European Signal Processing Conference (pp. 1–5).

