Voiced/Unvoiced Decision For Speech Signals Based On Zero-Crossing Rate and Energy
Abstract--In speech analysis, the voiced/unvoiced decision is usually performed when extracting information from speech signals. In this paper, two methods are used to separate the voiced and unvoiced parts of speech signals: zero-crossing rate (ZCR) and energy. We evaluated the results by dividing the speech sample into segments and using the zero-crossing rate and energy calculations to separate the voiced and unvoiced parts of speech. The results suggest that zero-crossing rates are low for the voiced part and high for the unvoiced part, whereas the energy is high for the voiced part and low for the unvoiced part. Therefore, these methods prove effective in separating voiced and unvoiced speech.

I. INTRODUCTION
Speech can be divided into numerous voiced and unvoiced regions. The classification of a speech signal into voiced and unvoiced parts provides a preliminary acoustic segmentation for speech processing applications, such as speech synthesis, speech enhancement, and speech recognition.
“Voiced speech consists of more or less constant frequency tones of some duration, made when vowels are spoken. It is produced when periodic pulses of air generated by the vibrating glottis resonate through the vocal tract, at frequencies dependent on the vocal tract shape. About two-thirds of speech is voiced and this type of speech is also what is most important for intelligibility. Unvoiced speech is non-periodic, random-like sounds, caused by air passing through a narrow constriction of the vocal tract as when consonants are spoken. Voiced speech, because of its periodic nature, can be identified, and extracted [1]”.
In recent years, considerable effort has been spent by researchers on the problem of classifying speech into voiced/unvoiced parts [2-8]. Pattern recognition approaches and statistical and non-statistical techniques have been applied to decide whether a given segment of a speech signal should be classified as voiced or unvoiced speech [2, 3, 5, 7]. Qi and Hunt classified voiced and unvoiced speech using non-parametric methods based on a multi-layer feed-forward network [4]. Acoustical features and pattern recognition techniques were used to separate the speech segments into voiced/unvoiced [8].
The method we used in this work is a simple and fast approach and may overcome the problem of classifying speech into voiced/unvoiced using the zero-crossing rate and energy of a speech signal. The methods used in this study are presented in the second part. The results are given in the third part.

II. METHOD
In our design, we combined the zero-crossing rate and energy calculations. The zero-crossing rate is an important parameter for voiced/unvoiced classification. It is also often used as part of the front-end processing in automatic speech recognition systems. The zero-crossing count is an indicator of the frequency at which the energy is concentrated in the signal spectrum. Voiced speech is produced by excitation of the vocal tract by the periodic flow of air at the glottis and usually shows a low zero-crossing count [9], whereas unvoiced speech is produced by a constriction of the vocal tract narrow enough to cause turbulent airflow, which results in noise and shows a high zero-crossing count.
The energy of the speech signal is another parameter for classifying the voiced/unvoiced parts. The voiced part of the speech has high energy because of its periodicity, and the unvoiced part of the speech has low energy. The analysis for classifying the voiced/unvoiced parts of speech is illustrated in the block diagram in Fig. 1.
In the first stage, the speech signal is divided frame by frame into non-overlapping intervals, as shown in Fig. 2.

A. End-Point Detection
One of the most basic but problematic aspects of speech processing is detecting when a speech utterance starts and ends. This is called end-point detection. When unvoiced sounds occur at the beginning or end of the utterance, it is difficult to distinguish the speech signal accurately from the background noise signal.
Chapter: Advanced Techniques in Computing Sciences and Software Engineering,
pp 279-282, 2010; DOI 10.1007/978-90-481-3660-5_47
In this work, end-point detection is applied at the beginning of the voiced/unvoiced algorithm to separate silence from the speech signal. A small sample of the background noise is taken during the silence interval just prior to the commencement of the speech signal. The short-time energy function of the entire utterance is then computed using Eq. 4.
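[Fig. 1: Block diagram of the voiced/unvoiced classification. The speech signal x(n) passes through end-point detection and frame-by-frame processing with a Hamming window. For each frame, the short-time energy (E) and the short-time average zero-crossing rate (ZCR) are calculated. If ZCR is small and E is high, the frame is classified as a voiced speech signal; if not, as an unvoiced speech signal; if the decision is not sure, the frame is subdivided and processed again.]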
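The end-point detection step described in this section can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `noise_frames` and `factor` parameters, and the rule "energy above a multiple of the noise floor" are all assumptions introduced here, and a rectangular frame is used for simplicity instead of a weighted window.

```python
def short_time_energy(x, frame_len):
    """Sum of squared samples in each non-overlapping frame of length frame_len."""
    return [sum(s * s for s in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]


def detect_endpoints(x, frame_len, noise_frames=2, factor=4.0):
    """Return (start, end) sample indices of the utterance, or None if none is found.

    The first `noise_frames` frames are assumed to contain only background
    noise (the silence interval before the speech starts). Both `noise_frames`
    and `factor` are illustrative values, not taken from the paper.
    """
    energy = short_time_energy(x, frame_len)
    if len(energy) <= noise_frames:
        return None
    noise_floor = sum(energy[:noise_frames]) / noise_frames
    threshold = factor * noise_floor + 1e-12  # small offset guards against an all-zero noise sample
    active = [i for i, e in enumerate(energy) if e > threshold]
    if not active:
        return None
    # Start of the first active frame, end of the last active frame.
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

In practice the threshold would have to be tuned to the recording conditions, since weak unvoiced sounds at the edges of an utterance can have energy close to that of the background noise.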
B. Zero-Crossing Rate
In the context of discrete-time signals, a zero crossing is
said to occur if successive samples have different algebraic
signs. The rate at which zero crossings occur is a simple
Fig. 3: Definition of zero-crossings rate
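where

$$\operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases} \qquad (2)$$

and

$$w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

$$E_n = \sum_{m} \left[ x(m)\, w(n-m) \right]^2 \qquad (4)$$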
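To make the definitions in Eqs. (2) to (4) concrete, the sketch below computes the zero-crossing count, the zero-crossing rate, and the energy of a single frame. It assumes the usual short-time form in which each sign change between successive samples contributes |sgn[x(m)] - sgn[x(m-1)]| = 2, so that the 1/(2N) window of Eq. (3) turns the sum into crossings per sample. The function names are hypothetical, and the energy uses an unweighted rectangular frame, which is an assumption made here for simplicity.

```python
def sgn(v):
    """Eq. (2): +1 for non-negative samples, -1 for negative samples."""
    return 1 if v >= 0 else -1


def zero_crossing_count(frame):
    """Number of sign changes between successive samples in the frame."""
    # Each crossing contributes |sgn(b) - sgn(a)| = 2, hence the division by 2.
    return sum(abs(sgn(b) - sgn(a)) for a, b in zip(frame, frame[1:])) // 2


def zero_crossing_rate(frame):
    """Crossings per sample: the count weighted by the 1/(2N) window of Eq. (3)."""
    return zero_crossing_count(frame) / len(frame)


def frame_energy(frame):
    """Eq. (4) with a unit-height rectangular window (assumption): sum of squares."""
    return sum(s * s for s in frame)
```

For example, the frame `[1, -1, 1, -1]` has three sign changes and an energy of 4, whereas a constant positive frame has no zero crossings at all.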
duration at the beginning. The algorithm halves the window duration at each feedback step if the decision is not clear. The results of the voiced/unvoiced decision using our model are presented in Table 1.

TABLE I. VOICED/UNVOICED DECISIONS FOR THE WORD “FOUR” USING THE MODEL.
Frame (duration)        ZCR    Energy (J)   Decision
Frame-1 (50 ms)         152    0.0018       unvoiced
Frame-21 (25 ms)        52     0.0543       unvoiced
Frame-5 (50 ms)         43     252.98       voiced
Frame-6 (50 ms)         56     193.70       voiced
Frame-71 (25 ms)        31     27.2842      voiced
Frame-72 (25 ms)        30     25.960       voiced
Frame-811 (12.5 ms)     24     3.4214       voiced
Frame-812 (12.5 ms)     11     0.4765       unvoiced
Frame-82 (25 ms)        19     0.166        unvoiced
Frame-9 (50 ms)         89     0.0054       unvoiced

[Fig. 6: Original speech signal for the word “four” (amplitude versus sample index).]

The frame-by-frame representation of the algorithm is presented in Fig. 7. At the beginning and ending points of the speech signal, the algorithm decreases the window duration. The word starts with an “f” sound, which is unvoiced, and ends with an “r” sound, which is unvoiced.
In the frame-by-frame processing stage, the speech signal is segmented into non-overlapping frames of samples and processed frame by frame until the entire speech signal is covered. Table 1 lists the voiced/unvoiced decisions for the word “four.” The signal has 3600 samples at an 8000 Hz sampling rate. At the beginning, we set the frame size to 400 samples (50 ms). If the decision is not clear at the end of the algorithm, the energy and zero-crossing rate are recalculated after dividing the frame in question into two half-length frames. This can be seen for Frames 2, 7, and 8 in Table 1.
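The frame-halving procedure described above can be sketched as follows. At 8000 Hz, a 400-sample frame lasts 400 / 8000 = 50 ms, and two successive halvings give 25 ms and 12.5 ms, which matches the durations in Table 1. Everything else in this sketch is an assumption made for illustration: the function names, the threshold values `zcr_max` and `energy_min`, the use of per-sample mean energy, and the reading of "not clear" as "the two features disagree". The paper does not state its thresholds.

```python
def classify_frame(frame, zcr_max, energy_min):
    """Return 'voiced', 'unvoiced', or 'unsure' for one frame."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    rate = crossings / len(frame)
    # Mean energy per sample, so one threshold works for every frame length (assumption).
    energy = sum(s * s for s in frame) / len(frame)
    low_zcr = rate <= zcr_max
    high_energy = energy >= energy_min
    if low_zcr and high_energy:
        return "voiced"
    if not low_zcr and not high_energy:
        return "unvoiced"
    return "unsure"  # features disagree: treated here as "decision not clear"


def segment(x, frame_len=400, min_len=100, zcr_max=0.1, energy_min=0.01):
    """Yield (start, length, label) for non-overlapping frames, halving unsure ones."""
    def visit(start, length):
        label = classify_frame(x[start:start + length], zcr_max, energy_min)
        if label == "unsure" and length // 2 >= min_len:
            half = length // 2
            yield from visit(start, half)
            yield from visit(start + half, length - half)
        else:
            yield start, length, label

    for start in range(0, len(x) - frame_len + 1, frame_len):
        yield from visit(start, frame_len)
```

With these defaults a frame is split at most twice (400, 200, then 100 samples); a frame that is still ambiguous at the minimum length keeps the label `unsure` in this sketch, which is one possible design choice among several.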
IV. CONCLUSION
REFERENCES
[1] J. K. Lee, C. D. Yoo, “Wavelet speech enhancement
based on voiced/unvoiced decision”, Korea Advanced