
Voiced/Unvoiced Decision For Speech Signals Based On Zero-Crossing Rate and Energy

This document summarizes a research paper that proposes using zero-crossing rate and energy to classify speech segments as voiced or unvoiced. It describes dividing speech samples into frames and calculating the zero-crossing rate and energy of each frame. A low zero-crossing rate and high energy indicates voiced speech, while a high zero-crossing rate and low energy indicates unvoiced speech. The method is presented as a simple and fast approach to speech segmentation that could overcome issues associated with other techniques. Evaluation of the zero-crossing rate and energy calculations on divided speech segments showed the method effectively separates voiced and unvoiced parts of speech.


Chapter: Advanced Techniques in Computing Sciences and Software Engineering,

pp 279-282, 2010; DOI 10.1007/978-90-481-3660-5_47

Voiced/Unvoiced Decision for Speech Signals Based on Zero-Crossing Rate and Energy
Bachu R.G., Kopparthi S., Adapa B., Barkana B.D.
Department of Electrical Engineering
School of Engineering, University of Bridgeport
221 University Ave. Bridgeport, CT 06604, USA

Abstract--In speech analysis, the voiced-unvoiced decision is usually performed when extracting information from speech signals. In this paper, two methods are used to separate the voiced and unvoiced parts of speech signals: zero-crossing rate (ZCR) and energy. We evaluated the results by dividing the speech sample into segments and using the zero-crossing rate and energy calculations to separate the voiced and unvoiced parts of speech. The results suggest that zero-crossing rates are low for voiced parts and high for unvoiced parts, whereas the energy is high for voiced parts and low for unvoiced parts. Therefore, these methods prove effective in separating voiced and unvoiced speech.

I. INTRODUCTION

Speech can be divided into numerous voiced and unvoiced regions. The classification of a speech signal into voiced and unvoiced regions provides a preliminary acoustic segmentation for speech processing applications, such as speech synthesis, speech enhancement, and speech recognition.

“Voiced speech consists of more or less constant frequency tones of some duration, made when vowels are spoken. It is produced when periodic pulses of air generated by the vibrating glottis resonate through the vocal tract, at frequencies dependent on the vocal tract shape. About two-thirds of speech is voiced and this type of speech is also what is most important for intelligibility. Unvoiced speech is non-periodic, random-like sounds, caused by air passing through a narrow constriction of the vocal tract as when consonants are spoken. Voiced speech, because of its periodic nature, can be identified, and extracted [1]”.

In recent years, considerable effort has been spent by researchers on the problem of classifying speech into voiced/unvoiced parts [2-8]. Pattern recognition approaches and statistical and non-statistical techniques have been applied to decide whether a given segment of a speech signal should be classified as voiced or unvoiced speech [2, 3, 5, 7]. Qi and Hunt classified voiced and unvoiced speech using non-parametric methods based on a multi-layer feed-forward network [4]. Acoustical features and pattern recognition techniques were used to separate speech segments into voiced/unvoiced [8].

The method we used in this work is a simple and fast approach and may overcome the problem of classifying speech into voiced/unvoiced using the zero-crossing rate and energy of a speech signal. The methods used in this study are presented in the second part. The results are given in the third part.

II. METHOD

In our design, we combined zero-crossing rate and energy calculations. Zero-crossing rate is an important parameter for voiced/unvoiced classification. It is also often used as part of the front-end processing in automatic speech recognition systems. The zero-crossing count is an indicator of the frequency at which the energy is concentrated in the signal spectrum. Voiced speech is produced by excitation of the vocal tract by the periodic flow of air at the glottis and usually shows a low zero-crossing count [9], whereas unvoiced speech is produced by a constriction of the vocal tract narrow enough to cause turbulent airflow, which results in noise and shows a high zero-crossing count.

The energy of the speech signal is another parameter for classifying the voiced/unvoiced parts. The voiced part of the speech has high energy because of its periodicity, and the unvoiced part has low energy. The analysis for classifying the voiced/unvoiced parts of speech is illustrated in the block diagram in Fig. 1.

At the first stage, the speech signal is divided into non-overlapping frames, as shown in Fig. 2.

A. End-Point Detection

One of the most basic but problematic aspects of speech processing is detecting when a speech utterance starts and ends. This is called end-point detection. When unvoiced sounds occur at the beginning or end of the utterance, it is difficult to distinguish the speech signal accurately from the background noise signal.

In this work, end-point detection is applied at the beginning of the voiced/unvoiced algorithm to separate silence from the speech signal. A small sample of the background noise is taken during the silence interval just prior to the commencement of the speech signal. The short-time energy function of the entire utterance is then computed using Eq. 4.
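The short-time energy mentioned here (Eq. 4, with the Hamming window of Eq. 5) could be computed per frame roughly as follows. This is a minimal sketch in Python rather than the MATLAB used by the authors. The function names `hamming` and `short_time_energy` are illustrative, and evaluating the energy once per non-overlapping frame (instead of at every sample index n) is an assumption made to match the frame-by-frame processing described in this paper.

```python
import math

def hamming(N):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(N-1)) for 0 <= n <= N-1."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def short_time_energy(x, frame_len):
    """Energy of each non-overlapping frame: sum over the frame of (x[m] * w[m])**2.

    Trailing samples that do not fill a whole frame are ignored (an assumption).
    """
    w = hamming(frame_len)
    energies = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        energies.append(sum((s * wk) ** 2 for s, wk in zip(frame, w)))
    return energies
```

For example, `short_time_energy(samples, 400)` would return one energy value per 50 ms frame for a signal sampled at 8000 Hz. Because the Hamming window is symmetric, the time reversal implied by w(n - m) in Eq. 4 does not change the result.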

[Fig. 1 block diagram: speech signal x(n) -> end-point detection -> frame-by-frame signal processing -> Hamming window -> short-time energy calculation (E) and short-time average zero-crossing rate calculation (ZCR) -> decision "if ZCR is small and E is high": yes -> voiced speech signals; no -> unvoiced speech signals; not sure -> subdivision of the frame.]

Fig. 1: Block diagram of the voiced/unvoiced classification.

Fig. 2: Frame-by-frame processing of speech signal.

A speech threshold is determined which takes into account the silence energy and the peak energy. Initially, the endpoints are assumed to occur where the signal energy crosses this threshold. Corrections to these initial estimates are made by computing the zero-crossing rate in the vicinity of the endpoints and comparing it with that of the silence. If detectable changes in zero-crossing rate occur outside the initial thresholds, the endpoints are re-designated to the points at which the changes take place [10-11].

B. Zero-Crossing Rate

In the context of discrete-time signals, a zero crossing is said to occur if successive samples have different algebraic signs. The rate at which zero crossings occur is a simple measure of the frequency content of a signal. Zero-crossing rate is a measure of the number of times in a given time interval/frame that the amplitude of the speech signal passes through a value of zero (Fig. 3 and Fig. 4). Speech signals are broadband signals, and interpretation of the average zero-crossing rate is therefore much less precise. However, rough estimates of spectral properties can be obtained using a representation based on the short-time average zero-crossing rate [12].

Fig. 3: Definition of zero-crossings rate
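The definition above (a crossing occurs whenever two successive samples differ in sign) can be illustrated with a short sketch. It is written in Python rather than the authors' MATLAB, and the function names are illustrative. Treating a sample of exactly zero as positive, and normalizing the rate by 2N as in the rectangular window of Eq. 3, are the conventions assumed here.

```python
def sign(v):
    """Sign convention assumed here: +1 for v >= 0, -1 for v < 0."""
    return 1 if v >= 0 else -1

def zero_crossing_count(frame):
    """Number of sign changes between successive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if sign(a) != sign(b))

def zero_crossing_rate(frame):
    """Sum of |sgn[x(m)] - sgn[x(m-1)]| over a non-empty frame, weighted by 1/(2N).

    Each crossing contributes |(+1) - (-1)| = 2, so the result equals
    the crossing count divided by the frame length N.
    """
    N = len(frame)
    total = sum(abs(sign(b) - sign(a)) for a, b in zip(frame, frame[1:]))
    return total / (2.0 * N)
```

As a small check, the frame `[1, -1, 1, -1]` changes sign three times, so `zero_crossing_count` gives 3 and `zero_crossing_rate` gives 3/4. The raw count is the form that appears to be reported per frame in Table I, while the normalized rate is easier to compare across frames of different length.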


Fig. 4: Distribution of zero-crossings for unvoiced and voiced speech [12].

A definition for zero-crossings rate is:

$$Z_n = \sum_{m=-\infty}^{\infty} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| \, w(n-m) \tag{1}$$

where

$$\operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases} \tag{2}$$

and

$$w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

The model for speech production suggests that the energy of voiced speech is concentrated below about 3 kHz because of the spectrum fall-off introduced by the glottal wave, whereas for unvoiced speech, most of the energy is found at higher frequencies. Since high frequencies imply high zero-crossing rates, and low frequencies imply low zero-crossing rates, there is a strong correlation between zero-crossing rate and energy distribution with frequency. A reasonable generalization is that if the zero-crossing rate is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the speech signal is voiced [12].

C. Short-Time Energy

The amplitude of the speech signal varies with time. Generally, the amplitude of unvoiced speech segments is much lower than the amplitude of voiced segments. The energy of the speech signal provides a representation that reflects these amplitude variations. Short-time energy can be defined as:

$$E_n = \sum_{m=-\infty}^{\infty} \left[ x(m) \, w(n-m) \right]^2 \tag{4}$$

The choice of the window determines the nature of the short-time energy representation. In our model, we used a Hamming window. The Hamming window gives much greater attenuation outside the passband than the comparable rectangular window.

$$h(n) = \begin{cases} 0.54 - 0.46 \cos\!\left( \dfrac{2\pi n}{N-1} \right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

The attenuation of this window is independent of the window duration. Increasing the length, N, decreases the bandwidth (Fig. 5). If N is too small, E_n will fluctuate very rapidly depending on the exact details of the waveform. If N is too large, E_n will change very slowly and thus will not adequately reflect the changing properties of the speech signal [12].

Fig. 5: Computation of Short-Time Energy [12].

III. RESULTS

MATLAB 7.0 is used for our calculations. We chose MATLAB as our programming environment as it offers many advantages. It contains a variety of signal processing and statistical tools, which help users in generating a variety of signals and plotting them. MATLAB excels at numerical computations, especially when dealing with vectors or matrices of data.

One of the speech signals used in this study is shown in Fig. 6. The proposed voiced/unvoiced classification algorithm uses the short-time zero-crossing rate and energy of the speech signal. The signal is initially windowed with a rectangular window of 50 ms duration. The algorithm halves the window duration at each feedback if the decision is not clear. The results of the voiced/unvoiced decision using our model are presented in Table 1.

[Fig. 6 plot: amplitude (-0.2 to 0.2) versus samples (0 to 8000), titled "Four".]

Fig. 6: Original speech signal for the word “four.”

TABLE I. VOICED/UNVOICED DECISIONS FOR THE WORD “FOUR” USING THE MODEL.

Frame                  ZCR    Energy (J)   Decision
Frame-1 (50 ms)        152    0.0018       unvoiced
Frame-21 (25 ms)       52     0.0543       unvoiced
Frame-22 (25 ms)       19     21.1189      voiced
Frame-3 (50 ms)        41     186.6628     voiced
Frame-4 (50 ms)        41     230.5772     voiced
Frame-5 (50 ms)        43     252.98       voiced
Frame-6 (50 ms)        56     193.70       voiced
Frame-71 (25 ms)       31     27.2842      voiced
Frame-72 (25 ms)       30     25.960       voiced
Frame-811 (12.5 ms)    24     3.4214       voiced
Frame-812 (12.5 ms)    11     0.4765       unvoiced
Frame-82 (25 ms)       19     0.166        unvoiced
Frame-9 (50 ms)        89     0.0054       unvoiced

The frame-by-frame representation of the algorithm is presented in Fig. 7. At the beginning and ending points of the speech signal, the algorithm decreases the window duration. The word starts with an “f” sound, which is unvoiced. The word ends with an “r” sound, which is unvoiced.

In the frame-by-frame processing stage, the speech signal is segmented into non-overlapping frames of samples. It is processed frame by frame until the entire speech signal is covered. Table 1 includes the voiced/unvoiced decisions for the word “four.” It has 3600 samples at a sampling rate of 8000 Hz. At the beginning, we set the frame size to 400 samples (50 ms). If the decision is not clear at the end of the algorithm, the energy and zero-crossing rate are recalculated by dividing the related frame into two frames. This can be seen for Frames 2, 7, and 8 in Table 1.
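The halving procedure described above can be sketched as a small recursive routine. This is an illustration in Python, not the authors' MATLAB code. The paper does not publish its decision thresholds, so the values in `THRESHOLDS` are hypothetical. Normalizing the zero-crossing count and energy by the frame length (so that the same limits apply after halving), the minimum frame length of 100 samples (12.5 ms at 8000 Hz, the shortest frame in Table I), and the energy-only fallback at that minimum length are all assumptions.

```python
# Hypothetical thresholds: the paper does not state its decision limits.
THRESHOLDS = {"zcr_low": 0.10, "zcr_high": 0.25, "e_low": 1e-4, "e_high": 1e-3}

def zcr(frame):
    """Zero crossings per sample: sign changes between successive samples."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / len(frame)

def mean_energy(frame):
    """Mean-square energy under a rectangular window."""
    return sum(s * s for s in frame) / len(frame)

def classify_frame(frame, label, th=THRESHOLDS, min_len=100):
    """Return a list of (label, length, decision), halving unclear frames."""
    z, e = zcr(frame), mean_energy(frame)
    if z <= th["zcr_low"] and e >= th["e_high"]:
        return [(label, len(frame), "voiced")]
    if z >= th["zcr_high"] and e <= th["e_low"]:
        return [(label, len(frame), "unvoiced")]
    half = len(frame) // 2
    if half < min_len:
        # Assumed fallback once a frame cannot be halved further: energy decides.
        decision = "voiced" if e >= th["e_high"] else "unvoiced"
        return [(label, len(frame), decision)]
    # Decision not clear: split the frame in two and classify each half.
    return (classify_frame(frame[:half], label + "1", th, min_len)
            + classify_frame(frame[half:], label + "2", th, min_len))

def classify_signal(x, frame_len=400, th=THRESHOLDS, min_len=100):
    """Classify non-overlapping frames of x; 400 samples is 50 ms at 8000 Hz."""
    results = []
    starts = range(0, len(x) - frame_len + 1, frame_len)
    for i, start in enumerate(starts, start=1):
        results += classify_frame(x[start:start + frame_len], str(i), th, min_len)
    return results
```

The labels follow the naming seen in Table I: if frame 2 is unclear it is reported as frames 21 and 22, and an unclear frame 81 becomes 811 and 812. Samples left over after the last full frame are dropped in this sketch. In practice the thresholds would have to be tuned to the recording level and noise floor of the data at hand.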

Fig. 7: Representation of the frames.

IV. CONCLUSION

We have presented an approach for separating the voiced/unvoiced parts of speech in a simple and efficient way. The algorithm shows good results in classifying the speech as we segmented speech into many frames. In our future study, we plan to improve our results for voiced/unvoiced discrimination in noise.

ACKNOWLEDGEMENT

This paper is presented at the International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 08), December 5-13, 2008.

REFERENCES
[1] J. K. Lee, C. D. Yoo, “Wavelet speech enhancement
based on voiced/unvoiced decision”, Korea Advanced
Institute of Science and Technology, The 32nd
International Congress and Exposition on Noise
Control Engineering, Jeju International Convention
Center, Seogwipo, Korea, August 25-28, 2003.
[2] B. Atal, and L. Rabiner, “A Pattern Recognition
Approach to Voiced-Unvoiced-Silence Classification
with Applications to Speech Recognition,” IEEE
Trans. On ASSP, vol. ASSP-24, pp. 201-212, 1976.
[3] S. Ahmadi, and A.S. Spanias, “Cepstrum-Based Pitch
Detection using a New Statistical V/UV Classification
Algorithm,” IEEE Trans. Speech Audio Processing,
vol. 7 No. 3, pp. 333-338, 1999.
[4] Y. Qi, and B.R. Hunt, “Voiced-Unvoiced-Silence
Classifications of Speech using Hybrid Features and a
Network Classifier,” IEEE Trans. Speech Audio
Processing, vol. 1 No. 2, pp. 250-255, 1993.
[5] L. Siegel, “A Procedure for using Pattern Classification
Techniques to obtain a Voiced/Unvoiced Classifier”,
IEEE Trans. on ASSP, vol. ASSP-27, pp. 83- 88, 1979.
[6] T.L. Burrows, “Speech Processing with Linear and
Neural Network Models”, Ph.D. thesis, Cambridge
University Engineering Department, U.K., 1996.
[7] D.G. Childers, M. Hahn, and J.N. Larar, “Silent and
Voiced/Unvoiced/Mixed Excitation (Four-Way)
Classification of Speech,” IEEE Trans. on ASSP, vol.
37 No. 11, pp. 1771-1774, 1989.
[8] J. K. Shah, A. N. Iyer, B. Y. Smolenski, and R. E.
Yantorno “Robust voiced/unvoiced classification using
novel features and Gaussian Mixture model”, Speech
Processing Lab., ECE Dept., Temple University, 1947
N 12th St., Philadelphia, PA 19122-6077, USA.
[9] J. Marvan, “Voice Activity detection Method and
Apparatus for voiced/unvoiced decision and Pitch
Estimation in a Noisy speech feature extraction”,
08/23/2007, United States Patent 20070198251.
[10] T. F. Quatieri, Discrete-Time Speech Signal
Processing: Principles and Practice, MIT Lincoln
Laboratory, Lexington, Massachusetts, Prentice Hall,
2001, ISBN-13:9780132429429.
[11] F.J. Owens, Signal Processing of Speech, McGraw-
Hill, Inc., 1993, ISBN-0-07-0479555-0.
[12] L. R. Rabiner, and R. W. Schafer, Digital Processing of
Speech Signals, Englewood Cliffs, New Jersey,
Prentice Hall, 512-ISBN-13:9780132136037, 1978.
