
Anale. Seria Informatică. Vol. XV, fasc. 1 – 2017
Annals. Computer Science Series. 15th Tome, 1st Fasc. – 2017

PREPROCESSING TECHNIQUE IN AUTOMATIC SPEECH RECOGNITION
FOR HUMAN COMPUTER INTERACTION: AN OVERVIEW

Yakubu A. Ibrahim¹, Juliet C. Odiketa², Tunji S. Ibiyemi³

¹ Department of Computer Science, Bingham University, Karu, Nigeria
² Department of Computer Science, The Federal Polytechnic Idah, Idah, Nigeria
³ Department of Electrical Engineering, University of Ilorin, Ilorin, Nigeria

Corresponding Author: Yakubu A. Ibrahim, [email protected]

ABSTRACT: Automatic Speech Recognition has found application in various aspects of our daily lives, such as automatic phone answering services, dictating text and issuing voice commands to computers. Speech recognition is one of the fastest developing fields in speech science and engineering, and in computing technology it has emerged as the next major innovation in human-computer interaction. In speech signal processing, pre-processing plays a vital role in the development of an efficient automatic speech recognition system. Humans are nowadays able to interact with computer hardware and other machines through human language; in view of this, researchers are working to develop accurate and efficient speech recognition systems, but machines are still unable to match human performance in terms of matching accuracy and speed of response. The choice of signal preprocessing therefore depends on the application at hand and on the drawbacks of the available ASR techniques. The preprocessing steps discussed in this study are: noise removal, voice activity detection, pre-emphasis, framing and windowing.
KEYWORDS: Automatic Speech Recognition (ASR), Human Computer Interaction (HCI), Pre-processing.

I INTRODUCTION

Speech is the most natural form of human-to-human communication and is related to human physiological capability. It is the most important, effective and convenient form of information exchange. Speech processing is a broad subject and a popular research field, which involves a wide range of content ([ZB15]). In an Automatic Speech Recognition system, the first phase is the pre-processing phase. Pre-processing of speech is particularly important in applications where silence or ambient noise is completely undesirable. Voice activity detection is a well-known technique that has been used for many years in the preprocessing of speech signals; noise cancelling, pre-emphasis and dimensionality reduction of speech make the system computationally more efficient. This type of classification of speech into voiced or silence/unvoiced sounds ([D+00]) also finds application in fundamental frequency estimation, formant extraction or syllable marking, stop consonant identification, and endpoint detection for isolated utterances. There are several ways of classifying (labeling) events in speech. It is accepted convention to use a three-state representation in which the states are: (i) silence (S), where no speech is produced; (ii) unvoiced (U), in which the vocal cords ([AR76]) are not vibrating, so the resulting speech waveform is aperiodic or random in nature; and (iii) voiced (V), in which the vocal cords are tensed and vibrate periodically as air flows from the lungs, so the resulting waveform is quasi-periodic ([CHL89]).

II PREPROCESSING

In the development of an ASR system, preprocessing is the first of the phases of speech recognition; it separates the voiced from the unvoiced signal and prepares the signal for the creation of feature vectors. Preprocessing adjusts or modifies the speech signal, x(n), so that it is more suitable for feature extraction and analysis. The first factor to consider in speech signal processing is whether the speech x(n) is corrupted by some background or ambient noise d(n), for example as an additive disturbance:

x(n) = s(n) + d(n)    (1)

where s(n) is the clean speech signal. In noise reduction, different methods can be adopted to perform the task on a noisy speech signal. The two most frequently used noise reduction algorithms in speech recognition systems are spectral subtraction and adaptive noise cancellation ([D+00]).
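The paper stops at naming these two methods; as a minimal sketch of the first, the following Python/NumPy code performs magnitude spectral subtraction. It is not the authors' implementation: the FFT size, the assumption that the first quarter second of the recording is noise-only, and the zero flooring are all illustrative choices.

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(x, fs, noise_seconds=0.25, nperseg=512):
        """Subtract an estimated noise magnitude spectrum from x."""
        f, t, X = stft(x, fs, nperseg=nperseg)
        hop = nperseg // 2                                # default hop of scipy's stft
        n_noise = max(1, int(noise_seconds * fs / hop))   # lead-in frames assumed noise-only
        noise_mag = np.abs(X[:, :n_noise]).mean(axis=1, keepdims=True)
        mag = np.maximum(np.abs(X) - noise_mag, 0.0)      # floor negative magnitudes at zero
        X_hat = mag * np.exp(1j * np.angle(X))            # reuse the noisy phase
        _, s_hat = istft(X_hat, fs, nperseg=nperseg)
        return s_hat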

(a) BACKGROUND/AMBIENT NOISE REMOVAL

The ability to extract the useful parts of a speech signal from a stream of signals is of high importance during the initial processing stages of an audio analysis system. Ambient noise is any signal other than the signal being monitored; it is a form of noise pollution or interference. Background noise is in fact an important concept in setting noise levels in ASR systems: the performance of speech recognition systems degrades drastically when training and testing are carried out at different noise levels. The Signal-to-Noise Ratio (SNR) is the ratio of the power of the correct signal to that of the noise ([ZM04]) and is usually measured in decibels (dB):

SNR = 20 \log_{10}(V_{signal} / V_{noise})    (2)

where V_{signal} is the voltage of the correct signal and V_{noise} is the voltage of the noise. Background or ambient noise is typically produced by air conditioning systems, fans, fluorescent lamps, typewriters, computer systems, background conversation, footsteps, traffic, alarms, birds, and the opening and closing of doors. The developers of an ASR system usually have little control over these noises in real-life environments. Such noise is additive in nature and usually steady-state, except for impulsive noise sources like typewriters ([HC04]). In the training and testing stages, the most frequently used way to reduce the effect of ambient noise on speech recognition is a close-talk microphone: when a speaker produces an utterance at normal communication level, the average signal-to-noise ratio (speech level) increases by about 3 dB whenever the microphone is filtering the speech utterance. The filter adopted to remove the background or ambient noise is as follows ([JMR94]):

E_s = \log\left(\epsilon + \sum_{n=1}^{N} s^2(n)\right)    (3)

where E_s is the log energy of a block of N samples, ϵ is a small positive constant added to prevent taking the log of zero, and s(n) is the nth speech sample in the block.
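As a small worked illustration of equations (2) and (3), a sketch (the value of ϵ is an arbitrary small constant, as the text requires):

    import numpy as np

    def snr_db(v_signal, v_noise):
        # Eq. (2): SNR in decibels from signal and noise voltages
        return 20.0 * np.log10(v_signal / v_noise)

    def block_log_energy(s, eps=1e-10):
        # Eq. (3): log energy of a block of N samples; eps prevents log(0)
        return np.log(eps + np.sum(np.asarray(s, dtype=float) ** 2))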
(b) VOICE ACTIVITY DETECTION / SPEECH WORD DETECTION

Locating the endpoints of a signal in speech is a major problem for the speech recognizer: inaccurate endpoint detection decreases its performance. Although detecting the endpoints of a speech utterance seems relatively trivial, it has been found to be very difficult in practice in speech recognition systems. When a proper SNR is given, the task of developing an ASR system is made easier. Voice activity detectors (VAD) are devices used to divide the speech signal into voiced or unvoiced speech segments and non-speech segments. The non-speech or unvoiced parts of an utterance are pre-utterance, post-utterance and between-word silences. Methods or algorithms for automatically detecting the non-speech parts of an utterance are necessary for a wide range of applications such as speech coding, speech recognition and speech enhancement. When the noise characteristics are estimated during non-speech segments, VADs have to adapt to changes in those characteristics ([MJR92]), and robustness against noise variations is difficult to obtain. Unvoiced segments of the speech signal are more difficult to detect than voiced segments, because they are more similar to the noise, and the SNR is generally lower in unvoiced than in voiced segments. Speech recognition commonly adopts the following techniques for voice activity detection:

1. THE ZERO-CROSSING RATE

The ZCR of a speech signal frame is the rate of sign changes of the signal during the frame. In other words, it is the number of times the signal changes sign, from positive to negative and vice versa, divided by the length of the frame. The ZCR is defined according to the following equation:

Z(i) = \frac{1}{2 W_L} \sum_{n=1}^{W_L} \left| \mathrm{sgn}(x_i(n)) - \mathrm{sgn}(x_i(n-1)) \right|    (4)

where sgn() is the sign function, that is

\mathrm{sgn}(x_i(n)) = \begin{cases} 1, & x_i(n) \ge 0 \\ -1, & x_i(n) < 0 \end{cases}    (5)

The ZCR is used to discern unvoiced speech: unvoiced speech usually has a low short-term energy but a high ZCR.
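A direct transcription of equations (4) and (5) for a single frame, as a sketch (the frame is assumed to be a NumPy array):

    import numpy as np

    def zero_crossing_rate(frame):
        # Eq. (5): sgn() maps non-negative samples to +1, negative to -1
        signs = np.where(np.asarray(frame) >= 0, 1, -1)
        # Eq. (4): half the summed absolute sign differences, per sample
        return np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame))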
2. ENERGY (ENTROPY OF ENERGY)

Let x_i(n), n = 1, ..., W_L, be the sequence of audio samples of the ith frame, where W_L is the length of the frame. The short-term energy is computed according to the equation ([TA14]):

E(i) = \sum_{n=1}^{W_L} |x_i(n)|^2    (6)

Usually, the energy is normalized by dividing it by W_L to remove the dependency on the frame length. Equation (6), which then provides the so-called power of the signal, becomes:

E(i) = \frac{1}{W_L} \sum_{n=1}^{W_L} |x_i(n)|^2    (7)

Short-term energy has been observed to be the most effective energy parameter for VAD. The speech signal has most of its energy collected in the lower frequencies, whereas most of the energy of unvoiced speech lies in the higher frequencies ([L+81]). The short-term entropy of energy can be interpreted as a measure of abrupt changes in the energy level of a speech signal. To compute it, divide each short-term frame into K sub-frames of fixed duration. Then, for each sub-frame j, compute its energy as in Equation (6) and divide it by the total energy, E_{shortFrame_i}, of the short-term frame. This division is a standard procedure that allows the resulting sequence of sub-frame energy values e_j, j = 1, ..., K, to be treated as a sequence of probabilities, as in Equation (8) ([TA14]):

e_j = \frac{E_{subFrame_j}}{E_{shortFrame_i}}    (8)

where

E_{shortFrame_i} = \sum_{k=1}^{K} E_{subFrame_k}    (9)

As a final step, the entropy H(i) of the sequence e_j is computed according to the equation:

H(i) = -\sum_{j=1}^{K} e_j \log_2 e_j    (10)

The resulting value is lower if abrupt changes in the energy envelope of the frame exist. This is because, if a sub-frame yields a high energy value, one of the resulting probabilities will be high, which in turn reduces the entropy of the sequence e_j.
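Equations (7)-(10) transcribe roughly as follows (a sketch: the number of sub-frames K and the small eps guarding log(0) are illustrative choices, not values from the paper):

    import numpy as np

    def short_term_power(frame):
        # Eq. (7): energy normalized by the frame length W_L
        frame = np.asarray(frame, dtype=float)
        return np.sum(frame ** 2) / len(frame)

    def entropy_of_energy(frame, K=10, eps=1e-12):
        # Eqs. (8)-(9): sub-frame energies normalized to probabilities e_j
        sub = np.array_split(np.asarray(frame, dtype=float), K)
        e = np.array([np.sum(s ** 2) for s in sub])
        p = e / (np.sum(e) + eps)
        # Eq. (10): entropy of the sequence e_j
        return -np.sum(p * np.log2(p + eps))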
3. THE AUTOCORRELATION FUNCTION

The autocorrelation function computes the correlation of a signal with itself as a function of the time lag. The normalized autocorrelation coefficient at unit sample delay, C_1, is defined as ([BVN12]):

C_1 = \frac{\sum_{n=1}^{N} s(n)\, s(n-1)}{\sqrt{\left(\sum_{n=1}^{N} s^2(n)\right)\left(\sum_{n=0}^{N-1} s^2(n)\right)}}    (11)

where s(n), n = 0, ..., N, are the samples of the frame.
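Assuming the reconstruction of equation (11) above, a sketch of the unit-delay coefficient:

    import numpy as np

    def unit_delay_autocorrelation(frame):
        # Eq. (11): correlation of the frame with itself at lag one,
        # normalized so that the result lies in [-1, 1]
        s = np.asarray(frame, dtype=float)
        num = np.sum(s[1:] * s[:-1])
        den = np.sqrt(np.sum(s[1:] ** 2) * np.sum(s[:-1] ** 2))
        return num / den if den > 0 else 0.0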


[Figure 1: Block diagram of endpoint detection ([BVN12]): the speech signal is split into blocks of samples; the zero-crossing rate, entropy of energy and autocorrelation are computed for each block; distances are computed and the minimum distance selected to reach a voiced/unvoiced/silence decision.]

(ii) FILTER FOR END POINT DETECTION

Filters are widely employed in signal processing and communication systems, in applications such as channel equalization, noise reduction, radar, audio processing, video processing, biomedical signal processing, and the analysis of economic and financial data. The essence of the filter can also be described as a process of flattening, whereby the spectrum is whitened. Speech may have several components separated by pauses, and every component can be determined by detecting a pair of endpoints, the component's beginning and ending points. In the energy contour of speech, a rising edge always follows a beginning point and a falling edge always precedes an ending point ([BR04]); these are known as the beginning and ending edges of the speech signal. Here, the low-complexity short-term energy computed alongside the cepstral features is adopted as the feature for endpoint detection. The energy filter is given as:

E(L) = 10 \log_{10} \sum_{j=n(L)}^{n(L)+l-1} o^2(j)    (12)

where o(j) is a data sample, L is the frame number, l is the window length, E(L) is the frame energy in decibels, and n(L) is the index of the first data sample in the window. Thus, the detected endpoints can be aligned to the ASR feature vectors automatically, and the computation can be reduced from the speech-sampling rate to the frame rate.

For correct and effective endpoint detection, we need a good detector that can find all available endpoints from the energy feature. Since the output of the detector may contain false acceptances, a decision module is then required to make the final decisions based on the detector output. However, since endpoints always come with edges, the intention is to detect the edges first and then to find the corresponding endpoints ([RS78]).

(iii) ENERGY NORMALIZATION

The aim of this stage is to normalize the speech energy E(l). Energy normalization is performed by finding the maximum energy value E_{max} over the spoken words:

E_{max} = \max_l E(l)    (13)

and subtracting E_{max} from E(l) to give

\tilde{E}(l) = E(l) - E_{max}    (14)

In this way the peak energy value of each word is zero decibels, and the recognition system is relatively insensitive to gain differences between recordings. A constraint on the above calculation is that word energy contour normalization cannot take place until the end of the word has been located ([Kul84]).
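Equations (13) and (14) amount to shifting the frame-energy contour so its peak sits at 0 dB; a sketch:

    import numpy as np

    def normalize_energy_contour(E_db):
        # Eqs. (13)-(14): subtract the maximum energy from every frame.
        # As the text notes, this can only run once the end of the word
        # has been located, since the maximum is taken over the whole word.
        E_db = np.asarray(E_db, dtype=float)
        return E_db - np.max(E_db)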
resultant slope of approximately -6dB exists in the
(c) PRE-EMPHASIS recorded voiced speech sounds. Pre-emphasis is
performed to remove this slope of -6 dB.
A spoken audio signal may have frequency To accomplish the task, the speech signal is passed
components that fall off at high frequencies. As a through a high-pass finite impulse response (FIR) filter
matter of fact, in some systems such as speech coding, of order 1. The pre-emphasis is defined by ([Kul84]):
to avoid overlooking the high frequencies, the high-
frequency components are compensated using pre- (18)
emphasis filtering ([Pic93]). Pre-emphasis is therefore,
aimed at compensating for lip radiation and necessary Where, s[n] is the nth speech sample, y[n] is the
attenuation of high frequencies in the sampling corresponding pre-emphasized sample and P is the
process. High frequency components are emphasized pre-emphasis factor typically having a value
and low frequency components are attenuated. This is between 0:9 and 1. Pre-emphasis ensures that in the
quite a standard preprocessing step. The digitized frequency domain all the formats of the speech
speech waveform has a high dynamic range and suffers signal have similar amplitude so that they get equal
from additive noise to reduce this range pre-emphasis importance in subsequent processing stages
is applied. By pre-emphasis, we imply the application ([D+00]). In the frequency domain, it looks like:
of a high pass filter, which is usually a first-order FIR
of the form ([Q+07]): (19)
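Equation (18) is a single-tap FIR difference; a sketch (P = 0.97 is a common choice within the 0.9-1 range mentioned, not a value taken from the paper):

    import numpy as np

    def preemphasize(s, P=0.97):
        # Eq. (18): y[n] = s[n] - P * s[n-1]; the first sample is kept as-is
        s = np.asarray(s, dtype=float)
        return np.append(s[0], s[1:] - P * s[:-1])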

(d) FRAMING OR FRAME BLOCKING

Framing is the process of breaking the continuous stream of speech samples into components of constant length, to facilitate block-wise processing of the signal.

Speech can be thought of as a quasi-stationary signal: it is stationary only over a short period of time ([BVN12]). The speech signal varies slowly over time, so when it is examined over a short enough interval (5-100 ms) it is fairly stationary. Speech signals are therefore often analyzed in short-time components, which is sometimes referred to as short-time spectral analysis in speech processing. In practice this means that the signal is divided, or blocked, into frames of typically 20-30 ms. Adjacent frames normally overlap each other by 30-50%, so that no vital information of the speech signal is lost due to the windowing.
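A sketch of frame blocking with overlap; the sizes assume a hypothetical 16 kHz sampling rate, where a 25 ms frame with 40% overlap (inside the 20-30 ms and 30-50% ranges above) gives frame_len = 400 and hop = 240:

    import numpy as np

    def frame_signal(x, frame_len=400, hop=240):
        # Stack overlapping frames of constant length for block-wise processing
        x = np.asarray(x)
        n_frames = 1 + (len(x) - frame_len) // hop
        return np.stack([x[i * hop : i * hop + frame_len]
                         for i in range(n_frames)])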
(e) WINDOWING

Once the signal has been framed into segments, each frame is multiplied by a window function w(n) of length N, where N is the length of the frame. Windowing is the process of multiplying a speech signal segment by a time window of a given shape, to stress pre-defined characteristics of the signal. To reduce the discontinuities of the speech signal at the beginning and end of each frame, the signal should be tapered to zero or close to zero, thereby minimizing the mismatch. This is achieved by windowing each frame of the signal, which increases the correlation of the Mel Frequency Cepstrum Coefficient (MFCC) spectral estimates between consecutive frames ([BVN12]). ASR system designers have always faced a compromise in their selection of the analysis window: a long window is desirable for good frequency resolution, but the linguistic importance of some short transients makes a short window desirable and effective. The usual compromise is a frame length of about 20 or 30 ms with a frame spacing of 5 to 10 ms. On the other hand, a shorter window is adequate to capture the salient spectral features, provided that the frame spacing is also sufficiently short; when, for example, an 8 ms window with 2 ms frame spacing is adopted and the feature curves are represented in this way, the frequency resolution appears very similar to that obtained with the longer window. Windowing is always applied to a speech signal to avoid problems due to truncation of the signal, as it helps to smooth the signal ([ZM04]).
The proper choice of the window w(n) is a trade-off between different factors: (i) the shape of the window may reduce discontinuities, but it may increase distortion of the signal shape, and the window length is proportional to the frequency resolution and inversely proportional to the time resolution; (ii) the signal overlap is proportional to the frame rate, but it is also proportional to the correlation of subsequent frames. Here w(n) designates the window function. Common window functions used in FIR filter design for speech are given below, in their standard forms (0 ≤ n ≤ N-1, w(n) = 0 elsewhere):

(i) Rectangular window:
w(n) = 1    (20)

(ii) Triangular window:
w(n) = 1 - \left|\frac{2n - N + 1}{N + 1}\right|    (21)

(iii) Hanning window:
w(n) = 0.5 - 0.5\cos\left(\frac{2\pi n}{N-1}\right)    (22)

(iv) Hamming window:
w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)    (23)

(v) Bartlett window:
w(n) = 1 - \left|\frac{2n - N + 1}{N - 1}\right|    (24)
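As an illustration, the Hamming window of equation (23) applied to a matrix of frames such as the one produced in the framing sketch above (NumPy's np.hamming implements exactly this 0.54/0.46 form):

    import numpy as np

    def apply_hamming(frames):
        # Multiply every frame (row) by a Hamming window of the frame length
        N = frames.shape[1]
        return frames * np.hamming(N)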
III CONCLUSION

This study of preprocessing has been carried out towards the development of a speech-recognition-based human-computer interaction system. Such a system can be used in various applications for disabled persons who are unable to operate a computer through keyboard and mouse: with an automatic speech recognition system they can operate the computer through speech commands. An additional advantage for human-computer interaction is that a disabled person using such a system feels that he or she is working in a real environment, doing what they intend to do. The application is also useful for computer users who are not comfortable with English or other available international languages but prefer to work in their native language, such as Hausa.

REFERENCES

[AR76] B. Atal, L. Rabiner - A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, pp. 201-212, 1976.


[BR04] C. Becchetti, L. Ricotti - Speech Recognition: Theory and C++ Implementation, John Wiley & Sons, Wiley Student Edition, Singapore, pp. 121-188, 2004.

[BVN12] S. Bhupinder, R. Vanita, M. Namisha - Preprocessing in ASR for Computer Machine Interaction with Humans: A Review, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 2, pp. 396-399, 2012.

[CHL89] D. G. Childers, M. Hand, M. J. Larar - Silent and Voiced/Unvoiced/Mixed Excitation (Four-Way) Classification of Speech, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 11, pp. 1771-1774, 1989.

[D+00] J. R. Deller, J. H. L. Hansen, J. G. Proakis - Discrete-Time Processing of Speech Signals, IEEE Press, ISBN 0-7803-5386-2, 2000.

[HC04] T. Hwang, S. Chang - Energy Contour Enhancement for Noisy Speech Recognition, International Symposium on Chinese Spoken Language Processing, Vol. 1, pp. 249-252, 2004.

[JMR94] J.-C. Junqua, B. Mak, B. Reaves - A robust algorithm for word boundary detection in the presence of noise, IEEE Transactions on Speech and Audio Processing, Vol. 2, pp. 406-412, 1994.

[Kul84] K. K. Paliwal - Effect of Pre-emphasis on Vowel Recognition Performance, Speech Communication, Vol. 3, pp. 101-106, North-Holland, 1984.

[L+81] L. Lamel, L. Rabiner, A. Rosenberg, J. Wilpon - An improved endpoint detector for isolated word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 29, pp. 777-785, 1981.

[MJR92] B. Mak, J.-C. Junqua, B. Reaves - A robust speech/non-speech detection algorithm using time and frequency-based features, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. I, pp. 269-272, 1992.

[Pic93] J. Picone - Signal modeling techniques in speech recognition, Proceedings of the IEEE, Vol. 81, Issue 9, pp. 1215-1247, 1993.

[RS78] L. R. Rabiner, R. W. Schafer - Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, New Jersey, ISBN-13: 9780132136037, 1978.

[Q+07] L. Qi, Z. Jinsong, T. Augustine, Z. Qiru - Robust Endpoint Detection and Energy Normalization for Real-Time Speech Recognition and Speaker Recognition, IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 3, pp. 146-157, 2007.

[SS11] B. Singh, P. Singh - Voice Based User Machine Interface for Punjabi using Hidden Markov Model, IJCST, Vol. 2, Issue 3, pp. 222-224, 2011.

[TA14] T. Giannakopoulos, A. Pikrakis - Introduction to Audio Analysis: A MATLAB Approach, Elsevier Academic Press, USA, pp. 77-110, 2014.

[ZB15] G. Zhuo, W. D. Bian-Ba - A Study of Tibetan Speech Pitch Detection Algorithm Based on Matlab, Modern Electronics Technique, No. 10, pp. 20-22, 2015.

[ZM04] Z.-N. Li, M. S. Drew - Fundamentals of Multimedia, Pearson Prentice Hall, USA, pp. 130-140, 2004.
