Preprocessing Signal
Anale. Seria Informatică. Vol. XV fasc. 1 – 2017
Annals. Computer Science Series. 15th Tome 1st Fasc. – 2017
ABSTRACT: Automatic Speech Recognition has found its application in various aspects of our daily lives, such as automatic phone answering services, dictating text and issuing voice commands to computers. Speech recognition is one of the fastest developing fields in the framework of speech science and engineering, and in computing technology it comes as the next major innovation in human computer interaction. In speech signal processing, pre-processing of speech plays a vital role in the development of an efficient automatic speech recognition system. Nowadays, humans are able to interact with computer hardware and other machines through human language. In view of the above, researchers are putting effort into developing a perfect and efficient speech recognition system, but machines are still unable to match human performance in terms of accuracy of matching and speed of response. The choice of signal preprocessing therefore depends on the application and on the drawbacks of the available ASR techniques. Hence, the preprocessing steps discussed in this study include: noise removal, voice activity detection, pre-emphasis, framing and windowing.

KEYWORDS: Automatic Speech Recognition (ASR), Human Computer Interaction (HCI), Pre-processing.

I INTRODUCTION

Speech is the most natural form of human-to-human communication and is related to human physiological capability. It is the most important, effective and convenient form of information exchange. Speech processing is a complete subject and a popular research field, which involves a wide range of content ([ZB15]). In an Automatic Speech Recognition system, the first phase is the pre-processing phase. Pre-processing of speech is particularly important in applications where silence or ambient noise is completely undesirable. Voice activity detection is a well known technique, adopted for many years in the preprocessing of the speech signal; noise canceling, pre-emphasis and dimensionality reduction of speech make the system computationally more efficient. This classification of speech into voiced or silence/unvoiced ([D+00]) sounds also finds other applications, mainly in fundamental frequency estimation, formant extraction or syllable marking, stop consonant identification, and endpoint detection for isolated utterances. There are several ways of classifying (labeling) events in speech. It is an accepted convention to use a three-state representation in which the states are (i) silence (S), where no speech is produced; (ii) unvoiced (U), in which the vocal cords ([AR76]) are not vibrating, so the resulting speech waveform is aperiodic or random in nature; and (iii) voiced (V), in which the vocal cords are tensed and therefore vibrate periodically when air flows from the lungs, so the resulting waveform is quasi-periodic ([CHL89]).

II PREPROCESSING

In the development of an ASR system, preprocessing is the first of the phases in speech recognition; it differentiates the voiced or unvoiced signal and creates the feature vectors. Preprocessing adjusts or modifies the speech signal, x(n), so that it is more acceptable for feature extraction analysis. The major factor to consider in speech signal processing is whether the speech x(n) is corrupted by some background or ambient noise, d(n), for example as an additive disturbance

x(n) = s(n) + d(n)    (1)

where s(n) is the clean speech signal. In noise reduction, there are different methods that can be adopted to perform the task on a noisy speech signal. However, the two most frequently used noise reduction algorithms in speech recognition systems are spectral subtraction and adaptive noise cancellation ([D+00]).
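As an illustration of the additive-noise model of Eq. (1), a minimal magnitude spectral subtraction can be sketched as follows. This is only a sketch: the function name, the frame length, and the assumption that the first few frames are speech-free (so they can serve as the noise estimate) are illustrative choices of ours, not taken from [D+00].

```python
import numpy as np

def spectral_subtraction(x, noise_frames=5, frame_len=256):
    """Minimal magnitude spectral subtraction: estimate the noise
    magnitude spectrum from the first few frames (assumed to contain
    only noise d(n)) and subtract it from every frame, flooring the
    result at zero.  Any tail samples that do not fill a frame are
    dropped, so the output length is n_frames * frame_len."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise_mag = mag[:noise_frames].mean(axis=0)    # noise estimate
    clean_mag = np.maximum(mag - noise_mag, 0.0)   # subtract, floor at 0
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), axis=1)
    return clean.reshape(-1)
```

Because each bin's magnitude can only shrink, the enhanced signal never has more energy than the noisy input; practical implementations add refinements (overlap-add, spectral flooring, smoothing of the noise estimate) that are omitted here.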
(b) VOICE ACTIVITY DETECTION / SPEECH WORD DETECTION

Locating the endpoints of the speech in a signal is a major problem for the speech recognizer, and inaccurate endpoint detection will decrease its performance. In detecting endpoints, the short-term energy feature is commonly used. Let x_i(n), n = 1, ..., W_L, be the sequence of audio samples of the i-th frame, where W_L is the length of the frame. The short-term energy is computed according to the equation ([TA14]):

E(i) = (1/W_L) Σ_{n=1}^{W_L} |x_i(n)|²    (6)
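Eq. (6) translates directly into code; a minimal NumPy sketch (the frame length used in the check below is an arbitrary illustrative value):

```python
import numpy as np

def short_term_energy(x, frame_len):
    """Short-term energy per frame: E(i) = (1/WL) * sum |x_i(n)|^2.
    Tail samples that do not fill a whole frame are dropped."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)  # one energy value per frame
```

`np.mean` over each frame implements the 1/W_L normalization of Eq. (6) directly.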
Computing one energy value per frame reduces the sampling rate to the frame rate. For correct and effective endpoint detection, we need a good detector that can detect all available endpoints from the energy feature. Since the output of the detector may contain false acceptances, a decision module is then required to make final decisions based on the detection output. However, since endpoints always come with edges, the intention is to detect the edges first and thereafter find the corresponding endpoints ([RS78]).

(iii) ENERGY NORMALIZATION

At this stage, the aim is to normalize the speech energy E(l). The normalization is performed by finding the maximum energy value E_max over the spoken word,

E_max = max_l E(l)    (13)

and subtracting E_max from E(l) to give

Ê(l) = E(l) − E_max    (14)

In this way the peak energy value of each word is zero decibels, and the recognition system is relatively insensitive to differences in gain between different recordings. In performing the above calculations there is the constraint that word energy contour normalization cannot take place until the end of the word has been located ([Kul84]).

(c) PRE-EMPHASIS

A spoken audio signal may have frequency components that fall off at high frequencies. In some systems, such as speech coding, the high-frequency components are therefore compensated using pre-emphasis filtering so that they are not overlooked ([Pic93]). Pre-emphasis is aimed at compensating for lip radiation and the inherent attenuation of high frequencies in the sampling process: high frequency components are emphasized and low frequency components are attenuated. This is quite a standard preprocessing step. The digitized speech waveform has a high dynamic range and suffers from additive noise; to reduce this range, pre-emphasis is applied. By pre-emphasis, we imply the application of a high pass filter, which is usually a first-order FIR filter of the form ([Q+07]):

H(z) = 1 − α z⁻¹    (15)

Normally, a single-coefficient digital filter known as the pre-emphasis filter is used:

y(n) = x(n) − α x(n − 1)    (16)

where the pre-emphasis factor α is computed as

α = e^(−2πFT)    (17)

where F is the frequency above which the spectral slope increases by 6 dB/octave and T is the sampling period of the sound. The pre-emphasis factor is chosen as a trade-off between vowel and consonant discrimination capability ([SS11]).

The usual form for the pre-emphasis filter is a high-pass finite impulse response (FIR) filter with a single zero near the origin. It is intended to whiten the speech signal spectrum as well as to emphasize those frequencies at which the human auditory system is most sensitive. For the human ear, however, this is only suitable up to 3 to 4 kHz. Above this range the sensitivity of human hearing falls off, and there is relatively little linguistic information, so it is appropriate to adopt a second-order pre-emphasis filter, which causes the frequency response to roll off at higher frequencies. This becomes very important in the presence of noise. The pre-emphasizer is used to spectrally flatten the speech signal, usually with a high pass filter; the most frequently adopted filter for this phase is the FIR filter. Typically, the speech signal produced by a human being has a spectral slope of approximately −6 dB/octave for voiced sounds. The slope has two major causes: (a) the shape of the glottal pulse introduces a slope of −12 dB/octave, and (b) the lip radiation introduces a slope of +6 dB/octave. Therefore a resultant slope of approximately −6 dB/octave exists in recorded voiced speech sounds, and pre-emphasis is performed to remove it.

To accomplish the task, the speech signal is passed through a high-pass finite impulse response (FIR) filter of order 1. The pre-emphasis is defined by ([Kul84]):

y[n] = s[n] − P s[n − 1]    (18)

where s[n] is the n-th speech sample, y[n] is the corresponding pre-emphasized sample and P is the pre-emphasis factor, typically having a value between 0.9 and 1. Pre-emphasis ensures that in the frequency domain all the formants of the speech signal have similar amplitude, so that they get equal importance in subsequent processing stages ([D+00]). In the frequency domain, the filter looks like:

H(e^{jω}) = 1 − P e^{−jω}    (19)

(d) FRAMING

Framing is the process of breaking the continuous stream of speech samples into components of constant length to facilitate block-wise processing of the signal.
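The pre-emphasis filter of Eq. (18) and this frame-blocking step can be sketched together; a minimal NumPy illustration in which the factor 0.95, the 25 ms frame length, the 10 ms hop and the Hamming taper are typical values chosen for illustration, not values prescribed here:

```python
import numpy as np

def preemphasize(s, p=0.95):
    """y[n] = s[n] - p*s[n-1], with p in the typical 0.9..1 range."""
    y = np.empty_like(s, dtype=float)
    y[0] = s[0]                    # first sample has no predecessor
    y[1:] = s[1:] - p * s[:-1]
    return y

def frame_signal(y, fs, frame_ms=25, hop_ms=10):
    """Block the signal into overlapping fixed-length frames and
    apply a Hamming window to taper each frame's edges."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    # index matrix: row i selects samples i*hop .. i*hop + frame_len - 1
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return y[idx] * np.hamming(frame_len)
```

With a 25 ms frame and a 10 ms hop, adjacent frames overlap by 60%, in the same spirit as the 30-50% overlap discussed below.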
In the same vein, speech can be thought of as a quasi-stationary signal, stationary only over a short period of time ([BVN12]). The speech signal varies slowly over time, so that when it is examined over a short period (5-100 msec) it is fairly stationary. Therefore, speech signals are often analyzed in short-time components, which is sometimes referred to as short-time spectral analysis in speech processing. This simply means that the signal is divided or blocked into frames of typically 20-30 msec. Adjacent frames normally overlap each other by 30-50%; this is done in order not to lose any vital information of the speech signal due to the windowing.

(e) WINDOWING

At this stage the signal has been framed into segments, and each frame is multiplied by a window function w(n) with length N, where N is the length of the frame. Windowing is the process of multiplying a waveform of a speech signal segment by a time window of given shape, to stress pre-defined characteristics of the signal. To reduce the discontinuity of the speech signal at the beginning and end of each frame, the signal should be tapered to zero or close to zero, and hence the mismatch is minimized. This can be achieved by windowing each frame of the signal to increase the correlation of the Mel Frequency Cepstrum Coefficients (MFCC) and spectral estimates between consecutive frames ([BVN12]). ASR system designers have always had to settle a compromise in their selection of the analysis window. To obtain good frequency resolution, a long window is desirable, but the linguistic importance of some short transients makes a short window desirable and effective as well. The usual compromise is a frame length of about 20 or 30 ms, with a frame spacing of 5 to 10 ms. On the other hand, a shorter window is adequate to capture the salient spectral features, provided that the frame spacing is also sufficiently short. An eight (8) ms window with two (2) ms frame spacing is often adopted; when the feature curves are represented as described in the following subsections, the frequency resolution appears very similar to that obtained with the longer window. Windowing is performed on a speech signal to avoid problems due to truncation of the signal, as windowing helps in smoothing the signal ([ZM04]).

The proper choice of the window w(n) is a trade-off between different factors: (i) the shape of the window may reduce differences, but it may increase signal shape alteration, and the length is proportional to the frequency resolution and inversely proportional to the time resolution; (ii) the signal overlap is proportional to the frame rate, but it is also proportional to the correlation of subsequent frames.

Here w(n) designates the window function. Common window functions used in FIR filter design for speech are given below, for 0 ≤ n ≤ N − 1:

(i) Rectangular window:
w(n) = 1    (20)

(ii) Triangular window:
w(n) = 1 − |n − (N − 1)/2| / (N/2)    (21)

(iii) Hanning window:
w(n) = 0.5 − 0.5 cos(2πn / (N − 1))    (22)

(iv) Hamming window:
w(n) = 0.54 − 0.46 cos(2πn / (N − 1))    (23)

(v) Bartlett window:
w(n) = 1 − |n − (N − 1)/2| / ((N − 1)/2)    (24)

III CONCLUSION

The study of preprocessing has been carried out in order to develop a speech recognition based human computer interaction system. Such a system can be used in various applications for disabled persons who are unable to operate a computer through keyboard and mouse: with an automatic speech recognition system they can operate the computer with speech commands. An extra advantage of this kind of human computer interaction is that a disabled person using the system feels that he or she is working in a real environment, doing what they want to do. The application is also useful for computer users who are not comfortable with English or any of the other available international languages, but prefer to work in their native language, such as the Hausa language.

REFERENCES

[AR76] B. Atal, L. Rabiner - A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing [see also IEEE Transactions on Signal Processing], Vol. 24, pp. 201-212, 1976.