


ISSN: 2413-6999
Journal of Information, Communication, and Intelligence Systems (JICIS)
Volume 4, Issue 5, December 2018

Frame Blocking and Windowing Speech Signal

Oday Kamil Hamid

Abstract— The key objective of this research is frame blocking and windowing. A speech signal is a slowly time-varying signal in the sense that, when examined over a short period of time (between 10 to 30 ms), its characteristics are short-time stationary. This is not the case if we look at a speech signal under a longer time perspective (approximately time T > 0.5 s); in this case the signal's characteristics are non-stationary, meaning that they change to reflect the different sounds spoken by the talker. For this reason we use frame blocking and windowing to be able to use a speech signal and interpret its characteristics in a proper manner. In this project the speech signal is blocked into frames of N samples with adjacent frames being separated by M (M < N), where N = 256 samples correspond to (≈23 ms), M (overlapping) = 50% (128 samples, 11.37 ms), and the signal is sampled at 11.25 kHz; the Hamming window is then used because it is the most widely used window in speech processing.
The proposed speaker recognition systems are examined through theoretical analysis and computer simulation using the Matlab version 6 programming language and Sound Forge 5 as a speech analyzer under the Microsoft Windows 2007 operating system.

Oday K. Hamid, Dept. of Computer Techniques Engineering, Dijlah University College, Baghdad, Iraq (e-mail: [email protected]).

I. Introduction

Speech recognition is a topic that is very useful in many applications and environments in our daily life. Generally, a speech recognizer is a machine which understands humans and their spoken words in some way and can act thereafter. It can be used, for example, in a car environment to voice-control non-critical operations, such as dialing a phone number; another possible scenario is on-board navigation, presenting the driving route to the driver. By applying voice control, traffic safety will be increased.
A different aspect of speech recognition is to make daily chores easier for people with functional disabilities or other kinds of handicap; here voice control could be helpful. With their voice they could operate the light switch, turn the coffee machine on or off, or operate some other domestic appliances. This leads to the discussion about intelligent homes, where these operations can be made available for the common man as well as for the handicapped [1].
With the information presented so far, one question comes naturally: how is speech recognition done? To get an idea of how speech recognition problems can be approached today, a review of some research highlights will be presented. The earliest attempts to devise systems for automatic speech recognition by machine were made in the 1950's, when various researchers tried to exploit the fundamental ideas of acoustic-phonetics. In 1952, at Bell Laboratories, Davis, Biddulph, and Balashek built a system for isolated digit recognition for a single speaker; the system relied heavily on measuring spectral resonances during the vowel region of each digit. In 1959 another attempt was made by Forgie and Forgie at MIT Lincoln Laboratories: ten vowels embedded in a /b/-vowel-/t/ format were recognized in a speaker-independent manner. In the 1970's speech recognition research achieved a number of significant milestones: first, the area of isolated word or discrete utterance recognition became a viable and usable technology based on the fundamental studies by Velichko and Zagoruyko in Russia, Sakoe and Chiba in Japan, and Itakura in the United States. The Russian studies helped advance the use of pattern recognition ideas in speech recognition, the Japanese research showed how dynamic programming methods could be successfully applied, and Itakura's research showed how the idea of linear predictive coding (LPC) could be applied.
The purpose of this research is to get a deeper theoretical and practical understanding of speech recognition. The work started by cutting the speech data signal into frames before analysis; the frame size is 10-30 ms and frames can be overlapped, with the overlapping region normally ranging from 0 to 50% of the frame size, and Matlab is then used to process the speech signal. In the future it could be possible to use this information to create a chip that could be used as a new interface to humans; for example, it would be desirable to get rid of all remote controls in the home and just tell the TV, stereo, or any desired device what to do with the voice [2].

II. Theory

Framing

Decompose the speech signal into a series of overlapping frames:
– Traditional methods for spectral evaluation are reliable in the case of a stationary signal (i.e., a signal whose statistical characteristics are invariant with respect to time)


• This implies that the region must be short enough for the behavior (periodicity or noise-like appearance) of the signal to be approximately constant
• In a sense, the speech region has to be short enough so that it can reasonably be assumed to be stationary
• Stationary in that region: i.e., the signal characteristics (whether periodicity or noise-like appearance) are uniform in that region
Frame duration ranges are between 10 ~ 25 ms in the case of speech processing [3].

Frame blocking and Windowing

Due to the differences in phonemes' spectral features, changes in prosody, and random variations in the vocal tract, speech is a non-stationary signal. However, in a short time interval (generally from 10 to 20 ms) it is assumed that the speech signal is stationary, and therefore it is analyzed over these short-time windows. So the frame blocking procedure consists essentially of dividing the speech signal into short frames of N samples, which overlap adjacent frames by M samples.
In order to minimize spectral distortions when blocking the speech signal, each frame is multiplied with a Hamming window of the form

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)),  0 ≤ n ≤ N − 1

where N is the duration (in samples) of the speech frame. The output y(n) of the windowed signal becomes:

y(n) = x(n) w(n),  0 ≤ n ≤ N − 1

❧ This windowing function acts as a low pass filter, enhancing the signal at the window center and smoothing it at the edges.
❧ The frame size (N samples) is chosen with adjacent frames separated by m samples.
❧ e.g., for an 11.5 kHz sampling signal, an 8 ms window has N = 256 samples, with a neighboring shift of m = 128 samples [4].

Time frame and overlap
❧ Since our ear cannot respond to very fast changes of speech data content, we normally cut the speech data into frames before analysis
❧ Frame size is 10~30 ms
❧ Frames can be overlapped: normally the overlapping region ranges from 0 to 75% of the frame size

Figure (1) (sampled speech signal)
Figure (2) (frame length of 256 samples and overlap of 128 samples) [5].

Frame shifting
It is normal to use overlapping windows to ensure better temporal continuity in the transform domain. An overlap of half the window size (or less) is typical.
• Frame rate: the number of frames computed per second, in general 33 to 100 frames per second in short-term speech processing [6].
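As an illustration of the frame blocking and shifting described above, the following minimal NumPy sketch cuts a signal into frames of N = 256 samples with a shift of M = 128 samples and reports the frame duration, shift, and frame rate for an assumed 11.25 kHz sampling rate. The paper's own processing was done in Matlab; the random signal here merely stands in for recorded speech.

```python
import numpy as np

def frame_signal(x, N=256, M=128):
    """Cut signal x into overlapping frames of N samples, shifted by M samples."""
    num_frames = 1 + (len(x) - N) // M      # only frames that fit entirely inside x
    return np.stack([x[l * M : l * M + N] for l in range(num_frames)])

fs = 11250                                  # assumed sampling rate, 11.25 kHz as used in this paper
x = np.random.randn(2 * fs)                 # stand-in for 2 s of recorded speech
frames = frame_signal(x)

print(frames.shape)                              # (number of frames, 256)
print("frame length:", 1000 * 256 / fs, "ms")    # ~22.8 ms, i.e. the ~23 ms quoted above
print("frame shift :", 1000 * 128 / fs, "ms")    # ~11.4 ms
print("frame rate  :", fs / 128, "frames/s")     # ~88 frames/s, inside the 33-100 range above
```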

88
ISSN: 2413-6999
Journal of Information, Communication, and Intelligence Systems (JICIS)
Volume 4, Issue 5, December 2018

Framing and windowing–short-term processing


A frame-based analysis is essential for speech signals, as shown in figure (3).

Function of window
– Rectangular window:
• h[n] = 1, 0 ≤ n ≤ L−1, and 0 otherwise
– Hamming window (raised cosine window):
• h[n] = 0.54 − 0.46 cos(2πn/(L−1)), 0 ≤ n ≤ L−1, and 0 otherwise
– The rectangular window gives equal weight to all L samples in the window (n, ..., n−L+1)
– The Hamming window gives most weight to the middle samples and tapers off strongly at the beginning and the end of the window [7].

Figure (3) (frame analysis)
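A small numerical check of the two window definitions above (an illustrative Python fragment, not code from the paper): the rectangular window weights every sample equally, while the Hamming window weight falls from roughly 1.0 at the centre to roughly 0.08 at the edges.

```python
import numpy as np

L = 256                      # window length in samples
n = np.arange(L)

h_rect = np.ones(L)                                      # rectangular: equal weight everywhere
h_hamm = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))   # Hamming, exactly as defined above
                                                         # (np.hamming(L) gives the same coefficients)
print(h_rect[0], h_rect[L // 2], h_rect[-1])   # 1.0, 1.0, 1.0
print(h_hamm[0], h_hamm[L // 2], h_hamm[-1])   # ~0.08, ~1.0, ~0.08: tapered ends, full middle
```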

Windowing and the types of window


Since speech is non-stationary, we are interested in short-term estimates of parameters such as the Fourier spectrum. This requires that a speech segment be chosen for analysis. We are effectively cross-multiplying the signal by a window function.

Rectangular window

w(n) = 1,  0 ≤ n ≤ N − 1
w(n) = 0,  otherwise

• Just extracts the frame part of the signal without further processing.
• Its frequency response has high side lobes:
– Main lobe: spreads the narrow-band power of the signal out over a wider frequency range, and thus reduces the local frequency resolution
– Side lobes: swap energy from different and distant frequencies of xm[n], which is called leakage
However, it is desirable to use a tapered window such as the:

Hamming window

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)),  0 ≤ n ≤ N − 1
w(n) = 0,  otherwise

Windows in STFT

For Xn(e^jω) to represent the short-time spectral properties of x(n) inside the window, the window transform W(e^jω) should be much narrower in frequency than the significant spectral regions of X(e^jω), i.e., almost an impulse in frequency. Consider rectangular and Hamming windows, where the width of the main spectral lobe is inversely proportional to the window length and the side-lobe levels are essentially independent of the window length.
• Rectangular Window: flat window of length N samples; the first zero in the frequency response occurs at FS/N, with side-lobe levels of −14 dB or lower.
• Hamming Window: raised cosine window of length N samples; the first zero in the frequency response occurs at 2FS/N, with side-lobe levels of −40 dB or lower, as shown in figure (4) below:

Figure (4) Frequency response

• 500-sample windows (50 msec)
• periodicity can be seen in time and in frequency
• a strong first formant (300-400 Hz), a strong resonance at 2200 Hz, and a resonance at 3800 Hz can be seen, as shown in figure (5) [8].
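The lobe structure quoted above can be verified numerically. The sketch below (an illustrative NumPy estimate, not code from the paper) locates the first spectral zero of each window, in units of FS/N, and its peak side-lobe level from a zero-padded FFT of the window coefficients.

```python
import numpy as np

def window_response(w, nfft=8192):
    """First spectral zero (in units of FS/N) and peak side-lobe level (dB) of a window."""
    N = len(w)
    W = np.abs(np.fft.rfft(w, nfft))
    W_db = 20 * np.log10(W / W.max() + 1e-12)
    first_null = int(np.argmax(np.diff(W_db) > 0))   # bin just past the first zero of the response
    return first_null * N / nfft, W_db[first_null:].max()

N = 256
rect = np.ones(N)
hamm = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))

print(window_response(rect))   # first zero near 1.0 * FS/N, peak side lobe roughly -13 dB
print(window_response(hamm))   # first zero near 2.0 * FS/N, peak side lobes around -41 dB or lower
```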


Figure (5) Frequency response

Sampling rate in frequency

Figure (6) sampling theorem

"Total" Sampling Rate of STFT

• The "total" sampling rate for the STFT is the product of the sampling rates in time and frequency, i.e.,
SR = SR(time) × SR(frequency) = 2B × L samples/sec
where B is the frequency bandwidth of the window (Hz) and L is the time width of the window (samples).
• For most windows of interest, B is a multiple of FS/L, i.e., B = C·FS/L (Hz), with C = 1 for the rectangular window and C = 2 for the Hamming window, so that
SR = 2C·FS samples/second.
• We can define an 'oversampling rate' of
SR/FS = 2C = oversampling rate of the STFT as compared to the conventional sampling representation of x(n).
For the rectangular window 2C = 2; for the Hamming window 2C = 4, so the range of oversampling is 2-4. This oversampling gives a very flexible representation of the speech signal.
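The arithmetic above can be written out directly; the short sketch below (illustrative only) evaluates SR = 2B·L = 2C·FS for both windows, using the 11.25 kHz sampling rate assumed elsewhere in this paper.

```python
FS = 11250                      # waveform sampling rate (Hz), as assumed elsewhere in this paper
L = 256                         # time width of the analysis window (samples)

for name, C in [("rectangular", 1), ("hamming", 2)]:
    B = C * FS / L              # frequency bandwidth of the window (Hz), B = C*FS/L
    SR = 2 * B * L              # total STFT sampling rate, SR = SR(time) x SR(frequency)
    print(name, "B =", B, "Hz,  SR =", SR, "samples/s,  oversampling =", SR / FS)
# rectangular: SR = 2*FS = 22500 (oversampling 2); hamming: SR = 4*FS = 45000 (oversampling 4)
```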

Short-term Energy

The long-term definition of signal energy is

E = Σ x²(m), with the sum taken over all samples m (from −∞ to ∞).

There is little or no utility in this definition for time-varying signals, such as speech. For a short-term speech signal (the n-th frame of speech after framing and windowing):


Xn(m) = x(m) w(n − m),  n − N + 1 ≤ m ≤ n

where w(n) is the window, n is the sample on which the analysis window is centered, and N is the window size. The window jumps/slides across the sequence of squared values, selecting the interval for processing, as shown below in figure (7) [9].

Figure (7) Original sample and windowed sample
III. Database

Any speech or speaker recognition system depends on the type of data input to the system, so some elements must be available in order to obtain good data:
• High-quality microphones used in recording for both training and testing sessions.
• Ideal recordings must be made in rooms with little or no background noise or reverberation for both training and testing sessions.
• Collect a large database over many attempts.
• Use modern recording programs, because they have the capability of cutting or synthesizing voices and give a good representation of the signal's shape.
The database had been recorded by using a high-quality microphone to record the voice (الخير); this word was recorded by using the Sound Forge program, and figure (3.1) shows the recorded signal displayed on this program's screen.
Data is sampled at 11.25 kHz (sampling rate), with 16-bit sample value A/D conversion, but unfortunately we did not have the chance of recording in a place suitable for such a purpose, and the amount of database collected from each speaker is considered small for the suggested systems.
Preprocessing speech signal

The basic idea of the preprocessing is not to use the high-dimensional, redundant speech signal for the recognition procedure, but to describe it by a low-dimensional set of features that are typical mainly of the speaker's identity. The speech signal coming from the microphone passes through these steps:
1. Reading the database (converting it from an analog to a digital signal x(n)).

Frame blocking

Since speech is time-varying, in that the vocal-tract configuration changes over time, an accurate set of predictor coefficients is adaptively determined over short time frames (typically 10 ms to 30 ms) during which time-invariance is assumed [31]. So for this reason the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). In this research N = 256 samples correspond to (~23 msec) and M (overlapping) = 50% (128 samples, 11.37 msec). The first frame consists of the first N samples; the second frame begins M samples after the first frame and overlaps it by N − M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames. If we denote the l-th frame of speech by x_l(n), and there are L frames within the speech signal, then x_l(n) = x(Ml + n), n = 0, 1, ..., N−1, l = 0, 1, ..., L−1. That is, the first frame of speech, x_0(n), encompasses speech samples x(0), x(1), ..., x(N−1); the second frame of speech, x_1(n), encompasses samples x(M), x(M+1), ..., x(M+N−1); and the L-th frame of speech, x_{L−1}(n), encompasses speech samples x(M(L−1)), x(M(L−1)+1), ..., x(M(L−1)+N−1), as shown in figure (8).

Figure (8): Blocking of speech into overlapping frames.
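The indexing just described can be checked with a few lines of NumPy (an illustrative sketch, not the paper's Matlab code); it builds x_l(n) = x(Ml + n) for N = 256 and M = 128 and confirms that adjacent frames share N − M samples.

```python
import numpy as np

N, M = 256, 128                     # ~23 ms frames with a ~11.37 ms shift at 11.25 kHz
x = np.random.randn(4000)           # stand-in for the recorded utterance
L = 1 + (len(x) - N) // M           # number of complete frames in the signal

# x_l(n) = x(M*l + n), n = 0..N-1, l = 0..L-1, exactly as described above
frames = np.array([x[M * l : M * l + N] for l in range(L)])

# adjacent frames overlap by N - M samples: the tail of frame l is the head of frame l+1
assert np.array_equal(frames[0, M:], frames[1, :N - M])
print(frames.shape)                 # (L, 256)
```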


Windowing

The next step in the processing is to window each individual frame. The most widely used windows in speech processing are the rectangular window, which weights all samples in the analysis frame equally, and the Hamming window, which is used to taper the segment because the prediction residual must be minimized at the beginning and end of the frame.
There is an important difference between the rectangular window and the Hamming window: the bandwidth of the Hamming window is about twice the bandwidth of a rectangular window of the same length, as shown in figures (14) and (15). It is also clear that the Hamming window gives much greater attenuation outside the pass band than the comparable rectangular window.

• Hamming Window
This window was used in this research (see figure (9)) and is defined as:

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)),  0 ≤ n ≤ N − 1
w(n) = 0,  otherwise

where n is the sample index and N is the length of the window in samples [10].

Figure (9): Hamming window

Windows are used in signal analysis so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame [6].
In this research a typical Hamming window was used as in equation (3-2):

x̃_l(n) = x_l(n) w(n),  0 ≤ n ≤ N − 1,  with N = 256.     (3-2)

IV. Evaluation test for the proposed method

1- Convert the analog signal to a digital signal x(n), as shown in figure (10) below.

Figure (10) sampled speech signal

2- Then the original sampled speech signal was cut into frames, each frame having 256 samples as discussed before; figure (11) shows frame number 25.

Figure (11) frame number 25

3- Apply the Hamming window to each frame, as shown in figure (12). Figure (13) shows both the frame and the windowed frame; observe how the Hamming window tapers the beginning and end of the frame.

Figure (12) windowed frame


Figure (13) frame and windowed frame
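Steps 1-3 can be sketched end to end as follows. This is an illustrative Python fragment, not the paper's Matlab/Sound Forge procedure; the file name "alkhair.wav" is hypothetical and merely stands for the recorded word, assumed to be a mono 16-bit recording at 11.25 kHz.

```python
import numpy as np
from scipy.io import wavfile

# "alkhair.wav" is a hypothetical file name standing in for the recorded word;
# the recording is assumed to be mono 16-bit PCM sampled at 11.25 kHz, as stated above.
fs, x = wavfile.read("alkhair.wav")                 # step 1: the digitized signal x(n)
x = x.astype(np.float64) / 32768.0                  # scale int16 samples to roughly [-1, 1)

N, M = 256, 128
w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))

l = 25                                              # frame number 25, as in figure (11)
frame = x[l * M : l * M + N]                        # step 2: cut frame l out of the signal
windowed = frame * w                                # step 3: apply the Hamming window

# the windowed frame is tapered toward zero at both ends (cf. figures (12) and (13))
print(np.abs(windowed[:5]).max(), np.abs(windowed[N // 2 - 5 : N // 2 + 5]).max())
```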

The Hamming window is used instead of the rectangular window because of the assumption of periodicity made by the DFT; the reason becomes clearer in the frequency domain.
A window (on its own) tends to have an averaging effect; thus it has a low-pass spectral characteristic. Ideally, we want:
• to preserve spectral detail
• to produce little spectral distortion
The log magnitude spectrum of a rectangular window can be compared with that of a Hamming window:
• The Hamming window has a wider main lobe, but much better attenuation of side lobes (typically 20-30 dB better than rectangular).
For a designed window, we wish for:
- a narrow-bandwidth main lobe
- large attenuation in the magnitudes of the side lobes
However, this is a trade-off!
Notice that:
1. A narrow main lobe will resolve the sharp details of the speech signal (the frequency response of the framed signal) as the convolution proceeds in the frequency domain.
2. The attenuated side lobes prevent noise from other parts of the spectrum from corrupting the true spectrum at a given frequency.
3. The bandwidth of the Hamming window is about twice the bandwidth of a rectangular window of the same length, as shown in figures (14) and (15).

Figure (14): frequency response with N=51, for hamming window

Figure (15): frequency response with N=51, for rectangular window
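Figures (14) and (15) can be approximated with the short matplotlib sketch below (illustrative only, assuming N = 51 as in the figures); it overlays the log-magnitude responses of the two windows, showing the wider Hamming main lobe and its much lower side lobes.

```python
import numpy as np
import matplotlib.pyplot as plt

N = 51                                              # window length used in figures (14) and (15)
n = np.arange(N)
windows = {
    "Hamming (cf. figure 14)": 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1)),
    "rectangular (cf. figure 15)": np.ones(N),
}

nfft = 4096
for name, w in windows.items():
    W = np.abs(np.fft.rfft(w, nfft))
    plt.plot(np.linspace(0, 0.5, len(W)), 20 * np.log10(W / W.max() + 1e-12), label=name)

plt.xlabel("normalized frequency (cycles/sample)")
plt.ylabel("log magnitude (dB)")
plt.legend()
plt.show()
# the Hamming main lobe is roughly twice as wide, while its side lobes sit 20-30 dB lower
```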


REFERENCES

1. J. P. Hosom, R. Cole and M. Fanty, "Speech Recognition Using Neural Networks", Center for Spoken Language Understanding (CSLU), Oregon Graduate Institute of Science and Technology, July 6, 1999, http://cslu.cse.ogi.edu/corpora/available/.
2. B. S. Atal, "Automatic Recognition of Speakers From Their Voices", Proceedings of the IEEE, Vol. 64, pp. 460-475, April 1976.
3. B. S. Atal, "Automatic Speaker Recognition Based Upon Pitch Contours", Journal of the Acoustical Society of America (JASA), Vol. 52, pp. 1687-1697, 1972.
4. L. R. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, New Jersey, 1993.
5. M. N. AL-Trfi, "Speaker Recognition Based Upon Phonemes Using Wavelet Packet Transform", M.Sc. Thesis, College of Engineering, University of Baghdad, 2000.
6. M. N. Do, "An Automatic Speaker Recognition System", Swiss Federal Institute of Technology, Lausanne (EPFL), http://lcavwww.epfl.ch/~mindho/asr_project/.html, January 2000.
7. R. D. Rodman, "Speaker Recognition of Disguised Voices", www.csc.Ncsu.edu/factly/rodman/, 1997, and references therein.
8. A. H. Al-Nakkash, "A Novel Approach For Speakers Recognition Using Vector Quantization Technique", M.Sc. Thesis, University of Technology, Baghdad, 2001.
9. H. Fenglie and W. Bingxi, "An Integrated System for Text-Independent Speaker Recognition Using Binary Neural Network Classifiers", Proceedings of ICASSP 2001.
10. K. G. Margaritis, "Development of a Text-Dependent Speaker Identification System with the OGI Toolkit", 2nd Hellenic Conf. on AI, SETN-2002, 11-12 April 2002, Thessaloniki, Greece, Proceedings, Companion Volume, pp. 525-530.


