The choice of the window in short-time speech processing determines the nature of the
measurement representation. A long window yields a measurement that changes very little
over time, whereas a short window yields a measurement that is not sufficiently smooth. Two
representative windows are demonstrated, rectangular and Hamming. For the same length, the
Hamming window has almost twice the main-lobe bandwidth of the rectangular window, and its
attenuation outside the passband is much greater.
% Sampling frequency in Hz (for the data used in this demo, i.e. TIMIT database signals)
Fs = 16000;
subplot(2,1,1);
plot(linspace(0,0.5,ceil(fftLength/2)), 20*log10(magFWRect(1:ceil(fftLength/2))));
ylabel('dB');
legend('Rectangular Window');
subplot(2,1,2);
plot(linspace(0,0.5,ceil(fftLength/2)), 20*log10(magFWHamm(1:ceil(fftLength/2))));
ylabel('dB');
xlabel('Normalized Frequency');
legend('Hamming Window');
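The plotting code above assumes that the window magnitude spectra magFWRect and magFWHamm and the FFT length fftLength have already been computed. A minimal sketch of that computation is given below; the window length of 256 samples is an assumption, not a value taken from the demo.
winLen = 256; % assumed window length
fftLength = 4096; % zero-padded FFT for a smooth spectrum plot
wRect = rectwin(winLen);
wHamm = hamming(winLen);
magFWRect = abs(fft(wRect, fftLength));
magFWHamm = abs(fft(wHamm, fftLength));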
2.Short-Time Speech Measurements, Short-Time energy calculation
This measurement can, to some extent, distinguish between voiced and unvoiced speech segments,
since unvoiced speech has significantly smaller short-time energy. A practical choice for the
window length is 10-20 msec, i.e. 160-320 samples at a 16 kHz sampling frequency. This way the
window includes a suitable number of pitch periods, so that the result is neither too smooth nor
too detailed.
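The plotting code that follows only displays the waveform; the short-time energy itself (energyST and ienergyST in the later excerpts) is assumed to come from a helper routine of the demo. A minimal sketch of what such a routine might compute is shown below; the function name stEnergy and the hop parameter period are assumptions, not the demo's code (it would be saved, e.g., as stEnergy.m).
function energyST = stEnergy(signal, window, period)
% Short-time energy E_n = sum over m of (x(m) w(n-m))^2, evaluated every 'period' samples
winLen = length(window);
nFrames = floor((length(signal) - winLen)/period) + 1;
energyST = zeros(nFrames, 1);
for k = 1:nFrames
    seg = signal((k-1)*period + (1:winLen));
    energyST(k) = sum((seg(:) .* window(:)).^2);
end
end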
subplot(1,1,1);
plot(t, speechSignal);
title('speech: He took me by surprise');
xlims = get(gca,'Xlim');
hold on;
The effect of the window length is demonstrated. As the length increases, the short-time energy
becomes smoother, as expected. Note that the measurement is not taken for every sample: because
of the lowpass character of the window, the short-time energy is actually bandlimited to the
bandwidth of the window, which is much smaller than 16 kHz; for the lengths of interest here it is
less than 160 Hz. We can therefore calculate the energy every 50 samples, i.e. at a rate of 320 Hz
for a 16 kHz speech sampling frequency.
k = 0;
for iWinLen = winLens
    k = k+1;
    wHamm = hamming(iWinLen);
    % ... compute the short-time energy for this window length here ...
    % Display results
    subplot(nWindows, 1, k);
Compared with the Hamming window, the rectangular window has a smaller bandwidth for the same
length, so the results are expected to be smoother. In any case, as the window length increases
the short-time energy becomes less detailed.
k = 0;
for iWinLen = winLens
    k = k+1;
    wRect = rectwin(iWinLen);
    % Display results
    subplot(nWindows, 1, k);
    delay = (iWinLen - 1)/2;
    plot(t(delay+1:period:end - delay), ienergyST);
    if (k==1)
        title('Short-Time Energy for various Rectangular window lengths')
    end
    legend(['Window length:',num2str(iWinLen),' Samples']);
end
5.Short-Time Speech Measurements, Average Magnitude calculation
The average magnitude does not emphasize large signal levels as much as the short-time energy
does, since its calculation does not include squaring. The signals in the graphs are normalized.
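The code below calls the demo's STAmagnitude routine. A hypothetical sketch of what such a routine might compute, i.e. M_n = sum over m of |x(m)| w(n-m), is given here; the function name and the details are assumptions, and the demo's own implementation may differ.
function magAv = stAvgMagnitude(signal, window, overlap)
% Short-time average magnitude: like short-time energy, but with |.| instead of squaring
winLen = length(window);
hop = winLen - overlap;
nFrames = floor((length(signal) - winLen)/hop) + 1;
magAv = zeros(nFrames, 1);
for k = 1:nFrames
    seg = signal((k-1)*hop + (1:winLen));
    magAv(k) = sum(abs(seg(:)) .* window(:));
end
end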
winLen = 301;
winOverlap = 300;
wHamm = hamming(winLen);
magnitudeAv = STAmagnitude(speechSignal, wHamm, winOverlap);
subplot(1,1,1);
plot(t, speechSignal/max(abs(speechSignal)));
title('speech: He took me by surprise');
hold on;
delay = (winLen - 1)/2;
plot(t(delay+1:end-delay), energyST/max(energyST), 'g');
plot(t(delay+1:end-delay), magnitudeAv/max(magnitudeAv), 'r');
legend('Speech','Short-Time Energy','Average Magnitude');
xlabel('Time (sec)');
hold off;
6.Short-Time Speech Measurements, Average Zero-Crossing Rate
The window used is w(n) = 1/(2N) for 0 <= n <= N-1. This measure can help discriminate between
voiced and unvoiced regions of speech, or between speech and silence, since unvoiced speech
generally has a higher zero-crossing rate. The signals in the graphs are normalized.
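The code below uses the demo's STAzerocross routine. A hypothetical sketch of a zero-crossing-rate computation consistent with the definition above is given here; the function name and the details are assumptions.
function zcr = stZeroCross(signal, winLen, overlap)
% Z_n = (1/(2N)) * sum over m of |sgn(x(m)) - sgn(x(m-1))|, i.e. crossings per sample
hop = winLen - overlap;
nFrames = floor((length(signal) - winLen)/hop) + 1;
zcr = zeros(nFrames, 1);
for k = 1:nFrames
    seg = signal((k-1)*hop + (1:winLen));
    zcr(k) = sum(abs(diff(sign(seg)))) / (2*winLen);
end
end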
wRect = rectwin(winLen);
ZCR = STAzerocross(speechSignal, wRect, winOverlap);
subplot(1,1,1);
plot(t, speechSignal/max(abs(speechSignal)));
title('speech: He took me by surprise');
hold on;
delay = (winLen - 1)/2;
plot(t(delay+1:end-delay), ZCR/max(ZCR),'r');
xlabel('Time (sec)');
legend('Speech','Average Zero Crossing Rate');
hold off;
7.Short-Time Speech Measurements, Short-Time Autocorrelation, Varying Window Length
The short-time autocorrelation is simply the autocorrelation of a windowed speech segment; the
window used here is rectangular. Notice the attenuation of the autocorrelation as the window
length becomes shorter. This is expected, since the number of samples used in the calculation
decreases.
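The plots below assume that the pairs (ac1, lags1), (ac2, lags2) and (ac3, lags3) have been computed from the same voiced segment with rectangular windows of 606, 456 and 306 samples. A minimal sketch of that computation follows; the segment start index is an assumption.
seg = speechSignal(3000:3000+605); % assumed voiced region, 606 samples
seg = seg(:); % ensure column orientation
[ac1, lags1] = xcorr(seg .* rectwin(606));
[ac2, lags2] = xcorr(seg(1:456) .* rectwin(456));
[ac3, lags3] = xcorr(seg(1:306) .* rectwin(306));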
subplot(3,1,1);
plot(lags1, ac1);
legend('Window Length: 606 Samples')
title('Short-Time Autocorrelation Function')
grid on;
subplot(3,1,2);
plot(lags2, ac2);
xlim([lags1(1) lags1(end)]);
legend('Window Length: 456 Samples')
grid on;
subplot(3,1,3);
plot(lags3, ac3);
xlim([lags1(1) lags1(end)]);
legend('Window Length: 306 Samples')
xlabel('Lag in samples')
grid on;
8.Short-Time Speech Measurements, Short-Time Autocorrelation, Voiced and Unvoiced Speech
subplot(2,2,1);
plot(ss1);
legend('Voiced Speech')
subplot(2,2,2);
plot(lags1, ac1);
xlim([lags1(1) lags1(end)]);
legend('Autocorrelation of Voiced Speech')
grid on;
subplot(2,2,3);
plot(ss4);
legend('Unvoiced Speech')
subplot(2,2,4);
plot(lags4, ac4);
xlim([lags1(1) lags1(end)]);
legend('Autocorrelation of Unvoiced Speech')
grid on;
9.Spectrogram, FFT of speech
speechDemo_config_spec_params;
subplot(1,1,1);
% Next is shown a sample way to loop over selected phonemes and do some kind of processing after reading
% create a list of selected phonemes to process
selected_phonindx=[1:length(wavname)]; % in this list put all the phons
for index=1:length(selected_phonindx) % loop over the selected phons
    % ... read the phoneme waveform and process it here ...
end
% Window the signal with hamming and apply Fast Fourier Transform
HFSignal=fft(y.*hamming(length(y)),2^9);
subplot(3,1,1),
plot_speech(y,Fs,linewidth,titlestr,fontsize);
subplot(3,1,2),
plot_spectrum(FSignal,Fs,linewidth,'',fontsize,'b');
subplot(3,1,3),
plot_spectrum(HFSignal(1:ceil(length(HFSignal)/2)),Fs,linewidth,'',fontsize,'b');
10.Spectrogram, FFT of speech, Long window
Demonstration of the calculation of the spectrum using long windows, either Hamming or rectangular.
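The plotting call below assumes that the framed signals and their spectra (FrameSignal, FFrameSignal, HWinFrameSignal, HFFrameSignal) have already been computed. A minimal sketch for a long window, mirroring the short-window code of the next section, is given here; the 512-sample length (32 msec at 16 kHz) is an assumption.
winlen = 512; % assumed long window: 32 msec at 16 kHz
FrameSignal = buffer(y, winlen, round(winlen/3)); % rectangular-windowed frames
FFrameSignal = fft(FrameSignal, 1024);
FFrameSignal = FFrameSignal(1:ceil(size(FFrameSignal,1)/2),:);
HWinFrameSignal = FrameSignal .* repmat(hamming(winlen), 1, size(FrameSignal,2)); % Hamming-windowed frames
HFFrameSignal = fft(HWinFrameSignal, 1024);
HFFrameSignal = HFFrameSignal(1:ceil(size(HFFrameSignal,1)/2),:);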
% plot all frames sequentially, both time and abs spectra, for both rectangular and Hamming windowed signals
plot_frames_SigFreq_HF(FFrameSignal,HFFrameSignal,FrameSignal,HWinFrameSignal,Fs,linewidth,fontsize,1,2)
11.Spectrogram, FFT of speech, Short window
Demonstration of the calculation of the spectrum using short windows, either Hamming or
rectangular.
% rectangular window
FFrameSignal=[];
FrameSignal=[];
winlen=32; % 32 samples = 2 msec at 16 kHz
FrameSignal=buffer(y,winlen,round(winlen/3));
FFrameSignal = fft(FrameSignal,1024);
FFrameSignal=FFrameSignal(1:ceil(size(FFrameSignal,1)/2),:);
plot_frames_SigFreq_HF(FFrameSignal,HFFrameSignal,FrameSignal,HWinFrameSignal,Fs,linewidth,fontsize,1,0.30)
12.Spectrogram, A longer phrase
Frequency domain representation of a longer speech phrase. Firstly, the phrase is plotted.
% Raw wav read instead of wavread (NOTE: phrase wavs have a header, in contrast with the previous case, i.e. phonemes)
fid=fopen([fullfile(phrase_datapath,phrase),'.wav'],'r');
fread(fid,skiplen,'int'); % skip the header
y=fread(fid,inf,'short'); % read the samples
fclose(fid);
plot(str2num(line(1:wsepind(1)))/Fs*ones(1,10),linspace(min(y),max(y),10),'k')
end
fclose(fid);
14.Spectrogram, Narrowband or Wideband
Demonstration of the spectrogram, narrowband or wideband. Notice that the latter has better time
resolution (since it uses a short window in time) but worse frequency resolution. In the case of
the wideband spectrogram, the window has about the same duration as the pitch period; notice the
vertical striations due to this fact. For unvoiced speech there are no vertical striations. The
MATLAB function specgram is used.
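The code below takes its window, overlap and FFT-length parameters from the demo's configuration. Hypothetical values illustrating the narrowband/wideband distinction for Fs = 16 kHz (not the demo's actual settings) might be:
% Narrowband: long window (about 32 msec), fine frequency resolution
window_narrowband = hamming(512);
noverlap_narrowband = 480;
nfft_narrowband = 1024;
% Wideband: short window (about 4 msec, roughly one pitch period), fine time resolution
window_wideband = hamming(64);
noverlap_wideband = 48;
nfft_wideband = 1024;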
window = window_narrowband;
noverlap = noverlap_narrowband;
nfft = nfft_narrowband;
subplot(2,1,1);
specgram(y,nfft,Fs,window,noverlap);
xlabel('Time (sec)','fontsize',fontsize);
ylabel('Frequency (Hz)','fontsize',fontsize);
window = window_wideband;
noverlap = noverlap_wideband;
nfft = nfft_wideband;
subplot(2,1,2)
specgram(y,nfft,Fs,window,noverlap);
xlabel('Time (sec)','fontsize',fontsize);
ylabel('Frequency (Hz)','fontsize',fontsize);
15.Linear Prediction, Autocorrelation method
Linear predictive analysis of speech is demonstrated. The methods used are either the
autocorrelation method or the covariance method. The autocorrelation method assumes that the
signal is identically zero outside the analysis interval (0 <= m <= N-1). It then minimizes the
prediction error wherever it is nonzero, that is, over the interval 0 <= m <= N-1+p, where p is
the order of the model. The error is likely to be large at the beginning and at the end of this
interval, which is why the analyzed speech segment is usually tapered, for example by applying a
Hamming window. As for the window length, it has been shown that it should be on the order of
several pitch periods to ensure reliable results. One advantage of this method is that stability
of the resulting model is guaranteed. The error autocorrelation and spectrum are calculated as a
measure of the whiteness of the prediction error.
% Voiced sound
phons = readdata(wavpath, '_', 6, 'phonemes');
x = phons.aa{5}(200:890);
len_x = length(x);
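Before the display code, a minimal sketch of the autocorrelation-method analysis itself is given here, with variable names chosen to match the excerpt; the model order of 20 and the 1024-point FFT / 513-bin frequency grid are assumptions.
order = 20; % assumed: roughly one pole per kHz of Fs plus 4
wx = x(:) .* hamming(len_x); % taper the segment before analysis
[lpcoefs, e] = lpc(wx, order); % autocorrelation method (Levinson-Durbin)
% Predicted signal and prediction error over 0 <= m <= N-1+p
estx = filter([0 -lpcoefs(2:end)], 1, [wx; zeros(order,1)]);
er = [wx; zeros(order,1)] - estx;
% Model frequency response and speech spectrum on 513 bins up to Fs/2
H = freqz(1, lpcoefs, 513);
S = abs(fft(wx, 1024));
% Whiteness check: autocorrelation and spectrum of the prediction error
[acs, lags] = xcorr(er, 'coeff');
eS = abs(fft(er, 1024));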
% Display results
subplot(5,1,1);
plot([wx; zeros(order,1)],'g');
title('Phoneme /aa/ - Linear Predictive Analysis, Autocorrelation Method');
hold on;
plot(estx);
hold off;
xlim([0 length(er)])
legend('Speech Signal','Estimated Signal');
subplot(5,1,2);
plot(er);
xlim([0 length(er)])
legend('Error Signal');
subplot(5,1,3);
plot(linspace(0,0.5,513), 20*log10(abs(H)));
hold on;
plot(linspace(0,0.5,513), 20*log10(S(1:513)), 'g');
legend('Model Frequency Response','Speech Spectrum')
hold off;
subplot(5,1,4);
plot(lags, acs);
legend('Prediction Error Autocorrelation')
subplot(5,1,5);
plot(linspace(0,0.5,513), 20*log10(eS(1:513)));
legend('Prediction Error Spectrum')
16.Linear Prediction, Covariance method
Demonstration of the covariance method for linear prediction. Compared with the autocorrelation
method, the covariance method fixes the interval over which the mean-square prediction error is
minimized and does not assume the speech to be zero outside this interval. Stability of the
resulting model cannot be guaranteed, but for a sufficiently long analysis interval the predictor
coefficients are usually stable. The error autocorrelation and spectrum are calculated as a
measure of the whiteness of the prediction error.
x = phons.aa{5}(200-order:890);
Fs = 16000;
len_x = length(x);
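Given the segment x defined above, which includes 'order' extra samples before the analysis interval, a hypothetical sketch of the covariance computation is given here; it solves the covariance normal equations directly and is not the demo's implementation (the order of 20 is an assumption).
order = 20; % assumed model order
xc = x(:);
N = length(xc) - order; % length of the analysis interval
% Data matrix: row m contains x(order+m-1), x(order+m-2), ..., x(m)
X = zeros(N, order);
for i = 1:order
    X(:,i) = xc(order+1-i : order+N-i);
end
a = (X'*X) \ (X'*xc(order+1:order+N)); % predictor coefficients
lpcoefs = [1; -a]; % A(z) = 1 - sum over k of a_k z^(-k)
er = xc(order+1:end) - X*a; % prediction error over the fixed interval
estx = filter([0; a], 1, xc); % predicted signal from past samples
H = freqz(1, lpcoefs, 513); % model frequency response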
subplot(5,1,1);
plot(x, 'g');
title('Phoneme /aa/ - Linear Predictive Analysis, Covariance Method');
hold on;
plot(estx)
xlim([0 length(x)])
legend('Speech Signal','Estimated Signal');
hold off
subplot(5,1,2);
plot(order+1:length(x), er);
xlim([0 length(x)])
legend('Error Signal');
subplot(5,1,3);
plot(linspace(0,0.5,513), 20*log10(abs(H)));
hold on;
plot(linspace(0,0.5,513), 20*log10(S(1:513)), 'g');
legend('Model Frequency Response','Speech Spectrum')
subplot(5,1,4);
plot(lags, acs);
legend('Prediction Error Autocorrelation')
subplot(5,1,5);
plot(linspace(0,0.5,513), 20*log10(eS(1:513)));
legend('Prediction Error Spectrum')
17.Linear Prediction, Prediction model order variation
Linear predictive analysis of speech is demonstrated for various model orders. Notice how the
model frequency response becomes more detailed and more closely resembles the speech spectrum as
the model order increases. The prediction error steadily decreases as the order increases; beyond
a certain order, however, further increases bring only minor reductions of the prediction error.
The choice of the order depends mainly on the sampling frequency and is essentially independent
of the LPC method used. Usually the model is given one pole for each kHz of the speech sampling
frequency, to account for the vocal tract contribution, plus 3-4 poles to represent the source
excitation spectrum and the radiation load. So for 16 kHz speech a model order of about 20 is
usually suitable.
subplot(l_ord + 2,1,1);
plot(wx);
title('LP Analysis for various model orders')
subplot(l_ord + 2,1,2);
plot(linspace(0,0.5,513), 20*log10(S(1:513)), 'g');
for o=orders
    [lpcoefs, e] = lpc(wx, o);
    % Estimated signal
    estx = filter([0 -lpcoefs(2:end)], 1, [wx; zeros(o,1)]);
    % Prediction error
    er = [wx; zeros(o,1)] - estx;
    erEn = sum(er.^2);