CHAPTER 1
1.1 INTRODUCTION
Speech enhancement aims to improve the quality and intelligibility of speech in noisy environments. The problem has been widely discussed over the years, and many approaches have been proposed. Among them, spectral subtraction and Wiener filtering algorithms are widely used because of their low computational complexity. Such methods, however, leave a residual noise known as musical noise, which is quite annoying. In order to reduce the effect of musical noise, several solutions modify the subtraction rules so as to offer more flexibility, as in [2] and [3]. Others, such as the signal subspace approach proposed in [4], are techniques to improve the signal-to-noise ratio (SNR), but the problem of eliminating musical noise remains.
1.2 OVERVIEW OF PWF
Recently, perceptually motivated speech enhancement methods have attracted a great deal of interest. The objective is to improve the perceptual quality of the enhanced signal. In [3], a psychoacoustic model is used to control the parameters of the spectral subtraction in order to find the best trade-off between noise reduction and speech distortion. To make musical noise inaudible, the linear estimator proposed in [5] incorporates the masking properties of the human auditory system. In [6], the masking threshold and an intermediate signal, which is slightly denoised and free of musical noise, are used to detect the musical tones generated by the spectral subtraction, so that the enhancement can then attenuate the detected tones. These perceptual speech enhancement systems reduce the musical noise but introduce some undesired distortion into the enhanced speech signal. When this distorted speech estimate is applied to recognition systems, their performance degrades. A better strategy is to remove only the perceptually significant noise components from the noisy signal, so that the clean speech components are not affected by the processing. In addition, such a technique requires very little a priori information about the features of the noise. In the present work, the enhanced spectrum is shaped by a perceptually motivated filter that can be regarded as a weighting factor. The purpose is to minimize the perception of musical noise without degrading the clarity of the enhanced speech.
CHAPTER 2
WIENER FILTERING
Consider the block diagram of Fig. 2.1, built around a linear discrete-time filter. The filter input consists of a time series x(0), x(1), x(2), ..., and the filter is itself characterized by the impulse response w0, w1, w2, .... At some discrete time n, the filter produces an output denoted by y(n). This output is used to provide an estimate of a desired response denoted by d(n). With the filter input and the desired response given, the design problem is to make the estimation error e(n) "as small as possible" in some statistical sense. Two restrictions have so far been placed on the filter:
1. The filter is linear, which makes the mathematical analysis easy to handle.
2. The filter operates in discrete time, which makes it possible for the filter to be implemented using digital hardware or software.
Fig. 2.1 Block diagram representation of the statistical filtering problem
The final details of the filter specification, however, depend on two other choices that have to be made:
1. Whether the impulse response of the filter has finite or infinite duration.
2. The type of statistical criterion used for the optimization.
The choice of a finite-duration impulse response (FIR) or an infinite-duration impulse response (IIR) for the filter is dictated by practical considerations.
For the initial development of the theory it is more general to assume an IIR filter; the theory so developed includes that for FIR filters as a special case. However, for much of the material presented in this tutorial, we will confine our attention to the use of FIR filters. We do so for the following reason. An FIR filter is inherently stable, because its structure involves the use of forward paths only. In other words, the only mechanism for input-output interaction in the filter is via forward paths from the filter input to its output. Indeed, it is this form of signal transmission through the filter that limits its impulse response to a finite duration. On the other hand, an IIR filter involves both feedforward and feedback. The presence of feedback means that portions of the filter output and possibly other internal variables in the filter are fed back to the input. Consequently, unless it is properly designed, feedback in the filter can indeed make it unstable, with the result that the filter oscillates; this kind of operation is clearly unacceptable when the requirement is that of filtering, for which stability is a "must." By itself, the stability problem in IIR filters is manageable; however, when the filter is also required to be adaptive, which brings stability problems of its own, the inclusion of adaptivity combined with the feedback that is inherently present in an IIR filter makes a difficult problem that much more difficult to handle. It is for this reason that, in the majority of applications requiring the use of adaptivity, the use of an FIR filter is preferred over an IIR filter, even though the latter is less demanding in computational requirements.
Turning next to the issue of what criterion to choose for the statistical optimization of the filter, there are several possibilities:
1. Mean-square value of the estimation error.
2. Expectation of the absolute value of the estimation error.
3. Expectation of third or higher powers of the absolute value of the estimation error.
Option 1 has a clear advantage over the other two, because it leads to tractable mathematics. In particular, choosing the mean-square error criterion results in a second-order dependence for the cost function on the unknown coefficients in the impulse response of the filter. Moreover, the cost function has a distinct minimum that uniquely defines the optimum statistical design of the filter.
We may now summarize the essence of the filtering problem by making the following statement:
Design a linear discrete-time filter whose output y(n) provides an estimate of a desired response d(n), given a set of input samples x(0), x(1), x(2), ..., such that the mean-square value of the estimation error e(n), defined as the difference between the desired response d(n) and the actual response y(n), is minimized.
There are two basic approaches to solving this optimization problem. One approach leads to what is known as the principle of orthogonality. The other approach highlights the error-performance surface that describes the dependence of the cost function on the filter coefficients. We will proceed by deriving the principle of orthogonality first, because the derivation is relatively simple and because the result is highly insightful. Assume that x(n) and d(n) are realizations of a jointly wide-sense stationary (WSS) process with zero mean. Suppose now we want to find a linear estimate of d(n) based on the L most recent samples of x(n), i.e.,

y(n) = w^T X(n) = Σ_{k=0}^{L−1} w_k x(n−k),   w, X(n) ∈ R^L,  n = 0, 1, 2, ...   (1)
The introduction of a particular criterion to quantify how well d(n) is estimated by y(n) is required. We adopt the mean-square error (MSE) cost function

J_MSE(w) = E[ |e(n)|^2 ]   (2)

e(n) = d(n) − y(n) = d(n) − w^T X(n)   (3)

where E[·] is the expectation operator and e(n) is the estimation error. Then, the estimation problem can be seen as finding the vector w that minimizes the cost function J_MSE(w). The solution to this problem is sometimes called the stochastic least-squares solution. If we choose the MSE cost function (2), it can be expanded as

J_MSE(w) = E[ |d(n)|^2 − 2 d(n) X^T(n) w + w^T X(n) X^T(n) w ]   (4)
As this is a quadratic form, the optimal solution will be at the point where the
gradient of the cost function with respect to the filter coefficients vanishes:

∇_w J_MSE(w) = ∂J_MSE / ∂w = 0_{L×1}   (5)
or, in other words, the partial derivative of J_MSE with respect to each coefficient w_k should be zero. Under this set of conditions the filter is said to be optimum in the mean-square-error sense. Using (2) and (3), we can compute the gradient as
∂J_MSE / ∂w = 2 E[ e(n) ∂e(n)/∂w ] = −2 E[ e(n) X(n) ]   (6)
or equivalently, at the optimum,

E[ e(n) X(n) ] = 0_{L×1}   (7)
This is called the principle of orthogonality, and it implies that the optimal condition is achieved if and only if the error e(n) is decorrelated from the samples x(n−k), k = 0, 1, ..., L−1. Actually, the error will also be decorrelated from the optimum filter output y_opt(n) = w_opt^T X(n), since

E[ e_min(n) y_opt(n) ] = E[ e_min(n) w_opt^T X(n) ] = w_opt^T E[ e_min(n) X(n) ] = 0   (9)
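To make the preceding derivation concrete, the following MATLAB sketch (illustrative only; the AR(1) input, the reference system h, the filter length L and all constants are assumptions, not part of the original derivation) estimates the optimum coefficients from sample statistics and checks the orthogonality condition (7) numerically:

% Minimal sketch: Wiener solution via the normal equations, plus a
% numerical check of the orthogonality principle. All signal-model
% choices below are illustrative assumptions.
randn('state',0);                        % fixed seed for repeatability
Ns = 1e5; L = 8;
x = filter(1,[1 -0.8],randn(Ns,1));      % zero-mean WSS (AR(1)) input
h = [0.5; -0.3; 0.2];                    % assumed system generating d(n)
d = filter(h,1,x) + 0.1*randn(Ns,1);     % desired response, lightly noisy

X = zeros(Ns,L);                         % rows are X(n)=[x(n)..x(n-L+1)]
for k = 1:L
    X(k:end,k) = x(1:end-k+1);
end
R = (X'*X)/Ns;                           % sample estimate of E[X X^T]
p = (X'*d)/Ns;                           % sample estimate of E[d X]
w = R\p;                                 % optimum weights, from (5)-(7)

e = d - X*w;                             % estimation error e(n)
orth = (X'*e)/Ns;                        % approximates E[e(n) x(n-k)], eq. (7)
fprintf('max |E[e(n)x(n-k)]| = %.2e\n', max(abs(orth)));
Jmse  = mean(e.^2);                      % minimum MSE, eq. (12)
kappa = Jmse/var(d);                     % normalized MSE of (16)

Because w solves the sample normal equations exactly, the printed correlation is at machine-precision level, which is the numerical counterpart of (7).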
Fig. 2.2 Vector interpretation of the estimate y_opt(n), the desired response d, and the minimum error e_min (illustrated for L = 2)
When the filter operates in its optimum condition, the estimate of the desired response defined by the filter output y_opt(n) and the corresponding estimation error e_min(n) are orthogonal to each other, and an analogous geometric picture exists at the output of the optimum filter, as illustrated in Fig. 2.2 for the case L = 2. In this figure, the desired response, the filter output, and the corresponding estimation error are represented by vectors labeled d, y_opt, and e_min, respectively. We see that for the optimum filter the vector representing the estimation error is normal (i.e., perpendicular) to the vector representing the filter output. It should, however, be kept in mind that this is only an analogy, where random variables and expectations are replaced with vectors and vector inner products, respectively. Also, for obvious reasons, the geometry depicted in Fig. 2.2 may be viewed as a statistical version of the Pythagorean theorem. Writing the desired response as

d(n) = y_opt(n) + e_min(n)   (11)

and rearranging the terms, we have the minimum cost

J_MSE = E[ |e_min(n)|^2 ]   (12)

Hence, evaluating the mean-square values of both sides of (11), and applying the orthogonality relation (9), we obtain

σ_d^2 = σ_{y_opt}^2 + J_MSE   (13)
where σ_d^2 is the variance of the desired response, and σ_{y_opt}^2 is the variance of the estimate y_opt; both of these random variables are assumed to be of zero mean. Solving (13) for the minimum mean-square error gives

J_MSE = σ_d^2 − σ_{y_opt}^2   (14)
This relation shows that, for the optimum filter, the minimum mean-square error equals the difference between the variance of the desired response and the variance of the estimate that the filter produces at its output. Normalized to σ_d^2, the minimum value of the mean-squared error always lies between zero and one. We may write

J_MSE / σ_d^2 = 1 − σ_{y_opt}^2 / σ_d^2   (15)

Clearly, this is possible because σ_d^2 is never zero, except in the trivial case of a desired response that is zero for all n. Defining the normalized mean-square error as

κ = J_MSE / σ_d^2 = 1 − σ_{y_opt}^2 / σ_d^2   (16)

we note that the ratio κ can never be negative and the ratio σ_{y_opt}^2 / σ_d^2 is always nonnegative, so that

0 ≤ κ ≤ 1   (17)

If κ is zero, the optimum filter operates perfectly, in the sense that there is complete agreement between the estimate y_opt(n) at the filter output and the desired response d(n). If κ is unity, there is no agreement whatsoever between these two quantities; this corresponds to the worst possible situation.
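As a quick numeric illustration (the variances here are assumed values chosen only for the example): if σ_d^2 = 1.0 and the optimum filter achieves σ_{y_opt}^2 = 0.75, then (16) gives κ = 1 − 0.75/1.0 = 0.25. The filter thus accounts for 75% of the power of the desired response, and κ lies well inside the range (17).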
Returning to speech enhancement, the noisy speech signal is modeled as

y(n) = s(n) + d(n)   (1)

where s(n) is the original clean speech signal and d(n) is the additive random noise signal, uncorrelated with the original signal. Taking the DFT of the observed signal gives

Y(m, k) = S(m, k) + D(m, k)   (2)

where m = 1, 2, ..., M is the frame index, k = 1, 2, ..., K is the frequency-bin index, M is the total number of frames and K is the frame length; Y(m, k), S(m, k) and D(m, k) represent the short-time spectral components of y(n), s(n) and d(n), respectively. The clean speech spectrum estimate Ŝ(m, k) is obtained by multiplying the noisy speech spectrum with a filter gain function, as given in the equation

Ŝ(m, k) = H(m, k) Y(m, k)   (3)

where H(m, k) is the noise suppression filter gain function (the conventional Wiener filter (WF)), which is derived according to the MMSE estimator:

H(m, k) = ξ(m, k) / (1 + ξ(m, k))   (4)

ξ(m, k) = Γ_s(m, k) / Γ_d(m, k)   (5)

where Γ_d(m, k) = E{|D(m, k)|^2} and Γ_s(m, k) = E{|S(m, k)|^2} represent the estimated noise power spectrum and the clean speech power spectrum, respectively. The a posteriori SNR estimate is given by

γ(m, k) = |Y(m, k)|^2 / Γ_d(m, k)   (6)

An estimate ξ̂(m, k) of ξ(m, k) is given by the well-known decision-directed approach and is expressed as

ξ̂(m, k) = α |H(m−1, k) Y(m−1, k)|^2 / Γ_d(m, k) + (1 − α) P[ V(m, k) ]   (7)

where V(m, k) = γ(m, k) − 1, P[x] = x if x ≥ 0 and P[x] = 0 otherwise. The noise suppression gain function is then chosen as the Wiener filter of (4).
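A minimal one-frame MATLAB sketch of this gain computation follows (the function name, argument names and the value of α are illustrative assumptions; the full frame loop, windowing and overlap-add synthesis are omitted):

function [S_hat, H, xi_hat] = dd_wiener_frame(Y, Yprev, Hprev, GammaD, alpha)
% One-frame Wiener gain with decision-directed a priori SNR, eqs. (3)-(7).
% Y, Yprev : current and previous noisy spectra (column vectors)
% Hprev    : gain applied to the previous frame
% GammaD   : estimated noise power spectrum per bin
% alpha    : decision-directed smoothing constant (e.g. 0.98; assumed)
gamma_post = (abs(Y).^2)./GammaD;                 % a posteriori SNR, eq. (6)
V = max(gamma_post - 1, 0);                       % half-wave rectification P[.]
xi_hat = alpha*(abs(Hprev.*Yprev).^2)./GammaD ... % decision-directed rule, eq. (7)
       + (1-alpha)*V;
H = xi_hat./(1 + xi_hat);                         % Wiener gain, eq. (4)
S_hat = H.*Y;                                     % enhanced spectrum, eq. (3)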
CHAPTER 3
PERCEPTUAL WIENER FILTERING

The conventional Wiener filter attenuates the background noise but does not eliminate it [15]. The musical noise that remains is perceptually annoying. In an effort to make the residual noise perceptually inaudible, many perceptual speech enhancement methods have been proposed which incorporate the auditory masking properties of the human ear and filter only the noise components that exceed the signal masking threshold [9, 13]. Figure 3.1 depicts the complete block diagram of the proposed method.
The perceptual Wiener filter (PWF) gain function H_1(m, k) is calculated based on a cost function J, which is defined as

J = E[ |S(m, k) − Ŝ(m, k)|^2 ]   (8)

Substituting (2) and (3) in (8) results in

J = d_i + r_i   (9)

where d_i = (H_1(m, k) − 1)^2 E[|S(m, k)|^2] and r_i = H_1^2(m, k) E[|D(m, k)|^2] represent the speech distortion energy and the residual noise energy, respectively. To keep the residual noise inaudible, it is constrained to lie below the noise masking threshold:

r_i ≤ T(m, k)   (10)

By including the above constraint, substituting Γ_d(m, k) = E{|D(m, k)|^2} and Γ_s(m, k) = E{|S(m, k)|^2}, differentiating the resulting cost function with respect to H_1(m, k) and equating to zero, the perceptually defined Wiener filter gain function is obtained as

H_1(m, k) = Γ_s(m, k) / ( Γ_s(m, k) + max(Γ_d(m, k) − T(m, k), 0) )   (12)
Here T(m, k) is the noise masking threshold, which is estimated from the noisy speech spectrum based on [16]. The a priori SNR and the noise power spectrum were estimated using the two-step a priori SNR estimator proposed in [15] and the weighted noise estimation method proposed in [17], respectively.
FIG 3.1 Block diagram of the proposed scheme: the noisy signal is analyzed, the noise is estimated, the PWF gain is computed and weighted by the ATH-based factor (WPWF), and the enhanced signal is reconstructed from the noisy phase by IFFT and overlap-add
Although considerable progress has been achieved with perceptual methods, most of them still leave annoying residual musical noise.
The enhanced speech signal obtained using the above-mentioned perceptual Wiener filter still contains some residual noise, because only the noise above the noise masking threshold is filtered, while the noise below the noise masking threshold remains. This residual noise can be further suppressed by introducing a weighting factor W(m, k) derived from the absolute threshold of hearing ATH(m, k). This weighting factor is used to weight the perceptual Wiener filter, and the gain function H_2(m, k) of the proposed method becomes

H_2(m, k) = H_1(m, k) W(m, k)   (16)
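A short MATLAB sketch of (12) and (16) follows (the function and argument names are illustrative; the weighting factor W is assumed to be computed elsewhere from ATH(m, k), since its defining equation is not reproduced above):

function H2 = pwf_gain(GammaS, GammaD, T, W)
% Perceptual Wiener gain of (12), weighted by the ATH-based factor, (16).
% GammaS, GammaD : clean-speech and noise power spectra per bin
% T              : noise masking threshold per bin
% W              : perceptual weighting factor derived from ATH(m,k)
H1 = GammaS./(GammaS + max(GammaD - T, 0));  % eq. (12): only the noise
                                             % exceeding the threshold is filtered
H2 = H1.*W;                                  % eq. (16)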
CHAPTER 4
APPLICATIONS OF PWF
In-car systems typically enable speech recognition by a manual control input, for example a finger control on the steering wheel, and activation is signalled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for recognition. Simple voice commands may be used to initiate phone calls, select radio stations or play music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recent car models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to memorize a set of fixed command words.
In the health care sector, speech recognition can be implemented in the front end or back end of the medical documentation process. In front-end speech recognition, the provider dictates into a speech recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. In back-end or deferred speech recognition, the provider dictates into a digital dictation system, the voice is routed through a speech recognition machine, and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and the report finalised. Deferred speech recognition is widely used in the industry currently. The ergonomic gains of using speech recognition to enter structured discrete data are relatively minimal for people who are sighted and who can operate a keyboard and mouse. A more significant issue is that most EHRs have not been expressly tailored to take advantage of voice recognition capabilities. By contrast, many highly customized systems for radiology or pathology dictation implement voice "macros", where the use of certain phrases, e.g., "normal report", will automatically fill in a large number of default values and/or generate boilerplate, which will vary with the type of the exam, e.g., a chest X-ray vs. a gastrointestinal contrast series for a radiology system.
Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice instead of having to look at the screen and keyboard. People with physical disabilities can likewise use the technology to freely enjoy searching the Internet or using a computer at home without having to operate a mouse and keyboard.
CHAPTER 5
SIMULATION RESULTS
To evaluate the performance of the proposed speech enhancement method, simulations are carried out with NOIZEUS, a noisy speech database that contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by real-world noises at SNR levels of 0 dB, 5 dB, 10 dB and 15 dB. In this evaluation only five noise types are considered: babble, car, train, airport and street noise. The objective quality measures used for the evaluation of the proposed speech enhancement method are the segmental SNR and PESQ measures [19]. It is well known that the segmental SNR is more accurate in indicating speech distortion than the overall SNR: the higher the segmental SNR, the weaker the speech distortion. Likewise, a higher PESQ score indicates better perceived quality of the processed signal [19]. The performance of the proposed method is compared with the Wiener filter (WF) and the perceptual Wiener filter (PWF).
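For reference, the frame-based segmental SNR can be written as follows (a standard definition consistent with the evaluation code in Appendix I; s(n) is the clean speech, ŝ(n) the enhanced speech, N the frame length and M the number of frames):

SegSNR = (10/M) Σ_{m=0}^{M−1} log_10 [ Σ_{n=Nm}^{Nm+N−1} s^2(n) / Σ_{n=Nm}^{Nm+N−1} (s(n) − ŝ(n))^2 ]

with each frame's term clamped to a fixed range (e.g., −10 dB to 35 dB) before averaging.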
Table 5.1 Segmental SNR (dB) for the Wiener filter (WF), the perceptual Wiener filter (PWF) and the proposed method

Noise type   Input SNR (dB)     WF      PWF    Proposed method
Babble             0          -4.59   -0.61        0.22
                   5          -1.39    0.01        0.32
                  10           0.02    0.65        2.14
                  15           0.75    2.71        3.97
Car                0          -3.93   -0.24        0.85
                   5          -1.65    0.52        1.20
                  10           0.69    0.70        2.37
                  15           0.72    2.31        3.81
Train              0          -3.45   -0.49        0.15
                   5          -0.86    0.38        0.43
                  10          -0.39    0.77        2.20
                  15           0.75    2.62        3.50
Airport            0          -4.37   -0.24        0.19
                   5          -2.57    0.15        0.43
                  10          -0.06    0.14        1.09
                  15           0.75    1.88        3.65
Street             0          -2.88   -0.15        0.08
                   5          -2.13    0.61        0.73
                  10           0.69    1.20        2.70
                  15           0.77    2.25        3.42
The simulation results are summarized in Table 5.1. The proposed method leads to better denoising quality, and the largest improvements are obtained at high noise levels. The time-frequency distribution of speech signals provides more accurate information about the residual noise and speech distortion than the corresponding time-domain waveforms, so we compared the spectrograms produced by each method and confirmed a reduction of both the residual noise and the speech distortion. Figure 5.1 presents the spectrograms of the clean speech signal, the noisy speech signal, and the enhanced speech signal.
5.3 Spectrograms
FIG 5.1 Spectrograms of the clean, noisy and enhanced speech signals
CHAPTER 6
ADVANTAGES
DISADVANTAGES
CHAPTER 7
CONCLUSION
In this paper, an effective approach for suppressing the musical noise present after Wiener filtering has been introduced. Based on the perceptual properties of the human auditory system, a weighting factor accentuates the denoising process where noise is perceptually insignificant and prevents residual noise components from becoming audible in the absence of adjacent maskers. When the speech signal is additively corrupted by babble noise and car noise, the objective measure results showed that the proposed method outperforms both the conventional Wiener filter and the perceptual Wiener filter.
SCOPE OF PWF
REFERENCES
[1] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean
square error short-time spectral amplitude estimator,” IEEE Trans. Acoust.,
Speech, Signal Processing,vol. ASSP-32, pp. 1109–1121, Dec 1984.
[2] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted
by acoustic noise,” Proc. of ICASSP, 1979, vol. I, pp. 208–211.
[3] N.Virag, “Single channel speech enhancement based on masking properties of
the human auditory system,” IEEE Trans. Speech and Audio Processing, vol.
7, pp. 126–137, 1999.
[4] Y. Ephraim and H.L. Van Trees, “A signal subspace approach for speech
enhancement,” IEEE Trans. Speech and Audio Processing, vol. 3, pp. 251–
266, 1995.
[5] Y. Hu and P. Loizou, “Incorporating a psychoacoustic model in frequency
domain speech enhancement,” IEEE Signal Processing Letters, vol. 11(2), pp.
270–273, 2004.
[6] F. Jabloun and B. Champagne, “Incorporating the human hearing properties in
the signal subspace approach for speech enhancement,” IEEE Trans. Speech
and Audio Processing,vol. 11, pp. 700–708, 2003.
[7] Y.M. Cheng and D. O’Shaughnessy, “Speech enhancement based
conceptually on auditory evidence,” IEEE Trans. Signal Processing,
vol.39, no.9, pp.1943–1954, 1991.
[8] D. Tsoukalas, M. Paraskevas, and J. Mourjopoulos, “Speech enhancement
using psychoacoustic criteria,” IEEE ICASSP, pp.359–362, Minneapolis, MN,
1993.
[9] Y. Hu and P.C. Loizou, "A perceptually motivated approach for speech
enhancement," IEEE Trans. Speech Audio Processing, pp. 457-465. Sept.
2003.
[10] L. Lin, W. H. Holmes and E. Ambikairajah, “Speech denoising using
perceptual modification of Wiener filtering,” IEE Electronic Letters, vol. 38,
pp. 1486–1487, Nov 2002.
[11] C. Beaugeant, V. Turbin, P. Scalart, and A. Gilloire,
“New optimal filtering approaches for hands-free telecommunication
terminals,” Signal Processing, vol. 64, no. 1, pp. 33–47, Jan 1998.
[12] T. Lee and Kaisheng Yao, “Speech enhancement by perceptual filter
with sequential noise parameter estimation,” Proc. of ICASSP, vol. I, pp. 693–
696, 2004.
APPENDIX I
SOURCE CODES
1 Absolute Program:
function TH=absoluteth()
% Estimate the absolute threshold of hearing (ATH) as the peak power of a
% 4 kHz probe tone whose amplitude is one quantization step, i.e. the
% quietest tone the 16-bit representation can encode.
BLOCK=280;                                  % frame length in samples
FS=8000;                                    % sampling frequency in Hz
BITS=16;                                    % quantizer resolution
p=sin(2*pi*[1:BLOCK]*4000/FS)*(1/(2^BITS)); % definition of the probe tone
TH=max(abs(fft(p)).^2);                     % its peak power is the threshold
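A brief usage sketch (the dB conversion is added here only for illustration):

TH=absoluteth();                                   % threshold power estimate
fprintf('ATH reference power: %g (%.1f dB)\n',TH,10*log10(TH));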
2 Berouti Program:
signal=wavread('sp01_airport_sn0'); % noisy NOIZEUS file (airport noise, 0 dB)
fs=8000;
IS=.25;            % initial silence (noise-only) length in seconds
W=fix(.025*fs);    % window length: 25 ms
SP=.4;             % shift percentage: 40%
nfft=W;
wnd=hamming(W);
if isstruct(IS)    % optional struct input kept for compatibility
    W=IS.windowsize;
    SP=IS.shiftsize/W;
    nfft=IS.nfft;
    wnd=IS.window;
    if isfield(IS,'IS')
        IS=IS.IS;
    else
        IS=.25;
    end
end
NIS=fix((IS*fs-W)/(SP*W)+1); % number of initial silence segments
Gamma=2;                     % power-spectrum exponent
y=segment(signal,W,SP,wnd);  % chop the signal into overlapping frames
Y=fft(y,nfft);
YPhase=angle(Y(1:floor(end/2)+1,:)); % noisy phase, reused at synthesis
Y=abs(Y(1:floor(end/2)+1,:)).^Gamma; % power spectrogram
numberOfFrames=size(Y,2);
FreqResol=size(Y,1);
N=mean(Y(:,1:NIS),2);        % initial noise power spectrum estimate
NoiseCounter=0;
NoiseLength=9;               % smoothing factor for the noise updating
Beta=.03;                    % spectral floor
minalpha=1;                  % over-subtraction factor bounds
maxalpha=3;
minSNR=-5;                   % SNR range used to interpolate alpha
maxSNR=20;
alphaSlope=(minalpha-maxalpha)/(maxSNR-minSNR);
alphaShift=maxalpha-alphaSlope*minSNR;
BN=Beta*N;
for i=1:numberOfFrames
    [NoiseFlag,SpeechFlag,NoiseCounter,Dist]=vad(Y(:,i).^(1/Gamma),N.^(1/Gamma),NoiseCounter); % magnitude-distance VAD
    if SpeechFlag==0
        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1); % update noise estimate
        BN=Beta*N;
    end
    SNR=10*log(Y(:,i)./N);
    alpha=alphaSlope*SNR+alphaShift;
    alpha=max(min(alpha,maxalpha),minalpha);
    D=Y(:,i)-alpha.*N;       % Berouti over-subtraction
    X(:,i)=max(D,BN);        % floor the residual at Beta*N
end
output=OverlapAdd2(X.^(1/Gamma),YPhase,W,SP*W);
3 Evaluation Program
sample_rate=8000;
wave1='sp03';
clean_speech=wavread(wave1);   % reference clean speech
% wave='sp03_train_sn15';
% processed_speech=wavread(wave);
load signal                    % noisy input saved by the enhancement script
d1=signal;
load output                    % enhanced speech saved by the enhancement script
d=output;
lens=length(clean_speech);
lenes=length(d);
if lens>lenes                  % equalize lengths: pad or truncate the output
    out=[d' zeros(1,(lens-lenes))];
else
    out=d(1:lens)';
end
processed_speech=out';
% processed_speech=d;
clean_length = length(clean_speech);
processed_length = length(processed_speech);
% if (clean_length ~= processed_length)
% return
% end
% --------------------------------------------------------
% Global Variables
% --------------------------------------------------------
samplin_rate=8000;
winlength = round(30*samplin_rate/1000); % window length in samples (30 ms; restored from [19])
% M=0.4*fs;
M=fix(0.5*winlength);
L=winlength-M-1;
% L=numberOfFrames;
skiprate = floor(clean_length/L);        % window skip in samples
% N=fix(0.025*sample_rate);
% % winlength = N;
max_freq = samplin_rate/2;               % maximum bandwidth
num_crit = 25;                           % number of critical bands
USE_FFT_SPECTRUM = 1;                    % 1: FFT spectrum, 0: 10th-order LP spectrum
n_fft = 2^nextpow2(2*winlength);         % FFT size
n_fftby2 = n_fft/2;
Kmax = 20;                               % value suggested by Klatt, pg 1280
Klocmax = 1;                             % value suggested by Klatt, pg 1280
% --------------------------------------------------------
% Critical band filter definitions (center frequency and
% bandwidths in Hz)
% --------------------------------------------------------
cent_freq(16) = 1442.54; bandwidth(16) = 168.154;
bw_min = bandwidth(1);   % minimum critical bandwidth
% --------------------------------------------------------
% Set up the critical band filters. Note that Gaussianly shaped
% filters are used and the filter weights sum to one; weights that
% are equivalent to less than -30 dB are set to zero.
% --------------------------------------------------------
min_factor = exp(-30.0/(2.0*2.303));     % -30 dB point of filter
for i = 1:num_crit
    f0 = (cent_freq(i)/max_freq)*(n_fftby2); % center point of filter
    all_f0(i) = floor(f0);
    bw = (bandwidth(i)/max_freq)*(n_fftby2);
    norm_factor = log(bw_min) - log(bandwidth(i));
    j = 0:1:n_fftby2-1;
    crit_filter(i,:) = exp(-11*(((j - floor(f0)) ./bw).^2) + norm_factor);
    crit_filter(i,:) = crit_filter(i,:).*(crit_filter(i,:) > min_factor);
end
% --------------------------------------------------------
% For each frame of input speech, calculate the Weighted Spectral
% Slope Measure
% --------------------------------------------------------
num_frames = clean_length/skiprate - (winlength/skiprate); % number of frames
start = 1;                              % starting sample (restored from [19])
% window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
window = hamming(winlength);
for frame_count = 1:num_frames          % frame loop (restored from [19])
    % ----------------------------------------------------
    % (1) Get the frames for the clean and processed speech.
    % ----------------------------------------------------
    clean_frame = clean_speech(start:start+winlength-1);
    processed_frame = processed_speech(start:start+winlength-1);
    % clean_frame = clean_frame.*window';
    % processed_frame = processed_frame.*window;
    % ----------------------------------------------------
    % (2) Compute the power spectrum of the clean and processed frames.
    % ----------------------------------------------------
    if (USE_FFT_SPECTRUM)
        clean_spec     = (abs(fft(clean_frame,n_fft)).^2);
        processed_spec = (abs(fft(processed_frame,n_fft)).^2);
    else
        a_vec = zeros(1,n_fft);
        a_vec(1:11) = lpc(clean_frame,10);
        clean_spec = 1.0./(abs(fft(a_vec,n_fft)).^2)';
        a_vec = zeros(1,n_fft);
        a_vec(1:11) = lpc(processed_frame,10);
        processed_spec = 1.0./(abs(fft(a_vec,n_fft)).^2)';
    end
    % ----------------------------------------------------
    % (3) Compute the filterbank output energies (in dB).
    % ----------------------------------------------------
    for i = 1:num_crit
        clean_energy(i) = sum(clean_spec(1:n_fftby2) ...
            .*crit_filter(i,:)');
        processed_energy(i) = sum(processed_spec(1:n_fftby2) ...
            .*crit_filter(i,:)');
    end
    clean_energy = 10*log10(max(clean_energy,1E-10));
    processed_energy = 10*log10(max(processed_energy,1E-10));
    % ----------------------------------------------------
    % (4) Compute the spectral slope (dB[i+1] - dB[i]).
    % ----------------------------------------------------
    clean_slope = clean_energy(2:num_crit) - ...
        clean_energy(1:num_crit-1);
    processed_slope = processed_energy(2:num_crit) - ...
        processed_energy(1:num_crit-1);
    % ----------------------------------------------------
    % (5) Find the nearest peak locations in the spectra to
    % each critical band. If the slope is negative, we search
    % to the left; if positive, we search to the right.
    % ----------------------------------------------------
    for i = 1:num_crit-1
        % find the peaks in the clean speech signal
        if (clean_slope(i)>0)           % search to the right
            n = i;
            while ((n<num_crit) & (clean_slope(n) > 0))
                n = n+1;
            end
            clean_loc_peak(i) = clean_energy(n-1);
        else                            % search to the left
            n = i;
            while ((n>0) & (clean_slope(n) <= 0))
                n = n-1;
            end
            clean_loc_peak(i) = clean_energy(n+1);
        end
        % find the peaks in the processed speech signal
        if (processed_slope(i)>0)       % search to the right
            n = i;
            while ((n<num_crit) & (processed_slope(n) > 0))
                n = n+1;
            end
            processed_loc_peak(i) = processed_energy(n-1);
        else                            % search to the left
            n = i;
            while ((n>0) & (processed_slope(n) <= 0))
                n = n-1;
            end
            processed_loc_peak(i) = processed_energy(n+1);
        end
    end
    % ----------------------------------------------------
    % (6) Compute the WSS measure for this frame. This
    %     includes determination of the weighting function.
    % ----------------------------------------------------
    dBMax_clean = max(clean_energy);
    dBMax_processed = max(processed_energy);
    % The weights are obtained by averaging individual weighting
    % factors from the clean and processed frame.
    % These weights W_clean and W_processed should range from 0 to 1
    % and place more emphasis on spectral peaks and less emphasis on
    % slope differences in spectral valleys.
    Wmax_clean        = Kmax ./ (Kmax + dBMax_clean - ...
        clean_energy(1:num_crit-1));
    Wlocmax_clean     = Klocmax ./ (Klocmax + clean_loc_peak - ...
        clean_energy(1:num_crit-1));
    W_clean           = Wmax_clean .* Wlocmax_clean;
    Wmax_processed    = Kmax ./ (Kmax + dBMax_processed - ...
        processed_energy(1:num_crit-1));
    Wlocmax_processed = Klocmax ./ (Klocmax + processed_loc_peak - ...
        processed_energy(1:num_crit-1));
    W_processed       = Wmax_processed .* Wlocmax_processed;
    W = (W_clean + W_processed)./2.0;
    distortion(frame_count) = sum(W.*(clean_slope(1:num_crit-1) - ...
        processed_slope(1:num_crit-1)).^2);
    % This normalization is not part of Klatt's paper, but helps
    % to normalize the measure: scale by the sum of the weights.
    wssdistortion(frame_count) = distortion(frame_count)/sum(W);
    start = start + skiprate;   % advance to the next frame
end
Fwss=mean(wssdistortion);     % frame-averaged WSS distortion
% Outputsnr=10*log10(sum(processed_speech.^2)/sum((clean_speech- ...
%     processed_speech).^2));
Outputsnr=10*log10(sum(clean_speech.^2)/sum((clean_speech- ...
    processed_speech).^2));  % overall SNR (assignment target assumed)
% --------------------------------------------------------
% Global Variables
% --------------------------------------------------------
% samplin_rate=8000;
% % M=0.4*fs;
% M=fix(0.4*winlength);
% L=winlength-M-1;
% % L=numberOfFrames;
% skiprate = floor(length(signal)/L);
winlength = round(30*sample_rate/1000);  % window length in samples (30 ms; restored from [19])
skiprate  = floor(winlength/4);          % window skip in samples (restored from [19])
MIN_SNR   = -10;                         % minimum frame SNR in dB (assumed, as in [19])
MAX_SNR   = 35;                          % maximum frame SNR in dB (assumed, as in [19])
% --------------------------------------------------------
% For each frame of input speech, calculate the Segmental SNR
% --------------------------------------------------------
num_frames = clean_length/skiprate - (winlength/skiprate); % number of frames
start = 1;
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1))); % Hanning window
for frame_count = 1:num_frames   % frame loop (restored from [19])
    % ----------------------------------------------------
    % (1) Get the frames for the clean and processed speech.
    % ----------------------------------------------------
    clean_frame = clean_speech(start:start+winlength-1);
    processed_frame = processed_speech(start:start+winlength-1);
    clean_frame = clean_frame.*window;
    processed_frame = processed_frame.*window;
    % ----------------------------------------------------
    % (2) Compute the segmental SNR for this frame.
    % ----------------------------------------------------
    signal_energy = sum(clean_frame.^2);
    noise_energy = sum((clean_frame-processed_frame).^2);
    segmental_snr(frame_count) = 10*log10(signal_energy/(noise_energy+eps)+eps);
    segmental_snr(frame_count) = max(segmental_snr(frame_count),MIN_SNR); % clamp
    segmental_snr(frame_count) = min(segmental_snr(frame_count),MAX_SNR);
    start = start + skiprate;
end
Segsnr=mean(segmental_snr);   % mean segmental SNR over all frames
% function Llr = llr(clean_speech, processed_speech, sample_rate)  % header fragment; function name assumed
% --------------------------------------------------------
% Check that the clean and processed speech have equal length.
% --------------------------------------------------------
% clean_length = length(clean_speech);
% processed_length = length(processed_speech);
% if (clean_length ~= processed_length)
%     disp('Error: Both Speech Files must be same length.');
%     return
% end
% --------------------------------------------------------
% Global Variables
% --------------------------------------------------------
winlength = round(30*sample_rate/1000);  % window length in samples (30 ms; restored from [19])
skiprate  = floor(winlength/4);          % window skip in samples (restored from [19])
if sample_rate<10000
    P=10;  % LPC analysis order
else
    P=16;  % this could vary depending on sampling frequency
end
% --------------------------------------------------------
% For each frame of input speech, calculate the Log
% Likelihood Ratio
% --------------------------------------------------------
num_frames = clean_length/skiprate - (winlength/skiprate); % number of frames
start = 1;
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1))); % Hanning window
for frame_count = 1:num_frames   % frame loop (restored from [19])
    % ----------------------------------------------------
    % (1) Get the frames for the test and reference speech.
    % ----------------------------------------------------
    clean_frame = clean_speech(start:start+winlength-1);
    processed_frame = processed_speech(start:start+winlength-1);
    clean_frame = clean_frame.*window;
    processed_frame = processed_frame.*window;
    % ----------------------------------------------------
    % (2) Get the autocorrelation lags and LPC parameters used
    %     to compute the LLR measure.
    % ----------------------------------------------------
    [R_clean, Ref_clean, A_clean] = ...
        lpcoeff(clean_frame, P);
    [R_processed, Ref_processed, A_processed] = ...
        lpcoeff(processed_frame, P);
    % ----------------------------------------------------
    % (3) Compute the LLR measure for this frame.
    % ----------------------------------------------------
    numerator   = A_processed*toeplitz(R_clean)*A_processed';
    denominator = A_clean*toeplitz(R_clean)*A_clean';
    llrdistortion(frame_count) = log(numerator/denominator);
    start = start + skiprate;
end
Llr=mean(llrdistortion);  % mean log-likelihood ratio over all frames
function T=noisemaskingthreshold(a)
% Johnston-style noise masking threshold of one signal frame 'a'.
% Statements lost in the original listing are restored below and
% marked "(assumed)" where the reconstruction is not certain.
FS=8000;
a2=fft(a);
Sp=abs(a2).^2;             % power spectrum of the frame
BLOCK=280;                 % length of each frame
BITS=16;
ZT=ceil(barkme2(FS/2));    % number of critical bands up to FS/2 (assumed)
vlimit=zeros(2,ZT);        % limits of each critical band in the frequency domain f
vlimit(1,1)=1;
vlimit(2,ZT)=BLOCK/2;
for ii=1:BLOCK/2
    f=ii*FS/BLOCK;         % bin index -> frequency in Hz (assumed mapping)
    z=barkme2(f);          % freq -> z (Bark scale)
    if z==0
        z=eps;             % guard for the lowest bin (assumed)
    end
    vf(ii)=f;              % frequency
    vz(ii)=z;              % bark
    if ii>1 && ceil(z)~=ceil(vz(ii-1)) % boundary of a critical band (condition assumed)
        vlimit(2,ceil(z)-1)=ii-1;
        vlimit(1,ceil(z))=ii;
    end
end
% absolute threshold of hearing: peak power of a 4 kHz probe tone at 1 LSB
p=sin(2*pi*[1:BLOCK]*4000/FS)*(1/(2^BITS)); % definition of the probe tone
TH=max(abs(fft(p)).^2);
% Sw=fft(signal);
Sp=abs(a2).^2;             %##power spectrum
Spz=zeros(1,ZT);
for ii=1:ZT,               % energy per critical band
    Spz(ii)=sum(Sp([vlimit(1,ii):vlimit(2,ii)]));
end
% Bark-domain spreading function (Schroeder form; restored as an assumption)
jj=-(ZT-1):(ZT-1);
B=10.^((15.81+7.5*(jj+0.474)-17.5*sqrt(1+(jj+0.474).^2))/10);
Sm=conv(Spz,B);            % spread Bark spectrum
temp=round(length(B)/2);
Sm=Sm(temp:temp+ZT-1);     % trim convolution tails (assumed)
Gm=exp(mean(log(Sm+eps))); % geometric mean
Am=mean(Sm);               % arithmetic mean
SFM=10*log10(Gm/Am);       % spectral flatness measure in dB
SFMmax=-60;                % maximum in dB
alpha=min(SFM/SFMmax,1);   % tonality: 1 -> tone-like, 0 -> noise-like
O=alpha*(14.5+[1:ZT]+0.5) + (1-alpha)*5.5;  % masking offset per band
Traw=10.^(log10(Sm)-(O/10));
% normalization of the spread threshold (renormalization gain assumed)
Pz=conv(ones(1,ZT),B);
Pz=Pz(temp:temp+ZT-1);
Tnorm=Traw./Pz;
T=max(Tnorm,TH);           % floor at the absolute threshold (assumed; listing had T=Tnorm)
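A hypothetical call for a single 280-sample frame (the random frame is only a stand-in for a real windowed speech frame):

frame=randn(1,280);                 % stand-in for one windowed speech frame
T=noisemaskingthreshold(frame);     % masking threshold per critical band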
1 SSB Berouti1 Program
clear all
close all
IS=.25;                      % initial silence (noise-only) length in seconds
fs=8000;
[data,FS,BITS]=wavread('sp01_babble_sn0');
% figure
subplot(2,2,1),plot(data);
xlabel('Time(ms)'),ylabel('amplitude');
% figure
data=data(1:length(data)/1)';
subplot(2,2,2),specgram(data,[],8000);
xlabel('Time(ms)'),ylabel('Normalized Freq.(Hz)');
title('Spectrogram of Original Audio');
W=fix(.025*fs);              % window length: 25 ms
SP=.4;                       % shift percentage: 40%
Gamma=2;                     % power-spectrum exponent
nfft=W;
wnd=hamming(W);
NIS=fix((IS*fs-W)/(SP*W)+1); % number of initial silence segments
y=segment(data,W,SP,wnd);
Y=fft(y,nfft);
YPhase=angle(Y(1:fix(end/2)+1,:)); % noisy phase, reused at synthesis
Y=abs(Y(1:fix(end/2)+1,:)).^Gamma; % power spectrogram
numberOfFrames=size(Y,2);
FreqResol=size(Y,1);
NoiseCounter=0;
NoiseLength=9;               % smoothing factor for the noise updating
Beta=.03;                    % spectral floor
minalpha=1;                  % over-subtraction factor bounds
maxalpha=3;
minSNR=-5;                   % SNR range used to interpolate alpha
maxSNR=20;
alphaSlope=(minalpha-maxalpha)/(maxSNR-minSNR);
alphaShift=maxalpha-alphaSlope*minSNR;
N=mean(Y(:,1:NIS),2);        % initial noise power spectrum estimate
BN=Beta*N;
for i=1:numberOfFrames
    [NoiseFlag,SpeechFlag,NoiseCounter,Dist]=vad(Y(:,i).^(1/Gamma),N.^(1/Gamma),NoiseCounter); % magnitude-distance VAD
    if SpeechFlag==0
        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1); % update noise estimate
        BN=Beta*N;
    end
    SNR=10*log(Y(:,i)./N);
    alpha=alphaSlope*SNR+alphaShift;
    alpha=max(min(alpha,maxalpha),minalpha);
    D=Y(:,i)-alpha.*N;       % Berouti over-subtraction
    X(:,i)=max(D,BN);        % floor the residual at Beta*N
end
output=OverlapAdd2(X.^(1/Gamma),YPhase,W,SP*W);
subplot(2,2,3),plot(output);
xlabel('Time(ms)'),ylabel('amplitude');
subplot(2,2,4),specgram(output,[],8000);
xlabel('Time(ms)'),ylabel('Normalized Freq.(Hz)');
title('Spectrogram of Enhanced Audio');
2 SSB Berouti2 Program
function output=SSBerouti791(signal,fs,IS)
% OUTPUT=SSBEROUTI791(S,FS,IS)
% Spectral subtraction based on Berouti 79. Power spectral
% subtraction with adjusted alpha; the adjustment is according
% to SNR. Beta is set to 0.03. S is the noisy signal, FS is the
% sampling frequency and IS is the initial silence (noise-only)
% length in seconds (default value is .25 sec)
%
% Required functions:
%   SEGMENT
%   VAD
%
% Sep-04
% Esfandiar Zavarehei
if (nargin<3 | isstruct(IS))
    IS=.25; % seconds
end
W=fix(.025*fs);   % window length: 25 ms
SP=.4;            % shift percentage: 40%
nfft=W;
wnd=hamming(W);
if (nargin>=3 & isstruct(IS)) % struct input kept for compatibility
    W=IS.windowsize;
    SP=IS.shiftsize/W;
    nfft=IS.nfft;
    wnd=IS.window;
    if isfield(IS,'IS')
        IS=IS.IS;
    else
        IS=.25;
    end
end
NIS=fix((IS*fs-W)/(SP*W)+1); % number of initial silence segments
Gamma=2;                     % power-spectrum exponent
y=segment(signal,W,SP,wnd);
Y=fft(y,nfft);
YPhase=angle(Y(1:fix(end/2)+1,:)); % noisy phase, reused at synthesis
Y=abs(Y(1:fix(end/2)+1,:)).^Gamma; % power spectrogram
numberOfFrames=size(Y,2);
FreqResol=size(Y,1);
N=mean(Y(:,1:NIS),2);        % initial noise power spectrum estimate
NoiseCounter=0;
NoiseLength=9;               % smoothing factor for the noise updating
Beta=.03;                    % spectral floor
minalpha=1;
maxalpha=3;
minSNR=-5;
maxSNR=20;
alphaSlope=(minalpha-maxalpha)/(maxSNR-minSNR);
alphaShift=maxalpha-alphaSlope*minSNR;
BN=Beta*N;
for i=1:numberOfFrames
    [NoiseFlag,SpeechFlag,NoiseCounter,Dist]=vad(Y(:,i).^(1/Gamma),N.^(1/Gamma),NoiseCounter); % magnitude-distance VAD
    if SpeechFlag==0
        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1); % update noise estimate
        BN=Beta*N;
    end
    SNR=10*log(Y(:,i)./N);
    alpha=alphaSlope*SNR+alphaShift;
    alpha=max(min(alpha,maxalpha),minalpha);
    D=Y(:,i)-alpha.*N;       % over-subtraction
    X(:,i)=max(D,BN);        % spectral floor at Beta*N
end
output=OverlapAdd2(X.^(1/Gamma),YPhase,W,SP*W);
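A possible way to run the function above on one of the NOIZEUS files used in this project (file name taken from the scripts in this appendix; listening is optional):

[noisy,fs]=wavread('sp01_airport_sn0');  % noisy NOIZEUS sentence
enhanced=SSBerouti791(noisy,fs,.25);     % 0.25 s of leading noise assumed
% soundsc(enhanced,fs);                  % listen to the result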
3 Wiener Filter (Scalart) Program
signal=wavread('sp01_airport_sn0'); % noisy input (NOIZEUS file)
fs=8000;
IS=.25;             % initial silence (noise-only) length in seconds
W=fix(.025*fs);     % window length: 25 ms
SP=.4;              % shift percentage: 40%
wnd=hamming(W);
if isstruct(IS)     % optional struct input kept for compatibility
    W=IS.windowsize;
    SP=IS.shiftsize/W;
    %nfft=IS.nfft;
    wnd=IS.window;
    if isfield(IS,'IS')
        IS=IS.IS;
    else
        IS=.25;
    end
end
% ......................................UP TO HERE
pre_emph=0;
signal=filter([1 -pre_emph],1,signal);  % optional pre-emphasis (disabled here)
NIS=fix((IS*fs-W)/(SP*W)+1);  % number of initial silence segments
y=segment(signal,W,SP,wnd);   % chop the signal into overlapping frames
Y=fft(y);
YPhase=angle(Y(1:fix(end/2)+1,:)); % noisy speech phase (reused at synthesis)
Y=abs(Y(1:fix(end/2)+1,:));   % magnitude spectrogram
numberOfFrames=size(Y,2);
FreqResol=size(Y,1);
N=mean(Y(:,1:NIS)')';         % initial noise magnitude spectrum mean
LambdaD=mean((Y(:,1:NIS)').^2)'; % initial noise power spectrum variance
alpha=.99;                    % decision-directed smoothing for the new xi
NoiseCounter=0;
NoiseLength=9;                % smoothing factor for the noise updating
G=ones(size(N));              % initial gain used in calculating the new xi
Gamma=G;
X=zeros(size(Y));             % memory allocation
h=waitbar(0,'Wait...');
for i=1:numberOfFrames
    if i<=NIS                 % initial silence: treat as noise only
        SpeechFlag=0;
        NoiseCounter=100;
    else
        [NoiseFlag, SpeechFlag, NoiseCounter, Dist]=vad(Y(:,i),N,NoiseCounter); % magnitude-distance VAD
    end
    if SpeechFlag==0          % no speech: update noise parameters
        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1); % update noise mean
        LambdaD=(NoiseLength*LambdaD+(Y(:,i).^2))./(1+NoiseLength); % update noise variance
    end
    gammaNew=(Y(:,i).^2)./LambdaD;  % a posteriori SNR
    xi=alpha*(G.^2).*Gamma+(1-alpha).*max(gammaNew-1,0); % decision-directed a priori SNR
    Gamma=gammaNew;
    G=(xi./(xi+1));           % Wiener gain
    X(:,i)=G.*Y(:,i);         % enhanced magnitude spectrum
    waitbar(i/numberOfFrames,h,num2str(fix(100*i/numberOfFrames)));
end
close(h);
output=OverlapAdd2(X,YPhase,W,SP*W);    % overlap-add synthesis of speech
output=filter(1,[1 -pre_emph],output);  % undo the effect of pre-emphasis
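After the enhancement script finishes, the evaluation program (item 3 above) expects the noisy input and the enhanced output as MAT-files; a small helper sketch (variable names match those used in this appendix):

save signal signal   % noisy input, loaded by the evaluation program
save output output   % enhanced speech, loaded by the evaluation program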