
CHAPTER 1

1.1 INTRODUCTION

The objective of the speech enhancement process is to improve the quality and intelligibility of speech in noisy environments. The problem has been widely discussed over the years, and many approaches have been proposed, such as subtractive-type algorithms [1-4] and perceptual Wiener filtering algorithms. Among them, spectral subtraction and the Wiener filtering algorithms are widely used because of their low computational complexity and impressive performance. However, these algorithms leave a residual noise known as musical noise, which is quite annoying. In order to reduce the effect of musical noise, several solutions have been proposed. Some involve adjusting the parameters of spectral subtraction so as to offer more flexibility, as in [2] and [3]. Others, such as the method proposed in [4], are based on signal subspace approaches. Despite the effectiveness of these techniques in improving the signal-to-noise ratio (SNR), eliminating or reducing musical noise remains a challenge for many researchers.
1.2 OVERVIEW OF PWF

In the last few decades, the introduction of psychoacoustic models has attracted a great deal of interest. The objective is to improve the perceptual quality of the enhanced signal. In [3], a psychoacoustic model is used to control the parameters of the spectral subtraction in order to find the best trade-off between noise reduction and speech distortion. To make musical noise inaudible, the linear estimator proposed in [5] incorporates the masking properties of the human auditory system. In [6], the masking threshold and an intermediate signal, which is slightly denoised and free of musical noise, are used to detect musical tones generated by spectral subtraction methods. This detection can be used by a post-processing stage aimed at reducing the detected tones. These perceptual speech enhancement systems reduce the musical noise but introduce some undesired distortion to the enhanced speech signal. When this distorted estimated speech signal is applied to recognition systems, their performance degrades drastically. The basic idea of the proposed method is to remove perceptually significant noise components from the noisy signal, so that the clean speech components are not affected by the processing. In addition, the technique requires very little a priori information about the features of the noise. In the present work, we propose to control the perceptual Wiener filter by a psychoacoustically motivated filter that can be regarded as a weighting factor. The purpose is to minimize the perception of musical noise without degrading the clarity of the enhanced speech.
CHAPTER 2

Linear Optimum Filtering

2.1 LPF Using Wiener Filter

Consider the block diagram of Fig. 2.1, built around a linear discrete-time filter. The filter input consists of a time series x(0), x(1), x(2), ..., and the filter is itself characterized by the impulse response w0, w1, w2, .... At some discrete time n, the filter produces an output denoted by y(n). This output is used to provide an estimate of a desired response denoted by d(n). With the filter input and the desired response representing single realizations of respective stochastic processes, the estimation is accompanied by an error with statistical characteristics of its own. In particular, we wish to make the estimation error e(n) "as small as possible" in some statistical sense. Two restrictions have so far been placed on the filter:

1. The filter is linear, which makes the mathematical analysis easy to handle.

2. The filter operates in discrete time, which makes it possible for the filter to

be implemented using digital hardware/software.

Fig. 2.1 Block diagram representation of the statistical filtering problem

The final details of the filter specification, however, depend on two other choices

that have to be made:

1. Whether the impulse response of the filter has finite or infinite duration.

2. The type of statistical criterion used for the optimization.

The choice of a finite-duration impulse response (FIR) or an infinite-duration impulse response (IIR) for the filter is dictated by practical considerations. The choice of a statistical criterion for optimizing the filter design is influenced by mathematical tractability. These two issues are considered in turn.

The initial development of the theory presented here includes FIR filters as a special case. However, for much of the material presented in this tutorial, we will confine our attention to the use of FIR filters. We do so for the following reason. An FIR filter is inherently stable, because its structure involves the use of forward paths only. In other words, the only mechanism for input-output interaction in the filter is via forward paths from the filter input to its output. Indeed, it is this form of signal transmission through the filter that limits its impulse response to a finite duration. On the other hand, an IIR filter involves both feedforward and feedback. The presence of feedback means that portions of the filter output and possibly other internal variables in the filter are fed back to the input. Consequently, unless it is properly designed, feedback in the filter can indeed make it unstable, with the result that the filter oscillates; this kind of operation is clearly unacceptable when the requirement is that of filtering, for which stability is a "must." By itself, the stability problem in IIR filters is manageable in both theoretical and practical terms. However, when the filter is required to be adaptive, bringing with it stability problems of its own, the inclusion of adaptivity combined with the feedback that is inherently present in an IIR filter makes a difficult problem that much more difficult to handle. It is for this reason that, in the majority of applications requiring the use of adaptivity, an FIR filter is preferred over an IIR filter, even though the latter is less demanding in computational requirements.

Turning next to the issue of what criterion to choose for statistical optimization, there are indeed several criteria that suggest themselves. Specifically, we may consider optimizing the filter design by minimizing a cost function, or index of performance, selected from the following short list of possibilities:

1. Mean-square value of the estimation error

2. Expectation of the absolute value of the estimation error

3. Expectation of third or higher powers of the absolute value of the estimation error

Option 1 has a clear advantage over the other two, because it leads to tractable mathematics. In particular, the choice of the mean-square error criterion results in a second-order dependence of the cost function on the unknown coefficients in the impulse response of the filter. Moreover, the cost function has a distinct minimum that uniquely defines the optimum statistical design of the filter. We may now summarize the essence of the filtering problem by making the following statement:

Design a linear discrete-time filter whose output y(n) provides an estimate of a desired response d(n), given a set of input samples x(0), x(1), x(2), ..., such that the mean-square value of the estimation error e(n), defined as the difference between the desired response d(n) and the actual response y(n), is minimized.

We may develop the mathematical solution to this statistical optimization problem by following two entirely different approaches that are complementary. One approach leads to the development of an important theorem commonly known as the principle of orthogonality. The other approach highlights the error-performance surface that describes the second-order dependence of the cost function on the filter coefficients. We will proceed by deriving the principle of orthogonality first, because the derivation is relatively simple and because the principle of orthogonality is highly insightful.

2.2 Principle of Orthogonality

Consider again the statistical filtering problem described in Fig. 2.1. We have a set of input samples {x(n)} and desired samples {d(n)} coming from a jointly wide-sense stationary (WSS) process with zero mean. Suppose now we want to find a linear estimate of d(n) based on the L most recent samples of x(n), i.e.,

$$y(n) = w^T X(n) = \sum_{k=0}^{L-1} w_k x(n-k), \qquad w,\, X(n) \in \mathbb{R}^L, \quad n = 0, 1, 2, \ldots \tag{1}$$
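As a small illustration (added here for concreteness; the weight vector and input below are arbitrary examples, not values from the text), the estimate in (1) is simply an FIR convolution of x(n) with w, which MATLAB computes directly with filter:

L = 3;                    % number of taps
w = [0.5; 0.3; 0.2];      % example weights w_0, w_1, w_2
x = randn(1000, 1);       % example zero-mean WSS input
y = filter(w, 1, x);      % y(n) = w'*X(n) = sum_k w_k * x(n-k)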

The introduction of a particular criterion to quantify how well d(n) is

estimated by y(n) would influence how the coefficients wk will be computed.

We propose to use the Mean Squared Error (MSE), which is defined by

$$J_{\mathrm{MSE}}(w) = E[|e(n)|^2] = E[|d(n) - y(n)|^2] \tag{2}$$

where E[·] is the expectation operator and e(n) is the estimation error. Then,

the estimation problem can be seen as finding the vector w that minimizes the

cost function J_MSE(w). The solution to this problem is sometimes called the

stochastic least squares solution. If we choose the MSE cost function (2), the

optimal solution to the linear estimation problem can be presented as:

$$w_{\mathrm{opt}} = \arg\min_{w \in \mathbb{R}^L} J_{\mathrm{MSE}}(w) \tag{3}$$

Replacing (1) in (2), the latter can be expanded as

$$J_{\mathrm{MSE}}(w) = E\left[|d(n)|^2 - 2 d(n) X^T(n) w + w^T X(n) X^T(n) w\right] \tag{4}$$

As this is a quadratic form, the optimal solution will be at the point where the

cost function has zero gradient, i.e.,

$$\nabla_w J_{\mathrm{MSE}}(w) = \frac{\partial J_{\mathrm{MSE}}}{\partial w} = 0_{L \times 1} \tag{5}$$

or, in other words, the partial derivative of J_MSE with respect to each coefficient w_k should be zero. Under this set of conditions, the filter is said to be optimum in the mean-squared-error sense. Using (1) in (2), we can compute the gradient as

$$\frac{\partial J_{\mathrm{MSE}}}{\partial w} = 2\,E\!\left[e(n)\,\frac{\partial e(n)}{\partial w}\right] = -2\,E[e(n) X(n)] \tag{6}$$

Then, at the minimum, the condition that should hold is:

$$E[e_{\min}(n) X(n)] = 0_{L \times 1} \tag{7}$$

or equivalently

$$E[e_{\min}(n)\, x(n-k)] = 0, \qquad k = 0, 1, 2, \ldots, L-1 \tag{8}$$

This is called the principle of orthogonality, and it implies that the optimal

condition is achieved if and only if the error e(n) is decorrelated from the

samples x(n−k), k = 0,1,..., L−1. Actually, the error will also be decorrelated

from the estimate y(n) since:

$$E[e_{\min}(n)\, y_{\mathrm{opt}}(n)] = E[e_{\min}(n)\, w_{\mathrm{opt}}^T X(n)] = w_{\mathrm{opt}}^T E[e_{\min}(n) X(n)] = 0 \tag{9}$$

Fig. 2.2 Geometric interpretation of the conditions at the output of the optimum filter (vectors d, y_opt, and e_min for the case L = 2)

We may thus state the corollary to the principle of orthogonality as follows:

When the filter operates in its optimum condition, the estimate of the desired response defined by the filter output, y_opt(n), and the corresponding estimation error, e_min(n), are orthogonal to each other.
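To make the principle concrete, the following small MATLAB sketch (an added illustration with an arbitrarily chosen model, not part of the derivation) solves the condition implied by (7), namely E[X X^T] w = E[d(n) X(n)], for L = 2 and verifies numerically that the resulting error is decorrelated from the input samples, as required by (8):

% Minimal sketch (hypothetical example): d(n) is a filtered version of
% x(n) plus noise; the optimal w should decorrelate e(n) from X(n).
N = 1e5;
x = randn(N, 1);                                 % zero-mean WSS input
d = filter([0.8 -0.4], 1, x) + 0.1*randn(N, 1);  % desired response

X  = [x(2:N), x(1:N-1)];        % rows are X(n) = [x(n), x(n-1)]
dn = d(2:N);
R  = (X' * X) / (N-1);          % estimate of E[X X^T]
p  = (X' * dn) / (N-1);         % estimate of E[d(n) X(n)]
w  = R \ p;                     % optimal weights, close to [0.8; -0.4]

e = dn - X * w;                 % estimation error e_min(n)
disp((X' * e) / (N-1));         % approximately zero, per equation (8)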

Equation (9) offers an interesting geometric interpretation of the conditions that exist at the output of the optimum filter, as illustrated in Fig. 2.2 for the case L = 2. In this figure, the desired response, the filter output, and the corresponding estimation error are represented by vectors labeled d, y_opt, and e_min, respectively. We see that for the optimum filter the vector representing the estimation error is normal (i.e., perpendicular) to the vector representing the filter output. It should, however, be emphasized that the situation depicted in Fig. 2.2 is merely an analogy, where random variables and expectations are replaced with vectors and vector inner products, respectively. Also, for obvious reasons, the geometry depicted in this figure may be viewed as a statistician's Pythagorean theorem.

2.3 Minimum Mean-Squared Error

When the linear discrete-time filter in Fig. 2.1 operates in its optimum condition, the estimation error takes on the following special form:

$$e_{\min}(n) = d(n) - y_{\mathrm{opt}}(n) \tag{10}$$

Rearranging the terms, we have:

$$d(n) = y_{\mathrm{opt}}(n) + e_{\min}(n) \tag{11}$$

Let J_MSE denote the minimum mean-squared error (MMSE), defined by

$$J_{\mathrm{MSE}} = E[|e_{\min}(n)|^2] \tag{12}$$

Hence, evaluating the mean-square values of both sides of (11), and applying to it the corollary to the principle of orthogonality described by (9), we get

$$\sigma_d^2 = \sigma_{y_{\mathrm{opt}}}^2 + J_{\mathrm{MSE}} \tag{13}$$

where $\sigma_d^2$ is the variance of the desired response and $\sigma_{y_{\mathrm{opt}}}^2$ is the variance of the estimate $y_{\mathrm{opt}}$; both of these random variables are assumed to have zero mean. Solving (13) for the MMSE, we get

$$J_{\mathrm{MSE}} = \sigma_d^2 - \sigma_{y_{\mathrm{opt}}}^2 \tag{14}$$

This relation shows that, for the optimum filter, the MMSE equals the difference between the variance of the desired response and the variance of the estimate that the filter produces at its output.

It is convenient to normalize the expression in (14) in such a way that the minimum value of the mean-squared error always lies between zero and one. We may do this by dividing both sides of (14) by $\sigma_d^2$, obtaining

$$\frac{J_{\mathrm{MSE}}}{\sigma_d^2} = 1 - \frac{\sigma_{y_{\mathrm{opt}}}^2}{\sigma_d^2} \tag{15}$$

Clearly, this is possible because $\sigma_d^2$ is never zero, except in the trivial case of a desired response d(n) that is zero for all n. Let

$$\kappa = \frac{J_{\mathrm{MSE}}}{\sigma_d^2} \tag{16}$$

The quantity κ is called the normalized mean-squared error, in terms of which we may rewrite (15) in the form

$$\kappa = 1 - \frac{\sigma_{y_{\mathrm{opt}}}^2}{\sigma_d^2} \tag{17}$$

We note that the ratio κ can never be negative, and the ratio $\sigma_{y_{\mathrm{opt}}}^2 / \sigma_d^2$ is always positive. We therefore have

$$0 \le \kappa \le 1 \tag{18}$$

If κ is zero, the optimum filter operates perfectly in the sense that there is

complete agreement between the estimate yopt(n) at the filter output and the

desired response d(n). On the other hand, if κ is unity, there is no agreement

whatsoever between these two quantities; this corresponds to the worst possible

situation.
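Continuing the numeric sketch from Section 2.2 (again an added illustration, reusing the variables X, w, and dn defined there), κ can be estimated directly from sample variances:

yopt  = X * w;                     % optimum filter output
kappa = 1 - var(yopt)/var(dn);     % normalized MSE, eq. (17)
% kappa is close to 0 here because d(n) is largely predictable from x(n);
% it would approach 1 if x(n) carried no information about d(n).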

2.4 Standard Speech Enhancement Technique

Let the noisy signal be expressed as

$$y(n) = s(n) + d(n) \tag{1}$$

where s(n) is the original clean speech signal and d(n) is the additive random noise signal, uncorrelated with the original signal. Taking the DFT of the observed signal gives

$$Y(m, k) = S(m, k) + D(m, k) \tag{2}$$

where m = 1, 2, ..., M is the frame index, k = 1, 2, ..., K is the frequency bin index, M is the total number of frames, and K is the frame length; Y(m, k), S(m, k), and D(m, k) represent the short-time spectral components of y(n), s(n), and d(n), respectively. The clean speech spectrum estimate $\hat{S}(m, k)$ is obtained by multiplying the noisy speech spectrum by a filter gain function, as given in the equation

$$\hat{S}(m, k) = H(m, k)\, Y(m, k) \tag{3}$$

where H(m, k) is the noise suppression filter gain function (the conventional Wiener filter (WF)), which is derived according to the MMSE estimator.

2.5 Formula For Gain


H(m, k) is given by

$$H(m, k) = \frac{\xi(m, k)}{1 + \xi(m, k)} \tag{4}$$

where ξ(m, k) is the a priori SNR, defined as

$$\xi(m, k) = \frac{\Gamma_s(m, k)}{\Gamma_d(m, k)} \tag{5}$$

Here $\Gamma_d(m, k) = E\{|D(m, k)|^2\}$ and $\Gamma_s(m, k) = E\{|S(m, k)|^2\}$ represent the estimated noise power spectrum and the clean speech power spectrum, respectively. The a posteriori SNR is given by

$$\gamma(m, k) = \frac{|Y(m, k)|^2}{\Gamma_d(m, k)} \tag{6}$$

An estimate $\hat{\xi}(m, k)$ of ξ(m, k) is given by the well-known decision-directed approach and is expressed as

$$\hat{\xi}(m, k) = \alpha \frac{|H(m-1, k)\, Y(m-1, k)|^2}{\Gamma_d(m, k)} + (1 - \alpha)\, P[V(m, k)] \tag{7}$$

where V(m, k) = γ(m, k) − 1, P[x] = x if x ≥ 0, and P[x] = 0 otherwise. The noise suppression gain function is chosen as the Wiener filter.
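The following minimal MATLAB sketch (added for illustration; the bin count, the toy spectra, and the smoothing constant alpha = 0.98 are assumptions, not values fixed by the text) implements equations (4)-(7) for a single frame:

% Minimal sketch of eqs. (4)-(7) for one frame (hypothetical inputs).
K     = 128;                         % number of frequency bins
Ym    = randn(K,1) + 1i*randn(K,1);  % current noisy spectrum Y(m,k)
Gd    = ones(K,1);                   % noise power spectrum estimate
Sprev = zeros(K,1);                  % H(m-1,k).*Y(m-1,k) from the last frame

alpha = 0.98;                        % decision-directed smoothing
gamma = abs(Ym).^2 ./ Gd;            % a posteriori SNR, eq. (6)
V     = max(gamma - 1, 0);           % P[gamma(m,k) - 1]
xi    = alpha*abs(Sprev).^2./Gd + (1-alpha)*V;  % a priori SNR, eq. (7)
H     = xi ./ (1 + xi);              % Wiener gain, eq. (4)
Shat  = H .* Ym;                     % enhanced spectrum, eq. (3)
Sprev = Shat;                        % carried over to frame m+1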

CHAPTER 3

3.1 PERCEPTUAL SPEECH ENHANCEMENT

Although Wiener filtering reduces the level of musical noise, it does not eliminate it [15]. Musical noise still exists and is perceptually annoying. In an effort to make the residual noise perceptually inaudible, many perceptual speech enhancement methods have been proposed which incorporate the masking properties of the human auditory system [2-9]. In these methods, the residual noise is shaped according to an estimate of the signal masking threshold [9, 13]. Fig. 3.1 depicts the complete block diagram of the proposed speech enhancement method.

3.2 Gain of Perceptual Wiener filter (PWF)

The perceptual Wiener filter (PWF) gain function $H_1(m, k)$ is calculated based on a cost function J, which is defined as

$$J = E\left[|S(m, k) - \hat{S}(m, k)|^2\right] \tag{8}$$

Substituting (2) and (3) in (8) results in

$$J = E\left\{|(H_1(m, k) - 1)\, S(m, k) + H_1(m, k)\, D(m, k)|^2\right\} = d_i + r_i \tag{9}$$

where $d_i = (H_1(m, k) - 1)^2 E[|S(m, k)|^2]$ and $r_i = H_1^2(m, k)\, E[|D(m, k)|^2]$ represent the speech distortion energy and the residual noise energy, respectively.


To make this residual noise inaudible, the residual noise should be less than the auditory masking threshold T(m, k). This constraint is given by

$$r_i \le T(m, k) \tag{10}$$

By including the above constraint and substituting $\Gamma_d(m, k) = E\{|D(m, k)|^2\}$ and $\Gamma_s(m, k) = E\{|S(m, k)|^2\}$ in (9), the cost function becomes

$$J = (H_1(m, k) - 1)^2\, \Gamma_s(m, k) + H_1^2(m, k)\, \max\left[(\Gamma_d(m, k) - T(m, k)),\, 0\right] \tag{11}$$

The desired perceptual modification of the Wiener filter is obtained by differentiating J with respect to $H_1(m, k)$ and equating the result to zero. The resulting perceptually defined Wiener filter gain function is given by

$$H_1(m, k) = \frac{\Gamma_s(m, k)}{\Gamma_s(m, k) + \max(\Gamma_d(m, k) - T(m, k),\, 0)} \tag{12}$$

By multiplying and dividing the denominator term in (12) by $\Gamma_d(m, k)$, $H_1(m, k)$ becomes

$$H_1(m, k) = \frac{\hat{\xi}(m, k)}{\hat{\xi}(m, k) + \dfrac{\max(\Gamma_d(m, k) - T(m, k),\, 0)}{\Gamma_d(m, k)}} \tag{13}$$

T(m, k) is the noise masking threshold, which is estimated from the noisy speech spectrum based on [16]. The a priori SNR and the noise power spectrum were estimated using the two-step a priori SNR estimator proposed in [15] and the weighted noise estimation method proposed in [17], respectively.
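A corresponding sketch of the PWF gain in (13) is given below (again an added illustration, reusing xi and Gd from the previous sketch; in practice the masking threshold T would come from a routine such as noisemaskingthreshold in Appendix I, while here it is a hypothetical constant vector):

% Minimal sketch of the PWF gain of eq. (13) for one frame.
T       = 0.5*ones(size(Gd));      % hypothetical masking threshold
audible = max(Gd - T, 0) ./ Gd;    % only noise above the mask is targeted
H1      = xi ./ (xi + audible);    % perceptual Wiener gain, eq. (13)
% Where Gd <= T the noise is already masked, audible = 0 and H1 = 1,
% i.e. those spectral components are left untouched.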

3.3 Block Diagram Of PWF

[Block diagram: the noisy signal is windowed and transformed by the FFT; a noise estimate drives the noise masking threshold (NMT) and amplitude estimation; the PWF gain is weighted using the absolute threshold of hearing (ATH) to form the weighted PWF (WPWF); the weighted spectrum is combined with the noisy phase and reconstructed by the IFFT with overlap-add to give the enhanced signal.]

FIG 3.1 Block diagram of the proposed weighted perceptual Wiener filter

3.4 Weighted PWF


Although perceptual speech enhancement methods perform better than non-perceptual methods, most of them still leave annoying residual musical noise. The enhanced speech signal obtained using the above-mentioned perceptual Wiener filter still contains some residual noise, because only the noise above the noise masking threshold is filtered, while the noise below the noise masking threshold remains. This can limit the performance of a perceptual speech enhancement method that processes audible noise only.

In order to overcome this drawback, we propose to weight the perceptual Wiener filter using a psychoacoustically motivated weighting filter, which is given by

$$W(m, k) = \begin{cases} H(m, k), & \text{if } ATH(m, k) < \Gamma_d(m, k) \le T(m, k) \\ 1, & \text{otherwise} \end{cases} \tag{15}$$

where ATH(m, k) is the absolute threshold of hearing. This weighting factor is used to weight the perceptual Wiener filter. The gain function $H_2(m, k)$ of the proposed weighted perceptual Wiener filter is given by

$$H_2(m, k) = H_1(m, k)\, W(m, k) \tag{16}$$
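A minimal sketch of the weighting in (15) and (16), continuing the previous sketches, is shown below (ATH is a hypothetical per-bin vector here; Appendix I contains the absoluteth routine used for the scalar absolute threshold in the actual implementation):

% Minimal sketch of the psychoacoustic weighting, eqs. (15)-(16).
ATH       = 1e-3*ones(size(Gd));      % hypothetical ATH values
W         = ones(size(H1));           % default branch of eq. (15)
masked    = (ATH < Gd) & (Gd <= T);   % noise audible in quiet but masked
W(masked) = H(masked);                % apply the Wiener gain there
H2        = H1 .* W;                  % weighted PWF gain, eq. (16)
Shat2     = H2 .* Ym;                 % enhanced spectrum for this frame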

CHAPTER 4

APPLICATIONS OF PWF

4.1 In-car systems


Typically, a manual control input, for example by means of a finger control on the steering wheel, enables the speech recognition system, and this is signalled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for recognition.

Simple voice commands may be used to initiate phone calls, select radio stations, or play music from a compatible smartphone, MP3 player, or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recent car models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to memorize a set of fixed command words.

4.2 Health care

In the health care sector, speech recognition can be implemented in the front end or the back end of the medical documentation process. Front-end speech recognition is where the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. Back-end or deferred speech recognition is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and the report finalised. Deferred speech recognition is widely used in the industry currently.

One of the major issues relating to the use of speech recognition in healthcare is that the American Recovery and Reinvestment Act of 2009 (ARRA) provides for substantial financial benefits to physicians who utilize an EMR according to "Meaningful Use" standards. These standards require that a substantial amount of data be maintained by the EMR (now more commonly referred to as an Electronic Health Record or EHR). The use of speech recognition is more naturally suited to the generation of narrative text, as part of a radiology/pathology interpretation, progress note, or discharge summary: the ergonomic gains of using speech recognition to enter structured discrete data (e.g., numeric values or codes from a list or a controlled vocabulary) are relatively minimal for people who are sighted and who can operate a keyboard and mouse.

A more significant issue is that most EHRs have not been expressly tailored to take advantage of voice-recognition capabilities. A large part of the clinician's interaction with the EHR involves navigation through the user interface using menus and tab/button clicks, and is heavily dependent on keyboard and mouse: voice-based navigation provides only modest ergonomic benefits. By contrast, many highly customized systems for radiology or pathology dictation implement voice "macros", where the use of certain phrases, e.g., "normal report", will automatically fill in a large number of default values and/or generate boilerplate, which will vary with the type of the exam, e.g., a chest X-ray vs. a gastrointestinal contrast series for a radiology system.

4.3 Usage in education and daily life

Speech recognition can be useful for learning a second language: it can teach proper pronunciation, in addition to helping a person develop fluency in their speaking skills [10].

Students who are blind (see Blindness and education) or have very low vision

can benefit from using the technology to convey words and then hear the

computer recite them, as well as use a computer by commanding with their

voice, instead of having to look at the screen and keyboard.

Students who are physically disabled or who suffer from repetitive strain injury or other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with a scribe on school assignments by using speech-to-text programs. They can also utilize speech recognition technology to freely enjoy searching the Internet or using a computer at home without having to physically operate a mouse and keyboard.

CHAPTER 5

SIMULATION RESULTS

To evaluate and compare the performance of the proposed speech enhancement scheme, simulations are carried out with NOIZEUS, a noisy speech corpus for the evaluation of speech enhancement algorithms [18]. The noisy database contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight different real-world noises at different SNRs.

5.1 Segmental SNR


Speech signals were degraded with different types of noise at global SNR levels of 0 dB, 5 dB, 10 dB, and 15 dB. In this evaluation only five noises are considered, namely babble, car, train, airport, and street noise. The objective quality measures used for the evaluation of the proposed speech enhancement method are the segmental SNR and PESQ measures [19]. It is well known that the segmental SNR indicates speech distortion more accurately than the overall SNR; a higher segmental SNR indicates weaker speech distortion. A higher PESQ score indicates better perceived quality of the enhanced signal [19]. The performance of the proposed method is compared with the Wiener filter and the perceptual Wiener filter.
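For reference, the frame-wise segmental SNR used here can be condensed as follows (a sketch distilled from the evaluation program in Appendix I; the clipping limits of -10 dB and 35 dB are taken from that program):

% Condensed sketch of the segmental SNR between clean s and enhanced sh.
function segsnr = segSNR(s, sh, W, step)
    segsnr = [];
    for a = 1:step:(length(s) - W + 1)
        f = a:(a + W - 1);
        r = 10*log10(sum(s(f).^2) / (sum((s(f) - sh(f)).^2) + eps) + eps);
        segsnr(end+1) = min(max(r, -10), 35);   % clip to [-10, 35] dB
    end
    segsnr = mean(segsnr);   % average over all frames
end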

Noise Type   Input SNR (dB)   WF       PWF      Proposed method
Babble       0                -4.59    -0.61     0.22
             5                -1.39     0.01     0.32
             10                0.02     0.65     2.14
             15                0.75     2.71     3.97
Car          0                -3.93    -0.24     0.85
             5                -1.65     0.52     1.20
             10                0.69     0.70     2.37
             15                0.72     2.31     3.81
Train        0                -3.45    -0.49     0.15
             5                -0.86     0.38     0.43
             10               -0.39     0.77     2.20
             15                0.75     2.62     3.5
Airport      0                -4.37    -0.24     0.19
             5                -2.57     0.15     0.43
             10               -0.06     0.14     1.09
             15                0.75     1.88     3.65
Street       0                -2.88    -0.15     0.08
             5                -2.13     0.61     0.73
             10                0.69     1.20     2.70
             15                0.77     2.25     3.42

Table 1. Segmental SNR values of the enhanced signal (dB)

5.2 EXPLANATION OF PESQ

The simulation results are summarized in Table 1 and Table 2. The proposed method leads to better denoising quality in the time domain, and the largest improvements are obtained at high noise levels. The time-frequency distribution of speech signals provides more accurate information about the residual noise and speech distortion than the corresponding time-domain waveforms. We compared the spectrograms for each of the methods and confirmed a reduction of both the residual noise and the speech distortion. Fig. 5.1 shows the spectrograms of the clean speech signal, the noisy signal, and the enhanced speech signals.

Noise Type   Input SNR (dB)   WF       PWF      Proposed method
Babble       0                1.221    0.952    1.427
             5                1.728    1.750    1.836
             10               2.034    2.276    2.402
             15               2.127    2.609    2.718
Car          0                1.165    1.439    1.734
             5                1.694    1.697    2.107
             10               1.921    2.168    2.318
             15               2.265    2.645    3.127
Train        0                1.450    1.482    1.731
             5                1.680    1.715    2.133
             10               2.009    2.096    2.479
             15               2.040    2.032    2.714
Airport      0                1.472    1.561    1.759
             5                1.492    1.769    2.242
             10               2.025    2.413    2.538
             15               2.249    2.579    2.715
Street       0                1.636    1.782    1.817
             5                1.679    1.857    1.968
             10               2.119    2.260    2.392
             15               2.380    2.573    2.683

Table 2. PESQ values of the enhanced signals

5.3 Spectrograms

FIG 5.1 Spectrograms of the clean speech signal, the noisy signal, and the enhanced speech signals

CHAPTER 6

ADVANTAGES

1. Many well-documented implementations exist (LMS, RLS, Kalman filters).

2. They are optimal in the sense that they minimize the mean-squared estimation error.

3. They can be computed in real time.

DISADVANTAGES

 They assume the process dynamics are linear.

 They provide only a point estimate.

 They can only handle processes with additive, unimodal noise.

CHAPTER 7

CONCLUSION

In this work, an effective approach for suppressing the musical noise that remains after Wiener filtering has been introduced. Based on the perceptual properties of the human auditory system, a weighting factor accentuates the denoising process when noise is perceptually insignificant and prevents residual noise components from becoming audible in the absence of adjacent maskers. When the speech signal is additively corrupted by babble noise and car noise, objective measure results showed the improvement brought by the proposed method in comparison with some recent filtering techniques of the same type.

SCOPE OF PWF

It is observed that the perceptual Wiener filter can be effectively used to remove musical noise from the corrupted original signal, even below the masking threshold level. It can therefore be used for a wide variety of applications, irrespective of the complexity of the signal distortion.

REFERENCES
[1] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1109-1121, Dec. 1984.
[2] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," Proc. ICASSP, 1979, vol. I, pp. 208-211.
[3] N. Virag, "Single channel speech enhancement based on masking properties of the human auditory system," IEEE Trans. Speech and Audio Processing, vol. 7, pp. 126-137, 1999.
[4] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech and Audio Processing, vol. 3, pp. 251-266, 1995.
[5] Y. Hu and P. Loizou, "Incorporating a psychoacoustic model in frequency domain speech enhancement," IEEE Signal Processing Letters, vol. 11, no. 2, pp. 270-273, 2004.
[6] F. Jabloun and B. Champagne, "Incorporating the human hearing properties in the signal subspace approach for speech enhancement," IEEE Trans. Speech and Audio Processing, vol. 11, pp. 700-708, 2003.
[7] Y. M. Cheng and D. O'Shaughnessy, "Speech enhancement based conceptually on auditory evidence," IEEE Trans. Signal Processing, vol. 39, no. 9, pp. 1943-1954, 1991.
[8] D. Tsoukalas, M. Paraskevas, and J. Mourjopoulos, "Speech enhancement using psychoacoustic criteria," Proc. IEEE ICASSP, pp. 359-362, Minneapolis, MN, 1993.
[9] Y. Hu and P. C. Loizou, "A perceptually motivated approach for speech enhancement," IEEE Trans. Speech and Audio Processing, pp. 457-465, Sept. 2003.
[10] L. Lin, W. H. Holmes, and E. Ambikairajah, "Speech denoising using perceptual modification of Wiener filtering," IEE Electronics Letters, vol. 38, pp. 1486-1487, Nov. 2002.
[11] C. Beaugeant, V. Turbin, P. Scalart, and A. Gilloire, "New optimal filtering approaches for hands-free telecommunication terminals," Signal Processing, vol. 64, no. 1, pp. 33-47, Jan. 1998.
[12] T. Lee and Kaisheng Yao, "Speech enhancement by perceptual filter with sequential noise parameter estimation," Proc. ICASSP, vol. I, pp. 693-696, 2004.

APPENDIX I

SOURCE CODES

1 Absolute Threshold Program:

function TH=absoluteth()
% Returns the power of a 4 kHz probe tone whose amplitude corresponds to
% one bit of 16-bit quantization; used as the absolute threshold floor.
BLOCK=280;
FS=8000;
BITS=16;
p=sin(2*pi*[1:BLOCK]*4000/FS)*(1/(2^BITS)); %4 kHz tone scaled to 1 bit in amplitude
TH=max(abs(fft(p)).^2);

2 Berouti Program:

signal=wavread('sp01_airport_sn0');
fs=8000;
IS=.25;  %initial silence (noise only) length in seconds
% (the nargin/isstruct guard of the original function does not apply in
% this script, so IS is set directly)

W=fix(.025*fs); %Window length is 25 ms

nfft=W;

SP=.4;  %Shift percentage is 40% (10 ms); Overlap-Add method works well with this value (.4)

wnd=hamming(W);

% (The compatibility section of the original function, guarded by
% nargin>=3 & isstruct(IS), is omitted here: nargin is not defined
% inside a script, and IS is set directly above.)

NIS=fix((IS*fs-W)/(SP*W)+1);  %number of initial silence segments
Gamma=2;  %Magnitude power (1 for magnitude spectral subtraction, 2 for power spectrum subtraction)
%Change Gamma to 1 to get a completely different performance

y=segment(signal,W,SP,wnd);

Y=fft(y,nfft);

YPhase=angle(Y(1:floor(end/2)+1,:)); %Noisy Speech Phase

Y=abs(Y(1:floor(end/2)+1,:)).^Gamma; %Spectrogram

numberOfFrames=size(Y,2);

FreqResol=size(Y,1);

N=mean(Y(:,1:NIS)')'; %initial Noise Power Spectrum mean

NoiseCounter=0;

NoiseLength=9;  %smoothing factor for the noise updating

Beta=.03;

minalpha=1;

maxalpha=3;

minSNR=-5;

maxSNR=20;

alphaSlope=(minalpha-maxalpha)/(maxSNR-minSNR);
alphaShift=maxalpha-alphaSlope*minSNR;

BN=Beta*N;

for i=1:numberOfFrames

    [NoiseFlag, SpeechFlag, NoiseCounter, Dist]=vad(Y(:,i).^(1/Gamma),N.^(1/Gamma),NoiseCounter);  %Magnitude spectrum distance VAD

if SpeechFlag==0

        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1);  %Update and smooth noise

BN=Beta*N;

end

    SNR=10*log10(Y(:,i)./N);  %a posteriori SNR in dB

alpha=alphaSlope*SNR+alphaShift;

alpha=max(min(alpha,maxalpha),minalpha);

    D=Y(:,i)-alpha.*N;  %Nonlinear (non-uniform) power spectrum subtraction
    X(:,i)=max(D,BN);   %if BN>D, X=BN, else X=D; sets very small subtraction results to an attenuated version of the input power spectrum
end

output=OverlapAdd2(X.^(1/Gamma),YPhase,W,SP*W);

3 Evaluation Program

sample_rate=8000;
wave1='sp03';
clean_speech=wavread(wave1);
% wave='sp03_train_sn15';
% processed_speech=wavread(wave);
load signal
d1=signal;
load output
d=output;
lens=length(clean_speech);
lenes=length(d);
if lens>lenes
    out=[output' zeros(1,(lens-lenes))];  %zero-pad the processed signal
    processed_speech=out';
else
    out1=[clean_speech' zeros(1,(lenes-lens))];  %zero-pad the clean signal
    clean_speech=out1';
    processed_speech=d;
end
clean_length = length(clean_speech);
processed_length = length(processed_speech);
% if (clean_length ~= processed_length)
%     disp('Error: Files must have same length.');
%     return
% end
%----------------------------------------------------------------------
% Global Variables
%----------------------------------------------------------------------
sample_rate = 8000;  % default sample rate
samplin_rate=8000;
winlength=fix(0.03*samplin_rate);  %frame length
% M=0.4*fs;
M=fix(0.5*winlength);
L=winlength-M-1;
% L=numberOfFrames;
skiprate = floor(clean_length/L);
% N=fix(0.025*sample_rate);
% winlength = 240;  % window length in samples
% winlength = N;
% skiprate = 60;    % window skip in samples
% winlength = round(30*sample_rate/1000);  %240; window length in samples
% skiprate = floor(winlength/4);           % window skip in samples
max_freq = sample_rate/2;  % maximum bandwidth
num_crit = 25;             % number of critical bands
USE_FFT_SPECTRUM = 1;      % 0 defaults to 10th-order LP spectrum
% n_fft = 512;             % FFT size
n_fft = 2^nextpow2(2*winlength);
n_fftby2 = n_fft/2;        % FFT size/2
Kmax = 15;                 % value suggested by Klatt, pg 1280
Klocmax = 1;               % value suggested by Klatt, pg 1280

%----------------------------------------------------------------------
% Critical Band Filter Definitions (Center Frequency and Bandwidths in Hz)
%----------------------------------------------------------------------

cent_freq(1) = 50.0000; bandwidth(1) = 70.0000;

cent_freq(2) = 120.000; bandwidth(2) = 70.0000;

cent_freq(3) = 190.000; bandwidth(3) = 70.0000;

cent_freq(4) = 260.000; bandwidth(4) = 70.0000;

cent_freq(5) = 330.000; bandwidth(5) = 70.0000;

cent_freq(6) = 400.000; bandwidth(6) = 70.0000;

cent_freq(7) = 470.000; bandwidth(7) = 70.0000;

cent_freq(8) = 540.000; bandwidth(8) = 77.3724;

cent_freq(9) = 617.372; bandwidth(9) = 86.0056;

cent_freq(10) = 703.378; bandwidth(10) = 95.3398;

cent_freq(11) = 798.717; bandwidth(11) = 105.411;

cent_freq(12) = 904.128; bandwidth(12) = 116.256;

cent_freq(13) = 1020.38; bandwidth(13) = 127.914;

cent_freq(14) = 1148.30; bandwidth(14) = 140.423;

cent_freq(15) = 1288.72; bandwidth(15) = 153.823;

cent_freq(16) = 1442.54; bandwidth(16) = 168.154;

cent_freq(17) = 1610.70; bandwidth(17) = 183.457;

cent_freq(18) = 1794.16; bandwidth(18) = 199.776;

cent_freq(19) = 1993.93; bandwidth(19) = 217.153;

cent_freq(20) = 2211.08; bandwidth(20) = 235.631;

cent_freq(21) = 2446.71; bandwidth(21) = 255.255;

cent_freq(22) = 2701.97; bandwidth(22) = 276.072;

cent_freq(23) = 2978.04; bandwidth(23) = 298.126;

cent_freq(24) = 3276.17; bandwidth(24) = 321.465;

cent_freq(25) = 3597.63; bandwidth(25) = 346.136;

bw_min = bandwidth(1);  % minimum critical bandwidth
%----------------------------------------------------------------------
% Set up the critical band filters. Note that Gaussianly shaped filters
% are used. Also, the sum of the filter weights is equivalent for each
% critical band filter. Filter values less than -30 dB are set to zero.
%----------------------------------------------------------------------
min_factor = exp(-30.0 / (2.0 * 2.303));  % -30 dB point of filter

for i = 1:num_crit
    f0 = (cent_freq(i) / max_freq) * (n_fftby2);
    all_f0(i) = floor(f0);
    bw = (bandwidth(i) / max_freq) * (n_fftby2);
    norm_factor = log(bw_min) - log(bandwidth(i));
    j = 0:1:n_fftby2-1;
    crit_filter(i,:) = exp(-11*(((j - floor(f0))./bw).^2) + norm_factor);
    crit_filter(i,:) = crit_filter(i,:).*(crit_filter(i,:) > min_factor);
end

%----------------------------------------------------------------------
% For each frame of input speech, calculate the Weighted Spectral
% Slope (WSS) Measure
%----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate);  % number of frames
start = 1;                                                % starting sample
% window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
window=hamming(winlength);

for frame_count = 1:num_frames

    %------------------------------------------------------------------
    % (1) Get the frames for the test and reference speech.
    %     Multiply by Hanning window.
    %------------------------------------------------------------------
    clean_frame = clean_speech(start:start+winlength-1);
    processed_frame = processed_speech(start:start+winlength-1);
    % clean_frame = clean_frame.*window';
    % processed_frame = processed_frame.*window;

    %------------------------------------------------------------------
    % (2) Compute the power spectrum of clean and processed
    %------------------------------------------------------------------
    if (USE_FFT_SPECTRUM)
        clean_spec = (abs(fft(clean_frame,n_fft)).^2);
        processed_spec = (abs(fft(processed_frame,n_fft)).^2);
    else
        a_vec = zeros(1,n_fft);
        a_vec(1:11) = lpc(clean_frame,10);
        clean_spec = 1.0/(abs(fft(a_vec,n_fft)).^2)';
        a_vec = zeros(1,n_fft);
        a_vec(1:11) = lpc(processed_frame,10);
        processed_spec = 1.0/(abs(fft(a_vec,n_fft)).^2)';
    end

    %------------------------------------------------------------------
    % (3) Compute filterbank output energies (in dB scale)
    %------------------------------------------------------------------
    for i = 1:num_crit
        clean_energy(i) = sum(clean_spec(1:n_fftby2) ...
            .*crit_filter(i,:)');
        processed_energy(i) = sum(processed_spec(1:n_fftby2) ...
            .*crit_filter(i,:)');
    end
    clean_energy = 10*log10(max(clean_energy,1E-10));
    processed_energy = 10*log10(max(processed_energy,1E-10));

    %------------------------------------------------------------------
    % (4) Compute spectral slope (dB[i+1]-dB[i])
    %------------------------------------------------------------------
    clean_slope = clean_energy(2:num_crit) - ...
        clean_energy(1:num_crit-1);
    processed_slope = processed_energy(2:num_crit) - ...
        processed_energy(1:num_crit-1);

    %------------------------------------------------------------------
    % (5) Find the nearest peak locations in the spectra to each
    %     critical band. If the slope is negative, we search to the
    %     left; if positive, we search to the right.
    %------------------------------------------------------------------
    for i = 1:num_crit-1
        % find the peaks in the clean speech signal
        if (clean_slope(i)>0)       % search to the right
            n = i;
            while ((n<num_crit) & (clean_slope(n) > 0))
                n = n+1;
            end
            clean_loc_peak(i) = clean_energy(n-1);
        else                        % search to the left
            n = i;
            while ((n>0) & (clean_slope(n) <= 0))
                n = n-1;
            end
            clean_loc_peak(i) = clean_energy(n+1);
        end
        % find the peaks in the processed speech signal
        if (processed_slope(i)>0)   % search to the right
            n = i;
            while ((n<num_crit) & (processed_slope(n) > 0))
                n = n+1;
            end
            processed_loc_peak(i) = processed_energy(n-1);
        else                        % search to the left
            n = i;
            while ((n>0) & (processed_slope(n) <= 0))
                n = n-1;
            end
            processed_loc_peak(i) = processed_energy(n+1);
        end
    end

    %------------------------------------------------------------------
    % (6) Compute the WSS measure for this frame. This includes
    %     determination of the weighting function.
    %------------------------------------------------------------------
    dBMax_clean = max(clean_energy);
    dBMax_processed = max(processed_energy);
    % The weights are calculated by averaging individual weighting
    % factors from the clean and processed frame. These weights,
    % W_clean and W_processed, should range from 0 to 1 and place more
    % emphasis on spectral peaks and less emphasis on slope differences
    % in spectral valleys. This procedure is described on page 1280 of
    % Klatt's 1982 ICASSP paper.

    Wmax_clean = Kmax ./ (Kmax + dBMax_clean - ...
        clean_energy(1:num_crit-1));
    Wlocmax_clean = Klocmax ./ (Klocmax + clean_loc_peak - ...
        clean_energy(1:num_crit-1));
    W_clean = Wmax_clean .* Wlocmax_clean;
    Wmax_processed = Kmax ./ (Kmax + dBMax_processed - ...
        processed_energy(1:num_crit-1));
    Wlocmax_processed = Klocmax ./ (Klocmax + processed_loc_peak - ...
        processed_energy(1:num_crit-1));
    W_processed = Wmax_processed .* Wlocmax_processed;
    W = (W_clean + W_processed)./2.0;
    distortion(frame_count) = sum(W.*(clean_slope(1:num_crit-1) - ...
        processed_slope(1:num_crit-1)).^2);

    % This normalization is not part of Klatt's paper, but helps to
    % normalize the measure. Here we scale the measure by the sum of
    % the weights.
    wssdistortion(frame_count) = distortion(frame_count)/sum(W);
    start = start + skiprate;
end
Fwss=mean(wssdistortion);

% Outputsnr = 10*log10(sum(processed_speech.^2)/sum((clean_speech-processed_speech).^2));
% Overall_snr = 10*log10(sum(clean_speech.^2)/sum((clean_speech-processed_speech).^2));

%----------------------------------------------------------------------
% Global Variables
%----------------------------------------------------------------------
% samplin_rate=8000;
% winlength=fix(0.025*samplin_rate);  %frame length
% M=fix(0.4*winlength);
% L=winlength-M-1;
% L=numberOfFrames;
% skiprate = floor(length(signal)/L);
sample_rate = 8000;                      % default sample rate
winlength = 240;                         % window length in samples
skiprate = 60;                           % window skip in samples
winlength = round(30*sample_rate/1000);  % 240; window length in samples
skiprate = floor(winlength/4);           % window skip in samples
MIN_SNR = -10;                           % minimum SNR in dB
MAX_SNR = 35;                            % maximum SNR in dB

%----------------------------------------------------------------------
% For each frame of input speech, calculate the segmental SNR
%----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate);  % number of frames
start = 1;                                                % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));

for frame_count = 1:num_frames

    %------------------------------------------------------------------
    % (1) Get the frames for the test and reference speech.
    %     Multiply by Hanning window.
    %------------------------------------------------------------------
    clean_frame = clean_speech(start:start+winlength-1);
    processed_frame = processed_speech(start:start+winlength-1);
    clean_frame = clean_frame.*window;
    processed_frame = processed_frame.*window;

    %------------------------------------------------------------------
    % (2) Compute the segmental SNR
    %------------------------------------------------------------------
    signal_energy = sum(clean_frame.^2);
    noise_energy = sum((clean_frame-processed_frame).^2);
    segmental_snr(frame_count) = 10*log10(signal_energy/(noise_energy+eps)+eps);
    segmental_snr(frame_count) = max(segmental_snr(frame_count),MIN_SNR);
    segmental_snr(frame_count) = min(segmental_snr(frame_count),MAX_SNR);
    start = start + skiprate;
end
Segsnr=mean(segmental_snr);

% function distortion = llr(clean_speech, processed_speech, sample_rate)
%----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
%----------------------------------------------------------------------
% clean_length = length(clean_speech);
% processed_length = length(processed_speech);
% if (clean_length ~= processed_length)
%     disp('Error: Both speech files must be the same length.');
%     return
% end
%----------------------------------------------------------------------
% Global Variables
%----------------------------------------------------------------------

sample_rate = 8000;                    % default sample rate
% winlength = 240;                     % window length in samples
% skiprate = 60;                       % window skip in samples
P = 10;                                % LPC analysis order
winlength = round(0.025*sample_rate);  % window length in samples
skiprate = floor(winlength/4);         % window skip in samples
if sample_rate<10000
    P = 2;                             % LPC analysis order
else
    P = 16;                            % this could vary depending on sampling frequency
end

%----------------------------------------------------------------------
% For each frame of input speech, calculate the Log Likelihood Ratio
%----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate);  % number of frames
start = 1;                                                % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));

for frame_count = 1:num_frames

    %------------------------------------------------------------------
    % (1) Get the frames for the test and reference speech.
    %     Multiply by Hanning window.
    %------------------------------------------------------------------
    clean_frame = clean_speech(start:start+winlength-1);
    processed_frame = processed_speech(start:start+winlength-1);
    clean_frame = clean_frame.*window;
    processed_frame = processed_frame.*window;

    %------------------------------------------------------------------
    % (2) Get the autocorrelation lags and LPC parameters used to
    %     compute the LLR measure.
    %------------------------------------------------------------------
    [R_clean, Ref_clean, A_clean] = ...
        lpcoeff(clean_frame, P);
    [R_processed, Ref_processed, A_processed] = ...
        lpcoeff(processed_frame, P);

    %------------------------------------------------------------------
    % (3) Compute the LLR measure
    %------------------------------------------------------------------
    numerator = A_processed*toeplitz(R_clean)*A_processed';
    denominator = A_clean*toeplitz(R_clean)*A_clean';
    llrdistortion(frame_count) = log(numerator/denominator);
    start = start + skiprate;
end
Llr=mean(llrdistortion);

4 Noise Masking Program

function T=noisemaskingthreshold(a)
FS=8000;
% signal=wavread('sp01_airport_sn10');  %open wavefile and store in data vector
a2=fft(a);
Sp=abs(a2).^2;
BLOCK=280;
BITS=16;
ZT=ceil(barkme2(FS/2));      %total number of critical bands in each frame
B=(10.^(spreadfn(1,8)/10));  %spreading function, over 8 critical bands
Pz = zeros(1,ZT);            %total number of points in each critical band
vlimit = zeros(2,ZT);        %INF - SUP limits in each critical band
vz = zeros(1,BLOCK/2);       %mapping from index -> bark domain
vf = zeros(1,BLOCK/2);       %mapping from index -> frequency domain f
vlimit(1,1) = 1;             %the first and last limits are fixed
vlimit(2,ZT) = BLOCK/2;
for ii=1:BLOCK/2
    f=((ii-1) * FS) / BLOCK; %convert index -> frequency

    z=barkme2(f);            %freq -> z
    if z==0
        z=barkme2(0.5*FS/BLOCK);  %approximation to avoid a critical band of zero
    end
    vf(ii)=f;                %frequency
    vz(ii)=z;                %bark
    Pz(ceil(z))=Pz(ceil(z))+1;  %points per critical band
    %if there is a change in the units of z, change of critical band
    if ii>1 & floor(z)-floor(vz(ii-1)) > 0
        vlimit(2,ceil(z)-1)=ii-1;
        vlimit(1,ceil(z))=ii;
    end
end

p=sin(2*pi*[1:BLOCK]*4000/FS)*(1/(2^BITS));  %4 kHz tone scaled to 1 bit in amplitude
TH=max(abs(fft(p)).^2);
%##power spectrum (final threshold uses the vector below as auxiliary)
Sp=abs(a2).^2;  %(the original referenced the undefined Sw here; a2 is the FFT of the input)
%##Energy per critical band
Spz=zeros(1,ZT);
for ii=1:ZT,
    Spz(ii)=sum(Sp([vlimit(1,ii):vlimit(2,ii)]));
end

%##Spreading across bands
Sm=conv(Spz,B);
temp=round(length(B)/2);
Sm=Sm([temp:ZT+temp-1]);  %crop vector after convolution
%##Masking threshold estimate
%SFM: spectral flatness measure
Gm=prod(Spz)^(1/ZT);      %geometric mean
Am=sum(abs(Spz))*(1/ZT);  %arithmetic mean
%SFM in dB
SFM=10*log10(Gm/Am);
SFMmax=-60;               %maximum in dB
alpha=min(SFM/SFMmax,1);  %alpha = 1 -> tone-like; alpha = 0 -> noise-like
%masking energy offset
O=alpha*(14.5+[1:ZT]+0.5) + (1-alpha)*5.5;
%raw masking threshold
Traw=10.^(log10(Sm)-(O/10));
%normalization of threshold
Tnorm=Traw./Pz;
T=Tnorm;                  %copy
below=find(Tnorm<TH);     %find components below the absolute threshold
T(below)=TH;              %replace them with the minimum value TH; T is the final threshold

MATLAB PROGRAMS FOR SPECTRUMS

1 SSB Berouti1 Program

clear all

close all

IS=.25;

fs=8000;

[data,FS,BITS]=wavread('sp01_babble_sn0');

% figure

subplot(2,2,1),plot(data);

xlabel('Time(ms)'),ylabel('amplitude');

title('Original Signal in Time Domain ')

% figure

data=data(1:length(data)/1)';
subplot(2,2,2),specgram(data,[],8000);
xlabel('Time(ms)'),ylabel('Normalized Freq.(Hz)');
title('Frequency Spectrum of Original Audio')  % spectrum of original audio
W=fix(.025*fs);  %Window length is 25 ms
nfft=W;
SP=.4;  %Shift percentage is 40% (10 ms); Overlap-Add method works well with this value (.4)
wnd=hamming(W);
NIS=fix((IS*fs-W)/(SP*W)+1);  %number of initial silence segments
Gamma=2;  %Magnitude power (1 for magnitude spectral subtraction, 2 for power spectrum subtraction)
%Change Gamma to 1 to get a completely different performance
y=segment(data,W,SP,wnd);
Y=fft(y,nfft);
YPhase=angle(Y(1:(end/2)+1,:));  %Noisy speech phase
Y=abs(Y(1:(end/2)+1,:)).^Gamma;  %Spectrogram
numberOfFrames=size(Y,2);
FreqResol=size(Y,1);
N=mean(Y(:,1:NIS)')';  %initial noise power spectrum mean
NoiseCounter=0;
NoiseLength=9;  %smoothing factor for the noise updating
Beta=.03;
minalpha=1;
maxalpha=3;
minSNR=-5;
maxSNR=20;
alphaSlope=(minalpha-maxalpha)/(maxSNR-minSNR);
alphaShift=maxalpha-alphaSlope*minSNR;
BN=Beta*N;

for i=1:numberOfFrames
    [NoiseFlag, SpeechFlag, NoiseCounter, Dist]=vad(Y(:,i).^(1/Gamma),N.^(1/Gamma),NoiseCounter);  %Magnitude spectrum distance VAD
    if SpeechFlag==0
        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1);  %Update and smooth noise
        BN=Beta*N;
    end
    SNR=10*log10(Y(:,i)./N);  %a posteriori SNR in dB
    alpha=alphaSlope*SNR+alphaShift;
    alpha=max(min(alpha,maxalpha),minalpha);
    D=Y(:,i)-alpha.*N;  %Nonlinear (non-uniform) power spectrum subtraction
    X(:,i)=max(D,BN);   %if BN>D, X=BN, else X=D; sets very small subtraction results to an attenuated version of the input power spectrum
end
output=OverlapAdd2(X.^(1/Gamma),YPhase,W,SP*W);

subplot(2,2,3),plot(output);
xlabel('Time(ms)'),ylabel('amplitude');
title('Enhanced Time Domain Signal')
subplot(2,2,4),specgram(output,[],8000);
xlabel('Time(ms)'),ylabel('Normalized Freq.(Hz)');
title('Frequency Spectrum of enhanced signal')  % spectrum of enhanced audio

2 SSB Berouti2 Program

function output=SSBerouti791(signal,fs,IS)
% OUTPUT=SSBEROUTI79(S,FS,IS)
% Nonlinear spectral subtraction based on Berouti 79. Power spectral
% subtraction with an adjustable subtraction factor; the adjustment is
% made according to the local a posteriori SNR.
% S is the noisy signal, FS is the sampling frequency and IS is the
% initial silence (noise only) length in seconds (default value is .25 sec)
% Required functions:
%   SEGMENT
%   VAD
% Sep-04
% Esfandiar Zavarehei

if (nargin<3 | isstruct(IS))
    IS=.25;  %seconds
end

W=fix(.025*fs);  %Window length is 25 ms
nfft=W;
SP=.4;  %Shift percentage is 40% (10 ms); Overlap-Add method works well with this value (.4)
wnd=hamming(W);

% IGNORE THIS SECTION FOR COMPATIBILITY WITH ANOTHER PROGRAM FROM HERE.....

if (nargin>=3 & isstruct(IS))  %This option is for compatibility with another programme
    W=IS.windowsize
    SP=IS.shiftsize/W;
    nfft=IS.nfft;
    wnd=IS.window;
    if isfield(IS,'IS')
        IS=IS.IS;
    else
        IS=.25;
    end
end
% .......IGNORE THIS SECTION FOR COMPATIBILITY WITH ANOTHER PROGRAM TO HERE

NIS=fix((IS*fs-W)/(SP*W)+1);  %number of initial silence segments
Gamma=2;  %Magnitude power (1 for magnitude spectral subtraction, 2 for power spectrum subtraction)
%Change Gamma to 1 to get a completely different performance
y=segment(signal,W,SP,wnd);
Y=fft(y,nfft);
YPhase=angle(Y(1:fix(end/2)+1,:));  %Noisy speech phase
Y=abs(Y(1:fix(end/2)+1,:)).^Gamma;  %Spectrogram
numberOfFrames=size(Y,2);
FreqResol=size(Y,1);
N=mean(Y(:,1:NIS)')';  %initial noise power spectrum mean

NoiseCounter=0;
NoiseLength=9;  %smoothing factor for the noise updating
Beta=.03;
minalpha=1;
maxalpha=3;
minSNR=-5;
maxSNR=20;
alphaSlope=(minalpha-maxalpha)/(maxSNR-minSNR);
alphaShift=maxalpha-alphaSlope*minSNR;
BN=Beta*N;

for i=1:numberOfFrames
    [NoiseFlag, SpeechFlag, NoiseCounter, Dist]=vad(Y(:,i).^(1/Gamma),N.^(1/Gamma),NoiseCounter);  %Magnitude spectrum distance VAD
    if SpeechFlag==0
        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1);  %Update and smooth noise
        BN=Beta*N;
    end
    SNR=10*log10(Y(:,i)./N);  %a posteriori SNR in dB
    alpha=alphaSlope*SNR+alphaShift;
    alpha=max(min(alpha,maxalpha),minalpha);
    D=Y(:,i)-alpha.*N;  %Nonlinear (non-uniform) power spectrum subtraction
    X(:,i)=max(D,BN);   %if BN>D, X=BN, else X=D; sets very small subtraction results to an attenuated version of the input power spectrum
end
output=OverlapAdd2(X.^(1/Gamma),YPhase,W,SP*W);

3 Wiener scalar Program

signal=wavread('sp01_airport_sn0');
fs=8000;
IS=.25;  %Initial silence (noise only) part in seconds
% (the nargin/isstruct guard of the original function does not apply in
% this script, so IS is set directly)
W=fix(.025*fs);  %Window length is 25 ms
SP=.4;  %Shift percentage is 40% (10 ms); Overlap-Add method works well with this value (.4)
wnd=hamming(W);

% (The compatibility section of the original function, guarded by
% nargin>=3 & isstruct(IS), is omitted here: nargin is not defined
% inside a script, and IS is set directly above.)

pre_emph=0;
signal=filter([1 -pre_emph],1,signal);
NIS=fix((IS*fs-W)/(SP*W)+1);  %number of initial silence segments
y=segment(signal,W,SP,wnd);   %chops the signal into frames
Y=fft(y);
YPhase=angle(Y(1:(end/2)+1,:));  %Noisy speech phase
Y=abs(Y(1:(end/2)+1,:));         %Spectrogram
numberOfFrames=size(Y,2);
FreqResol=size(Y,1);
N=mean(Y(:,1:NIS)')';             %initial noise power spectrum mean
LambdaD=mean((Y(:,1:NIS)').^2)';  %initial noise power spectrum variance
alpha=.99;  %used in smoothing xi (decision-directed estimation of the a priori SNR)
NoiseCounter=0;
NoiseLength=9;     %smoothing factor for the noise updating
G=ones(size(N));   %initial gain used in calculation of the new xi
Gamma=G;
X=zeros(size(Y));  %initialize X (memory allocation)
h=waitbar(0,'Wait...');

for i=1:numberOfFrames
    %%%%%%%%%%%%%%%% VAD and noise estimation START
    if i<=NIS   % If initial silence, ignore VAD
        SpeechFlag=0;
        NoiseCounter=100;
    else        % Else do VAD
        [NoiseFlag, SpeechFlag, NoiseCounter, Dist]=vad(Y(:,i),N,NoiseCounter);  %Magnitude spectrum distance VAD
    end
    if SpeechFlag==0  % If not speech, update noise parameters
        N=(NoiseLength*N+Y(:,i))/(NoiseLength+1);  %Update and smooth noise mean
        LambdaD=(NoiseLength*LambdaD+(Y(:,i).^2))./(1+NoiseLength);  %Update and smooth noise variance
    end
    %%%%%%%%%%%%%%%% VAD and noise estimation END
    gammaNew=(Y(:,i).^2)./LambdaD;  %A posteriori SNR
    xi=alpha*(G.^2).*Gamma+(1-alpha).*max(gammaNew-1,0);  %Decision-directed method for the a priori SNR
    Gamma=gammaNew;
    G=(xi./(xi+1));
    X(:,i)=G.*Y(:,i);  %Obtain the new cleaned value
    waitbar(i/numberOfFrames,h,num2str(fix(100*i/numberOfFrames)));
end
close(h);

output=OverlapAdd2(X,YPhase,W,SP*W);    %Overlap-add synthesis of speech
output=filter(1,[1 -pre_emph],output);  %Undo the effect of pre-emphasis
