
Speech compression techniques – Formant and CELP Vocoders


Introduction
• An earlier approach towards lossy compression is to model the source output and send the model parameters to the receiver instead of estimates of the source output.
• The receiver tries to synthesize the source output based on the received model parameters.
• Speech can be analyzed in terms of a model, and the model parameters can be extracted and transmitted to the receiver.
• At the receiver, the speech can be synthesized using the model.
• This analysis/synthesis approach was first employed by Homer Dudley at Bell Laboratories, who developed what is known as the channel vocoder.
• He developed a "speaking machine" in which the vocal tract was modeled by a flexible tube whose shape could be modified by an operator. Sound was produced by forcing air through this tube using bellows.
• Unlike speech, images are generated in a variety of different ways; therefore, the analysis/synthesis approach does not seem very useful for image or video compression.

Speech production mechanism
• Speech is produced by forcing air first through an elastic opening, the vocal cords, and then through the laryngeal, oral, nasal, and pharynx passages, and finally through the mouth and the nasal cavity.
• Everything past the vocal cords is generally referred to as the vocal tract.
• The first action generates the sound, which is then modulated into speech as it traverses the vocal tract.

Simplified model of speech synthesis
[Block diagram: an excitation source (corresponding to the sound generation) drives a vocal tract filter (which models the vocal tract).]
• At the transmitter, the speech is divided into segments. Each segment is analyzed to determine an excitation signal and the parameters of the vocal tract filter.
• In some of the schemes, a model for the excitation signal is transmitted to the receiver. The excitation signal is then synthesized at the receiver and used to drive the vocal tract filter.
• In other schemes, the excitation signal itself is obtained using an analysis-by-synthesis approach. This signal is then used by the vocal tract filter to generate the speech signal.
Channel Vocoder
• each segment of input speech is analyzed using a bank of
band-pass filters called the analysis filters.
• The energy at the output of each filter is estimated at fixed
intervals and transmitted to the receiver.
• In a digital implementation, the energy estimate may be
the average squared value of the filter output.
• In analog implementations, this is the sampled output of an
envelope detector.
• Generally, an estimate is generated 50 times every second.
• Along with the estimate of the filter output, a decision is
made as to whether the speech in that segment is voiced,
as in the case of the sounds /a/ /e/ /o/, or unvoiced, as in
the case for the sounds /s/ /f/.

The sound /e/ in test (male voice saying the word test)
• Voiced sounds tend to have a pseudoperiodic structure.
• The period of the fundamental harmonic is called the pitch period.
• The transmitter also forms an estimate of the pitch period, which is transmitted to the receiver.
The sound /s/ in test
• Unvoiced sounds tend to have a noiselike structure, e.g., the /s/ sound in the word test.

The Channel Vocoder (analyzer block diagram):
[Block diagram: the input s(n) drives a bank of bandpass filters; each branch is rectified, lowpass filtered, and A/D converted, and the resulting channel outputs, together with the outputs of a voicing detector and a pitch detector, are encoded and sent over the channel.]
The channel vocoder receiver
[Block diagram of the channel vocoder receiver.]
The Channel Vocoder (synthesizer):
• At the receiver, the vocal tract filter is implemented by a
bank of band-pass filters. The bank of filters at the
receiver, known as the synthesis filters, is identical to the
bank of analysis filters.
• Based on whether the speech segment was deemed to
be voiced or unvoiced, either a pseudonoise source or a
periodic pulse generator is used as the input to the
synthesis filter bank.
• The period of the pulse input is determined by the pitch estimate obtained at the transmitter for the segment being synthesized. The input is scaled by the energy estimate at the output of the analysis filters.
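As a rough sketch of this input stage (Python/NumPy assumed; the 8 kHz sampling rate is an assumption, while the 50 parameter updates per second follow the text above):

```python
import numpy as np

FS = 8000            # sampling rate (assumed for this sketch)
FRAME = FS // 50     # 50 parameter updates per second -> 160 samples

def excitation(voiced: bool, pitch_period: int) -> np.ndarray:
    """Generate one frame of excitation for the synthesis filter bank."""
    if voiced:
        e = np.zeros(FRAME)
        e[::pitch_period] = 1.0      # periodic pulses at the pitch period
    else:
        e = np.random.randn(FRAME)   # pseudonoise source
    return e
```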

Channel vocoder
• The channel vocoder employs a bank of bandpass filters,
– each having a bandwidth between 100 Hz and 300 Hz;
– typically, 16-20 linear-phase FIR filters are used.
• The output of each filter is rectified and lowpass filtered.
– The bandwidth of the lowpass filter is selected to match the time variations in the characteristics of the vocal tract.
• In addition to the measurement of the spectral magnitudes, a voicing detector and a pitch estimator are included in the speech analysis.
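A minimal sketch of the analysis side, assuming NumPy/SciPy are available; the sampling rate, channel count, and band edges below are illustrative choices within the ranges just quoted, not values from any particular standard:

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 8000                                  # assumed sampling rate (Hz)
N_CH = 16                                  # 16 channels (within 16-20)
EDGES = np.linspace(200, 3400, N_CH + 1)   # band edges -> 200 Hz channels

def analyze_frame(frame: np.ndarray) -> np.ndarray:
    """Return one energy estimate per bandpass channel."""
    energies = np.empty(N_CH)
    for k in range(N_CH):
        bp = firwin(101, [EDGES[k], EDGES[k + 1]], pass_zero=False, fs=FS)
        y = lfilter(bp, 1.0, frame)            # linear-phase FIR bandpass
        energies[k] = np.mean(np.abs(y) ** 2)  # rectify + average -> energy
    return energies
```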

The Phase Vocoder
• The phase vocoder is similar to the channel vocoder.
• However, instead of estimating the pitch, the phase vocoder estimates the phase derivative at the output of each filter.
• By coding and transmitting the phase derivative, this vocoder preserves the phase information.
The Phase Vocoder (analyzer block diagram):
[Block diagram, kth channel: s(n) is multiplied by cos(ωk n) and sin(ωk n); each product is lowpass filtered and decimated to give ak(n) and bk(n), from which the short-term magnitude and, via differentiators, the short-term phase derivative are computed and encoded for transmission.]
The Phase Vocoder (synthesizer block diagram, kth channel):
[Block diagram: the decoded short-term amplitude and phase derivative are interpolated; the phase derivative is integrated to recover the phase, whose cosine and sine are scaled by the amplitude and summed to form the channel output.]
The Phase Vocoder
• LPF bandwidth: 50 Hz
• Demodulation separation: 100 Hz
• Number of filters: 25-30
• Sampling rate of spectrum magnitude and phase derivative: 50-60 samples per second
• Spectral magnitude is coded using PCM or DPCM
• Phase derivative is coded linearly using 2-3 bits
• The resulting bit rate is 7200 bps
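One allocation consistent with this figure, though the slide does not spell out the split, is 30 filters × 60 samples per second × (2 bits for the magnitude + 2 bits for the phase derivative) = 7200 bps.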

Formant Vocoder
• As the vocal tract is a tube of nonuniform cross section, it resonates at a number of different frequencies. These frequencies are known as formants.
• The formant values change with different sounds; however, we can identify ranges in which they occur.
• For example, the first formant occurs in the range 200-800 Hz for a male speaker, and in the range 250-1000 Hz for a female speaker.
• The formant vocoder transmits an estimate of the formant values (usually four formants are considered sufficient) and an estimate of the bandwidth of each formant.
• At the receiver the excitation signal is passed through tunable filters that are tuned to the formant frequency and bandwidth.

Formant Vocoder
• The formant vocoder can be viewed as a type of channel vocoder that estimates the first three or four formants in a segment of speech.
• It is this information, plus the pitch period, that is encoded and transmitted to the receiver.

Formant Vocoder
• The speech can be represented as the output of a linear time-varying system whose properties vary slowly with time.
• The digital model of speech production represents voiced speech by the pitch period, the amplitude, and the lowest three formant frequencies, and unvoiced speech simply by the amplitude and a single zero and pole.
• All these parameters vary with time.
Formant Vocoder (analyzer block diagram):
[Block diagram: the input speech is analyzed to produce the formant frequencies F1, F2, F3 and their bandwidths B1, B2, B3, together with the voiced/unvoiced (V/U) decision and the pitch F0, all of which are encoded for transmission.]
Fk: the frequency of the kth formant
Bk: the bandwidth of the kth formant
Formant Vocoder (synthesizer block diagram):
[Block diagram: the excitation signal, controlled by the V/U decision and the pitch F0, drives tunable resonators set to (F1, B1), (F2, B2), and (F3, B3); their outputs are summed to produce the synthesized speech.]
Digital model for speech production
[Figure: digital model for speech production.]

Linear Predictive Coder
• The vocal tract is modeled as a single linear filter whose output $y_n$ is related to the input $\xi_n$ by
$$y_n = \sum_{i=1}^{M} a_i\, y_{n-i} + G\,\xi_n$$
where $G$ is the gain and $M$ is the filter order.
• The input to the vocal tract filter is either the output of a random noise generator or a periodic pulse generator.
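A minimal sketch of this all-pole synthesis filter, assuming SciPy; the coefficients a and the gain G would come from the analysis stage described later:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(excitation: np.ndarray, a: np.ndarray, G: float) -> np.ndarray:
    """All-pole filtering: y_n = sum_i a_i * y_{n-i} + G * xi_n."""
    denom = np.concatenate(([1.0], -np.asarray(a)))  # A(z) = 1 - sum a_i z^-i
    return lfilter([G], denom, excitation)
```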

Speech synthesis model
• The input speech is generally sampled at 8000 samples per second.
• In the LPC-10 standard, the speech is broken into 180-sample segments, corresponding to 22.5 milliseconds of speech per segment.
• The samples of the voiced speech have larger amplitude; that is, there is more energy in the voiced speech.
• Also, the unvoiced speech contains higher frequencies.
• As both speech segments have average values close to zero, this means that the unvoiced speech waveform crosses the x = 0 line more often than the voiced speech sample.
Linear Predictive Coding (LPC-10)
• In the LPC-10 algorithm, the speech segment is first lowpass filtered using a filter with a bandwidth of 1 kHz.
• The energy at the output relative to the background noise is used to obtain a tentative decision about whether the signal in the segment should be declared voiced or unvoiced.
• The estimate of the background noise is basically the energy in the unvoiced speech segments. This tentative decision is further refined by counting the number of zero crossings and checking the magnitude of the coefficients of the vocal tract filter.
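A minimal sketch of such a voiced/unvoiced decision, assuming SciPy; the 1 kHz bandwidth follows the text, while the thresholds are invented for illustration, and the refinement based on the vocal tract filter coefficients is omitted:

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 8000   # assumed sampling rate

def is_voiced(segment: np.ndarray, noise_energy: float) -> bool:
    lp = firwin(101, 1000, fs=FS)        # ~1 kHz lowpass, as in the text
    y = lfilter(lp, 1.0, segment)
    energy = np.mean(y ** 2)
    # Fraction of adjacent sample pairs whose sign changes (zero-crossing rate).
    zcr = np.mean(np.sign(segment[1:]) != np.sign(segment[:-1]))
    # Tentative decision on energy vs. background noise, refined by the
    # zero-crossing count; both thresholds are illustrative only.
    return energy > 4.0 * noise_energy and zcr < 0.1
```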

Principle
• A speech sample can be approximated as a linear
combination of past speech samples.
• By minimizing the sum of the squared differences between
the actual speech samples and the linearly predicted ones, a
unique set of predictor coefficients can be determined.
• Linear prediction provides a robust, reliable and accurate
method for estimating the parameters that characterize the
linear, time-varying system.
Steps
1. Voiced/unvoiced decision (based on the energy in the segment)
2. Pitch period estimation (using the autocorrelation function)
3. Obtaining the vocal tract filter: a linear filter whose coefficients are chosen in a minimum mean squared error sense
Autocorrelation function
• The autocorrelation of a periodic function, Rxx(k), will have a maximum when k is equal to the pitch period.
• Coupled with the fact that the estimation of the autocorrelation function generally leads to a smoothing out of the noise, this makes the autocorrelation function a useful tool for obtaining the pitch period.
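A minimal sketch of autocorrelation-based pitch estimation; the 8 kHz sampling rate and the 50-400 Hz pitch search range are assumptions, not values from the text:

```python
import numpy as np

FS = 8000   # assumed sampling rate

def pitch_by_autocorr(segment: np.ndarray) -> int:
    x = segment - segment.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # R_xx(k) for k >= 0
    lo, hi = FS // 400, FS // 50                      # candidate pitch lags
    return lo + int(np.argmax(r[lo:hi]))              # lag of the peak
```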

Problems with the use of the autocorrelation
• Voiced speech is not exactly periodic, which makes the maximum lower than we would expect from a periodic signal.
• Generally, a maximum is detected by checking the autocorrelation value against a threshold; if the value is greater than the threshold, a maximum is declared to have occurred.
• When there is uncertainty about the magnitude of the maximum value, it is difficult to select a value for the threshold.
• Another problem occurs because of the interference due to other resonances in the vocal tract.

Average magnitude difference function (AMDF)
• If a sequence $y_n$ is periodic with period $P_0$, samples that are $P_0$ apart in the $y_n$ sequence will have values close to each other, and therefore the AMDF
$$\mathrm{AMDF}(P) = \frac{1}{N}\sum_{n}\left|y_n - y_{n-P}\right|$$
will have a minimum at $P_0$.
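A minimal sketch of AMDF-based pitch estimation, under the same sampling-rate and search-range assumptions as the autocorrelation sketch above:

```python
import numpy as np

FS = 8000   # assumed sampling rate

def pitch_by_amdf(segment: np.ndarray) -> int:
    lo, hi = FS // 400, FS // 50    # candidate pitch periods
    amdf = [np.mean(np.abs(segment[p:] - segment[:-p])) for p in range(lo, hi)]
    return lo + int(np.argmin(amdf))  # AMDF has a minimum at the pitch period
```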

[Figure: AMDF function for the sound /e/ in test.]
[Figure: AMDF function for the sound /s/ in test.]
Obtaining the Vocal Tract Filter
• If $y_n$ are the speech samples in that particular segment, then we want to choose the coefficients $a_i$ to minimize the average value of $e_n^2$, where the prediction error is
$$e_n = y_n - \sum_{i=1}^{M} a_i\, y_{n-i}.$$

Autocorrelation approach
• We assume that the $y_n$ sequence is zero outside the segment for which we are calculating the filter parameters.
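A minimal sketch of this approach; the Levinson-Durbin recursion is the classical solver for the resulting Toeplitz normal equations, and scipy.linalg.solve_toeplitz plays that role here:

```python
import numpy as np
from scipy.linalg import solve_toeplitz   # exploits the Toeplitz structure

def lpc_autocorr(segment: np.ndarray, order: int = 10) -> np.ndarray:
    """Predictor coefficients a_1..a_M from the autocorrelation method."""
    x = segment * np.hamming(len(segment))  # window: zero outside the segment
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Normal equations R a = r, with R built from R(0)..R(M-1).
    return solve_toeplitz(r[:order], r[1:order + 1])
```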

Contd.
• Solving for the predictor coefficients (e.g., via the Levinson-Durbin recursion) also yields the reflection coefficients, or partial correlation (PARCOR) coefficients.

Problem
• In order to get an effective reconstruction of the voiced
segment, the order of the vocal tract filter needs to be
sufficiently high.
• Generally, the order of the filter is 10 or more.
• Because the filter is an IIR filter, error in the coefficients can
lead to instability, especially for the high orders necessary in
linear predictive coding.
• As the filter coefficients are to be transmitted to the receiver,
they need to be quantized. This means that quantization error
is introduced into the value of the coefficients, and that can
lead to instability.

Covariance method
• Discarding the assumption of stationarity made in the autocorrelation approach, the equations used to obtain the filter coefficients change.
• Defining $c_{ij} = E[y_{n-i}\, y_{n-j}]$ as a function of both $i$ and $j$, the coefficients are obtained by solving
$$\sum_{i=1}^{M} a_i\, c_{ij} = c_{0j}, \qquad j = 1, \ldots, M.$$
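A minimal sketch of the covariance method, building the $c_{ij}$ terms directly from the segment with no zero extension outside it:

```python
import numpy as np

def lpc_covariance(segment: np.ndarray, order: int = 10) -> np.ndarray:
    """Predictor coefficients from the covariance method (no zero extension)."""
    y, M = np.asarray(segment), order
    n = np.arange(M, len(y))                    # indices with a full history
    C = np.array([[np.dot(y[n - i], y[n - j])   # c_ij = sum_n y_{n-i} y_{n-j}
                   for j in range(1, M + 1)]
                  for i in range(1, M + 1)])
    c0 = np.array([np.dot(y[n], y[n - j]) for j in range(1, M + 1)])
    return np.linalg.solve(C, c0)               # sum_i a_i c_ij = c_0j
```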

Code excited linear prediction (CELP)
• In CELP, each trial waveform is synthesized by passing it through a two-part cascade synthesis filter.
• The first part, termed the pitch synthesis filter, inserts pitch periodicities into the reconstructed speech.
• The second filter is the formant synthesis filter, which introduces a frequency shaping related to the formant resonances produced by the human vocal tract.
• Both filters are all-pole structures, using an FIR filter in a feedback configuration.
CELP analyzer
• The formant filter F(z) removes sample-to-sample correlation.
• The predictive filter P(z) removes periodicities due to the pitch-excited nature of speech.
CELP synthesizer
• P(z), the pitch synthesis filter, inserts pitch periodicities into the reconstructed speech.
• F(z), the formant synthesis filter, introduces a frequency shaping related to the formant resonances produced by the human vocal tract.
CELP
• In CELP, the excitation waveform is chosen from a dictionary of waveforms.
• Conceptually, each waveform in the dictionary is passed through the synthesis filters to determine which waveform "best" matches the input speech.
• The optimality criterion is based on the same type of frequency-weighted mean-square error criterion used in multi-pulse coding.
• The index of the "best" waveform is transmitted to the decoder. In addition, both the formant and pitch filters are transmitted periodically.
• The parameters of these filters are sent to the decoder as side information to allow it to form the appropriate synthesis filters.
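A minimal sketch of the codebook search, assuming SciPy; the function and parameter names are hypothetical, and the frequency-weighted error of a real coder is simplified here to a plain mean-squared error:

```python
import numpy as np
from scipy.signal import lfilter

def search_codebook(target: np.ndarray, codebook: np.ndarray,
                    pitch_den: np.ndarray, formant_den: np.ndarray) -> int:
    """Index of the codeword whose synthesized output best matches target.

    pitch_den and formant_den are the denominator polynomials of the
    all-pole pitch synthesis filter P(z) and formant synthesis filter F(z).
    """
    errors = []
    for codeword in codebook:
        y = lfilter([1.0], pitch_den, codeword)   # insert pitch periodicity
        y = lfilter([1.0], formant_den, y)        # formant spectral shaping
        errors.append(np.mean((target - y) ** 2))
    return int(np.argmin(errors))
```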
