Speech Compression Techniques - Formant and CELP Vocoders
Speech production mechanism
Speech is produced by forcing air first through an elastic opening, the vocal cords, and then through the laryngeal, pharyngeal, oral, and nasal passages, and finally through the mouth and the nasal cavity. Everything past the vocal cords is generally referred to as the vocal tract. The first action generates the sound, which is then modulated into speech as it traverses the vocal tract.
Simplified model of speech synthesis: an excitation source drives a linear filter that models the vocal tract.
The sound /e/ in test (male voice saying the word test): voiced sounds tend to have a pseudoperiodic structure.
The sound /s/ in test: unvoiced sounds, e.g., the /s/ in the word test, tend to have a noiselike structure.
The Channel Vocoder (analyzer block diagram): the input s(n) drives a bank of bandpass filters; each filter output is rectified, lowpass filtered to estimate the energy in that band, passed through an A/D converter, and encoded for transmission to the channel, together with the outputs of a voicing detector and a pitch detector.
The channel vocoder receiver
The Channel Vocoder (synthesizer):
• At the receiver, the vocal tract filter is implemented by a
bank of band-pass filters. The bank of filters at the
receiver, known as the synthesis filters, is identical to the
bank of analysis filters.
• Based on whether the speech segment was deemed to
be voiced or unvoiced, either a pseudonoise source or a
periodic pulse generator is used as the input to the
synthesis filter bank.
• The period of the pulse input is determined by the pitch estimate obtained at the transmitter for the segment being synthesized. The input is scaled by the energy estimate at the output of the analysis filters.
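The excitation logic described above can be sketched as follows; the function name, segment length, and the use of uniform pseudonoise are illustrative assumptions, not part of any standard:

```python
import random

def excitation(num_samples, voiced, pitch_period, energy, seed=0):
    """Generate the synthesizer excitation for one segment.

    Voiced segments use a periodic pulse train at the estimated
    pitch period; unvoiced segments use a pseudonoise source.
    Either input is scaled by the energy estimate from the
    corresponding analysis filter.
    """
    if voiced:
        # One unit pulse every pitch_period samples.
        exc = [1.0 if n % pitch_period == 0 else 0.0
               for n in range(num_samples)]
    else:
        rng = random.Random(seed)
        exc = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    return [energy * x for x in exc]
```

In a full synthesizer this signal would feed the bank of synthesis filters, one scaled copy per band.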
The Phase Vocoder (analyzer block diagram, kth channel): the input s(n) is multiplied by cos(ωk n) and sin(ωk n); each product is lowpass filtered and decimated; the short-term magnitude and phase are computed from the two filter outputs; the phase is differentiated to give the short-term phase derivative; and the magnitude and phase derivative are encoded for transmission to the channel.
The Phase Vocoder (synthesizer block diagram, kth channel): the short-term amplitude and short-term phase derivative decoded from the channel are interpolated back to the original sampling rate; the phase derivative is integrated to recover the phase, which drives a cosine and a sine oscillator at the channel center frequency ωk; the amplitude-weighted oscillator outputs of all channels are summed to give the reconstructed speech.
The Phase Vocoder
• LPF bandwidth: 50 Hz
• Demodulation separation: 100 Hz
• Number of filters: 25–30
• Sampling rate of spectral magnitude and phase derivative: 50–60 samples per second
• Spectral magnitude is coded using PCM or DPCM
• Phase derivative is coded linearly using 2–3 bits
• The resulting bit rate is 7200 bps
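The 7200 bps figure can be reproduced under one plausible bit allocation; the split assumed here (30 channels, both parameters at 60 samples per second, 2 bits each for magnitude and phase derivative) is an illustrative assumption consistent with the ranges above:

```python
channels = 30        # number of analysis filters (upper end of 25-30)
rate = 60            # samples/s per parameter (upper end of 50-60)
bits_magnitude = 2   # assumed PCM/DPCM bits per magnitude sample
bits_phase = 2       # lower end of the 2-3 bit range
bit_rate = channels * rate * (bits_magnitude + bits_phase)
print(bit_rate)  # 7200 bps
```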
Formant Vocoder
• As the vocal tract is a tube of nonuniform cross section, it resonates at a number of different frequencies, known as formants.
• The formant values change with different sounds; however, we can
identify ranges in which they occur.
• For example, the first formant occurs in the range 200–800 Hz for a
male speaker, and in the range 250–1000 Hz for a female speaker.
• The formant vocoder transmits an estimate of the formant values (usually four formants are considered sufficient) and an estimate of the bandwidth of each formant.
• At the receiver the excitation signal is passed through tunable filters that are tuned to the formant frequencies and bandwidths.
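One standard way to build such a tunable filter is a second-order all-pole digital resonator whose pole radius is set by the formant bandwidth and pole angle by the formant frequency. This is a sketch of that textbook design, not the exact filter of any particular formant vocoder:

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Second-order all-pole resonator tuned to a formant.

    Pole radius r is set by the bandwidth, pole angle theta by
    the formant frequency.
    """
    r = math.exp(-math.pi * bw_hz / fs_hz)
    theta = 2.0 * math.pi * freq_hz / fs_hz
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    return a1, a2

def resonate(x, a1, a2):
    """Filter x with y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y = []
    y1 = y2 = 0.0
    for xn in x:
        yn = xn - a1 * y1 - a2 * y2
        y2, y1 = y1, yn
        y.append(yn)
    return y
```

Driving a parallel bank of such resonators, one per formant, with the excitation signal and summing their outputs gives the synthesized speech.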
Formant vocoder block diagram: from the input speech the analyzer extracts the formant frequencies F1, F2, F3, their bandwidths B1, B2, B3, the pitch F0, and a voiced/unvoiced (V/U) decision. At the decoder an excitation signal, constructed from F0 and the V/U flag, is passed through resonators tuned to (F1, B1), (F2, B2), and (F3, B3), and the resonator outputs are summed to form the synthesized speech.
Digital model for speech production
Linear Predictive Coder
• The vocal tract is modeled as a single linear filter whose output y_n is related to the input ξ_n by
y_n = Σ_{i=1}^{M} b_i y_{n−i} + G ξ_n
where G is the gain of the filter.
Speech synthesis model
The input speech is generally sampled at 8000 samples
per second.
In the LPC-10 standard, the speech is broken into 180
sample segments, corresponding to 22.5 milliseconds of
speech per segment.
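The segmentation described above is a one-liner; this sketch (function name assumed, non-overlapping frames, trailing partial frame dropped) just makes the numbers concrete:

```python
def frame_segments(samples, frame_len=180):
    """Split speech sampled at 8000 samples/s into LPC-10
    analysis segments of 180 samples (22.5 ms each)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```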
Principle
• A speech sample can be approximated as a linear
combination of past speech samples.
• By minimizing the sum of the squared differences between
the actual speech samples and the linearly predicted ones, a
unique set of predictor coefficients can be determined.
• Linear prediction provides a robust, reliable and accurate
method for estimating the parameters that characterize the
linear, time-varying system.
Steps
1. Voiced/unvoiced decision (based on the energy in the segment)
2. Pitch period estimation (using the autocorrelation function)
3. Obtaining the vocal tract filter: a linear filter whose coefficients are chosen in a minimum mean squared error sense
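The first two steps can be sketched as follows; the energy threshold and the lag search range (20–160 samples, i.e. 50–400 Hz at 8 kHz) are illustrative assumptions:

```python
def voiced_decision(segment, energy_threshold):
    """Step 1: classify a segment as voiced when its energy
    exceeds a threshold (simple energy rule, assumed)."""
    energy = sum(s * s for s in segment)
    return energy > energy_threshold

def pitch_period(segment, min_lag=20, max_lag=160):
    """Step 2: pitch period as the lag maximizing the
    autocorrelation R(k) = sum_n y[n] * y[n-k] over a
    plausible lag range."""
    n = len(segment)
    best_lag, best_r = min_lag, float("-inf")
    for k in range(min_lag, max_lag + 1):
        r = sum(segment[i] * segment[i - k] for i in range(k, n))
        if r > best_r:
            best_r, best_lag = r, k
    return best_lag
```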
Autocorrelation function
For a segment {y_n} of length N, R(k) = Σ y_n y_{n−k}. For voiced speech R(k) peaks at lags equal to the pitch period and its multiples, so the pitch period is estimated as the lag of the largest peak in a plausible range.
Problems with the use of the autocorrelation
• Voiced speech is not exactly periodic, which makes the
maximum lower than we would expect from a periodic signal.
• Generally, a maximum is detected by checking the
autocorrelation value against a threshold; if the value is
greater than the threshold, a maximum is declared to have
occurred.
• When there is uncertainty about the magnitude of the
maximum value, it is difficult to select a value for the
threshold.
• Another problem occurs because of the interference due to other resonances in the vocal tract.
Average magnitude difference function (AMDF)
AMDF(P) = (1/N) Σ |y_i − y_{i−P}|. Instead of peaks, the AMDF shows pronounced dips at lags equal to the pitch period and its multiples, and it requires no multiplications.
AMDF function for the sound /e/ in test.
AMDF function for the sound /s/ in test.
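A minimal AMDF-based pitch estimator, with the same illustrative lag range as before (function names and the per-lag normalization by N − P are assumptions):

```python
def amdf(segment, lag):
    """AMDF at one lag: average of |y[n] - y[n-lag]| over the
    frame; dips sharply at the pitch period for voiced speech."""
    n = len(segment)
    return sum(abs(segment[i] - segment[i - lag])
               for i in range(lag, n)) / (n - lag)

def pitch_from_amdf(segment, min_lag=20, max_lag=160):
    """Pick the lag with the deepest AMDF dip; differencing
    avoids the multiplications the autocorrelation needs."""
    return min(range(min_lag, max_lag + 1),
               key=lambda k: amdf(segment, k))
```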
Obtaining the Vocal Tract Filter
Autocorrelation approach
Problem
• In order to get an effective reconstruction of the voiced
segment, the order of the vocal tract filter needs to be
sufficiently high.
• Generally, the order of the filter is 10 or more.
• Because the filter is an IIR filter, error in the coefficients can
lead to instability, especially for the high orders necessary in
linear predictive coding.
• As the filter coefficients are to be transmitted to the receiver,
they need to be quantized. This means that quantization error
is introduced into the value of the coefficients, and that can
lead to instability.
Covariance method
• If we discard the assumption of stationarity made in the autocorrelation approach, the equations used to obtain the filter coefficients change.
• Defining c(i, j) = E[y_{n−i} y_{n−j}] as a function of both i and j, the coefficients are found by solving Σ_{j=1}^{M} b_j c(i, j) = c(i, 0) for i = 1, …, M. The resulting matrix is no longer Toeplitz, so the Levinson-Durbin recursion cannot be used.
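A sketch of the covariance method under these definitions, solving the non-Toeplitz normal equations by plain Gaussian elimination (the frame, order, and summation range are illustrative assumptions):

```python
def covariance_lpc(y, order):
    """Covariance method: build c(i, j) = sum_n y[n-i]*y[n-j]
    over the frame (no stationarity assumption) and solve
        sum_j b_j c(i, j) = c(i, 0),  i = 1..order
    by Gaussian elimination with partial pivoting."""
    n = len(y)

    def c(i, j):
        return sum(y[m - i] * y[m - j] for m in range(order, n))

    # Augmented matrix [C | rhs].
    mat = [[c(i, j) for j in range(1, order + 1)] + [c(i, 0)]
           for i in range(1, order + 1)]
    # Forward elimination.
    for col in range(order):
        piv = max(range(col, order), key=lambda r: abs(mat[r][col]))
        mat[col], mat[piv] = mat[piv], mat[col]
        for r in range(col + 1, order):
            f = mat[r][col] / mat[col][col]
            for k in range(col, order + 1):
                mat[r][k] -= f * mat[col][k]
    # Back substitution.
    b = [0.0] * order
    for i in range(order - 1, -1, -1):
        b[i] = (mat[i][order]
                - sum(mat[i][j] * b[j]
                      for j in range(i + 1, order))) / mat[i][i]
    return b
```

Because the frame exactly determines the equations, a signal generated by a known all-pole recursion is recovered exactly, which is a convenient sanity check.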
Code Excited Linear Prediction (CELP)