CSPL 392
X[n] = -( a_1 X[n-1] + a_2 X[n-2] + ... + a_L X[n-L] ) + e[n]        (1.1)
Each sample is represented as a linear combination of the previous L samples plus a white-noise term. The weighting coefficients a_1, a_2, ..., a_L are called Linear Prediction Coefficients (LPCs). We now describe how CELP uses this model to encode speech.
The samples of the input speech are divided into blocks of N samples each, called frames. Each frame is typically 10-20 ms long (this corresponds to N = 80-160). Each frame is divided into smaller blocks of k samples each (k being the dimension of the VQ), called subframes. For each frame, we choose a_1, a_2, ..., a_L so that the spectrum of {X_1, X_2, ..., X_N}, generated using the above model, closely matches the spectrum of the input speech frame. This is a standard spectral estimation problem, and the LPCs a_1, a_2, ..., a_L can be computed using the Levinson-Durbin algorithm.
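The Levinson-Durbin recursion can be sketched as follows. This is a minimal NumPy version, assuming the sign convention of Eq. (1.2), A(z) = 1 + a_1 z^-1 + ... + a_L z^-L; the random frame below is only a stand-in for real speech samples.

```python
import numpy as np

def levinson_durbin(r, L):
    """From autocorrelations r[0..L], return the coefficients
    [1, a_1, ..., a_L] of A(z) via the Levinson-Durbin recursion."""
    a = np.zeros(L + 1)
    a[0] = 1.0
    E = r[0]                               # prediction-error energy
    for i in range(1, L + 1):
        acc = np.dot(a[:i], r[i:0:-1])     # a_0*r_i + a_1*r_{i-1} + ... + a_{i-1}*r_1
        k = -acc / E                       # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]     # order update of a_1..a_i
        E *= (1.0 - k * k)                 # updated error energy
    return a

# autocorrelation of one N = 80 frame, then a 10th-order fit
frame = np.random.randn(80)                # stand-in for a speech frame
r = np.array([frame[:80 - m] @ frame[m:] for m in range(11)])
lpcs = levinson_durbin(r, 10)              # lpcs[0] == 1, lpcs[1:] are a_1..a_10
```

For an exact AR-1 autocorrelation sequence r = [1, rho, rho^2], the recursion returns a_1 = -rho and a_2 = 0, which is a convenient sanity check.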
Fig. 1: Basic CELP scheme: the error is minimized by selecting the best codebook entry.
Writing Eq.(1.1) in z-domain, we obtain
) (
1
) .... ( 1
1
) (
) (
2
2
1
1
z A z a z a z a z E
z X
L
L
=
+ + +
=
(1.2)
From Eqs. (1.1) and (1.2), we see that if we pass a white sequence e[n] through the filter 1/A(z), we can generate X(z), a close reproduction of the input speech.
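As a quick illustration, passing white noise through 1/A(z) is a one-liner with SciPy. This is a first-order toy example; the coefficient value is illustrative, not estimated from speech.

```python
import numpy as np
from scipy.signal import lfilter

A = np.array([1.0, -0.9])      # toy A(z) = 1 - 0.9 z^-1 (illustrative a_1)
e = np.random.randn(160)       # white excitation e[n]
x = lfilter([1.0], A, e)       # x[n] = e[n] + 0.9*x[n-1]
```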
The block diagram of a CELP encoder is shown in Fig. 1. There is a codebook of size M and dimension k, available to both the encoder and the decoder. The codevectors have components that are all independently chosen from an N(0,1) distribution, so that each codevector has an approximately white spectrum. For each subframe of input speech (k samples), the processing is done as follows: each of the codevectors is filtered through the two filters (labeled 1/A(z) and 1/B(z)), and the output y_k is compared to the speech samples. The codevector whose output best matches the input speech (least MSE) is chosen to represent the subframe.
The first of the filters, 1/A(z), is described by Eq. (1.2). It shapes the white spectrum of the codevector to resemble the spectrum of the input speech. Equivalently, in the time domain, the filter introduces short-term correlations (correlation with the L previous samples) into the white sequence. Besides these short-term correlations, it is known that regions of voiced speech exhibit long-term
periodicity. This period, known as pitch, is introduced into the synthesized spectrum by the pitch
filter 1/B(z). The time-domain behavior of this filter can be expressed as

y[n] = x[n] + y[n - P],        (1.3)

where x[n] is the input, y[n] is the output, and P is the pitch.
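A direct time-domain implementation of Eq. (1.3) is a one-tap long-term predictor. The sketch below is a minimal version; the optional `state` argument (an assumption here, not part of the report) carries the last P outputs across subframes.

```python
import numpy as np

def pitch_filter(x, P, state=None):
    """One-tap pitch filter of Eq. (1.3): y[n] = x[n] + y[n - P]."""
    past = np.zeros(P) if state is None else np.asarray(state, float)
    y = np.concatenate((past, np.zeros(len(x))))
    for n in range(len(x)):
        y[P + n] = x[n] + y[n]     # output depends on the output P samples back
    return y[P:]                   # drop the carried-in state

y = pitch_filter([1.0, 2.0, 3.0, 4.0], P=2)
# y = [1, 2, 4, 6]: the first two outputs recur into the last two
```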
The speech synthesized by the filtering is scaled by an appropriate gain to make the energy
equal to the energy of the input speech. To summarize: for every frame of speech (N samples), we compute the LPCs and pitch and update the filters. For every subframe of speech (k samples),
the codevector that produces the best filtered output is chosen to represent the subframe.
The decoder receives the index of the chosen codevectors and the quantized value of gain for
each subframe. The LPCs and the pitch values also have to be quantized and sent every frame for
reconstructing the filters at the decoder. The speech signal is reconstructed at the decoder by
passing the chosen codevectors through the filters.
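The per-subframe search described above can be sketched as follows. This is a simplified version under stated assumptions: zero filter state, illustrative values for A(z), the pitch P, and a pitch gain g; a real coder carries filter state across subframes and subtracts the zero-input response first.

```python
import numpy as np
from scipy.signal import lfilter

def encode_subframe(target, codebook, A, P, g):
    """Exhaustive codebook search for one subframe: filter each codevector
    through 1/A(z) and the pitch filter, scale the gain to match energies,
    and keep the index with the least MSE."""
    best_i, best_gain, best_err = 0, 0.0, np.inf
    for i, d in enumerate(codebook):
        y = lfilter([1.0], A, d)                      # short-term filter 1/A(z)
        for n in range(P, len(y)):                    # pitch filter 1/B(z)
            y[n] += g * y[n - P]
        gain = np.sqrt((target @ target) / (y @ y))   # match energies
        err = np.sum((target - gain * y) ** 2)        # MSE criterion
        if err < best_err:
            best_i, best_gain, best_err = i, gain, err
    return best_i, best_gain

rng = np.random.default_rng(0)
M, k = 64, 5
codebook = rng.standard_normal((M, k))                # N(0,1) codevectors
idx, gain = encode_subframe(rng.standard_normal(k), codebook,
                            A=[1.0, -0.9], P=3, g=0.5)
```

The decoder only needs `idx` and the quantized `gain` to rebuild the subframe, since it holds the same codebook and filter parameters.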
An interesting interpretation of the CELP encoder is that of a forward adaptive VQ. The filters are updated every N samples, so we have a new set of codevectors y_k every frame. Thus, the dashed block in Fig. 1 can be considered a forward adaptive codebook, because it is designed according to the current frame of speech.
In our design, we have used MSE as the criterion for choosing the best codevector. However, it is the perceptual quality of the synthesized speech that we should seek to optimize. Does a lower MSE always guarantee better-sounding speech? Generally, but not always; for this reason, many practical CELP coders use a perceptually weighted MSE as the fidelity criterion. In our design, we found plain MSE to be a reasonable fidelity criterion. In a later section, we examine the correlation between MSE and the perceptual quality of the synthesized speech.
Rate
The rate of the CELP coder is determined by two factors:
1. The rate of the VQ: R_VQ = (log2 M) / k bits/sample.
2. Overhead bits needed to send the quantized values of gain for every subframe and the
LPC coefficients for each frame.
The rate of the coder in bits per second is given by

R = (R_VQ + #overhead bits/sample) * 8000
In this work, we consider only the rate of the VQ in all our experiments. In practical systems, the numbers of bits allocated to the VQ and to the overhead values are approximately equal, so the rate of the VQ is roughly half the rate of the coder.
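Under the assumption above that the VQ and overhead bit budgets are roughly equal, the numbers work out as follows for one illustrative configuration (M = 1024, k = 10, 8000 samples/s):

```python
import math

M, k = 1024, 10
R_vq = math.log2(M) / k          # VQ rate in bits/sample
overhead = R_vq                  # assume roughly equal overhead
R = (R_vq + overhead) * 8000     # coder rate in bits/second
# R_vq = 1.0 bit/sample, R = 16000 bits/second
```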
2 Design and Performance Analysis of a CELP Coder
We designed a basic CELP coder in MATLAB with the following parameters:
- Frame size N = 80
- L = 10 (10th-order AR model)
- Fidelity criterion: MSE
- Dimension k = 10 or 5
Our input speech for all experiments comprised two sentences each of male and female speech. The SNR of the reconstructed speech was determined at different rates for dimensions k = 5 and k = 10. The SNR vs. rate curves obtained for k = 5 and k = 10 are shown in Fig. 2. We ran experiments with codebooks of size M = 32, 64, 128, 256, 512, and 1024. This corresponds to a rate R_VQ varying from 0.5 to 1 when k = 10 and from 1 to 2 when k = 5.
The SNRs predicted by Zador's formula are also shown in the graph. The dotted line is the Zador SNR assuming an AR-1 model for speech. This was computed by estimating the AR-1 parameter (the correlation coefficient ρ) from the input speech signal. The line above the dotted line represents the Zador SNR assuming a k-th order AR model. The Zador factor ζ_k for this model is given by

ζ_k = 2π ( (k+2)/k )^((k+2)/2) |K|^(1/k)
The k autocorrelation values required for computing the determinant were estimated from the
input speech signal. While it may seem surprising that the SNR of the CELP coder is higher than the Zador SNR, it should be noted that CELP is significantly different from an ordinary VQ (to which we have applied Zador's formula). CELP has an adaptive codebook that changes every frame according to the input speech, whereas the SNR predicted by Zador's formula is applicable to a fixed-codebook VQ for a source characterized by the estimated covariance matrix K. Thus, we cannot expect Zador's formula (in the form applied here) to give a reasonable estimate of the SNR. An improved estimate could be obtained by calculating the Zador SNR for every frame, based on a covariance matrix estimated for that frame.
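The per-frame variant suggested above could be sketched like this. It is an assumption-laden illustration: it computes only the |K|^(1/k) term entering the Zador factor, from a Toeplitz covariance estimated on a single frame.

```python
import numpy as np
from scipy.linalg import toeplitz

def det_term(frame, k):
    """|K|^(1/k) for the k x k Toeplitz covariance estimated from one
    frame's first k autocorrelation values."""
    N = len(frame)
    r = np.array([frame[:N - m] @ frame[m:] for m in range(k)]) / N
    K = toeplitz(r)                       # symmetric Toeplitz covariance
    sign, logdet = np.linalg.slogdet(K)   # numerically stable determinant
    return np.exp(logdet / k)

frame = np.random.default_rng(1).standard_normal(80)
z = det_term(frame, 10)                   # one scalar per frame
```

The biased autocorrelation estimate used here keeps K positive semidefinite, so the determinant term is well defined frame by frame.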
We also observe that for k = 10 (lower rates), the rate vs. SNR slope is clearly more than 6 dB/bit; for k = 5 (higher rates) the slope approaches 6 dB/bit. This is consistent with our expectation that SNR increases at approximately 6 dB/bit at high rates.
Fig. 2: SNR vs. rate characteristic of our CELP coder: (a) k = 5, (b) k = 10.
3 Complexity of the CELP encoder
Complexity is an important consideration while designing any VQ. The CELP algorithm gives
good output speech quality at low bit rates. However, this quality is obtained at the cost of very
high complexity. Since CELP coders are used for real-time applications such as the transmission
of speech over networks, long delays in encoding are undesirable. In this section, we analyze
the complexity of the encoder and derive expressions for the number of computations and the storage
required. We then discuss methods to reduce the complexity of the algorithm without significant
degradation in performance.
Fig. 3: Computational complexity of CELP.
To analyze complexity, it is convenient to look at CELP as a forward adaptive VQ. The following
analysis will serve as a baseline for comparison, though the actual operations may be performed differently in real-time applications. From Fig. 3, it is clear that the encoding process involves
two stages:
1. The filtering operation: Convolution of each codevector with the impulse response of the
filter. This has to be done every k samples of speech.
2. Choosing the best codevector from the filtered codebook. This part can be considered a
VQ and hence involves 3M ops/sample.
The number of operations in Step 1 can be derived as follows:
While filtering the codevectors in a subframe, we have to take care of the zero-input response
(ZIR) at the output of the filters. This is produced due to filtering that occurred in the previous
subframe. At the beginning of each subframe, we can subtract the ZIR from the speech. Note that
the subtraction needs to be done only once in a subframe, since the ZIR is the same for all
codevectors. We can then represent the convolution of each codevector with the impulse response
of the LPC filter by a matrix multiplication (i.e., as if we had zero initial state):

s = H d,        (3.1)

where d is the codevector, s is the output of the LPC filter, and H is the k x k lower triangular Toeplitz matrix

    H = [ a_0      0        0     ...  0
          a_1      a_0      0     ...  0
          a_2      a_1      a_0   ...  0
          ...                     ...
          a_{k-1}  a_{k-2}  ...   ...  a_0 ],

whose entries a_0 (= 1), a_1, ..., a_{k-1} are the first k samples of the impulse response of 1/A(z).
For each codevector (dimension k), the matrix multiplication requires k(k-1)/2 multiplications and k(k-1)/2 additions.
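The matrix form of Eq. (3.1) can be checked against direct filtering. In this sketch the A(z) coefficients are illustrative, and h denotes the impulse response of 1/A(z) (the entries of the first column of H):

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

k = 5
A = np.array([1.0, -0.9, 0.2])                       # illustrative A(z)
h = lfilter([1.0], A, np.r_[1.0, np.zeros(k - 1)])   # impulse response of 1/A(z)
H = toeplitz(h, np.r_[h[0], np.zeros(k - 1)])        # lower triangular Toeplitz
d = np.random.randn(k)                               # a codevector
s = H @ d                                            # Eq. (3.1), zero initial state
# s matches lfilter([1.0], A, d) sample for sample
```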
The pitch filter equation can be written as

y_n = s_n + g y_{n-p},        (3.2)

where g is the pitch gain, p is the pitch period, and y_n is the filtered codevector. This requires 2k
operations per codevector. Thus the total number of operations per codevector is
op_n / codevector = k(k-1)/2 + k(k-1)/2 + 2k = k(k+1)
Since we have M codevectors and k samples/codevector, the filtering in Step 1 will require M(k+1) operations/sample. The VQ part of the CELP coder will require 3M operations/sample. Thus the overall computation for the encoding process is

op_n = M(k+4) ops/sample        (3.3)
We have to store the M codevectors, so we need storage corresponding to Mk floating-point numbers.
To get an idea of the complexity involved: for k = 10, encoding requires 14M ops/sample. Of this, 11M is due to the filtering part, while the VQ part requires 3M operations/sample.
Clearly, we would like to reduce the complexity of filtering without sacrificing performance. The codebook that we use consists of codevectors with random components drawn from an N(0,1) distribution. In the following subsection, we consider special types of codebooks [2] which reduce the filtering complexity, and we analyze the performance vs. complexity tradeoffs obtained by using them.
4 Special Codebooks for CELP
4.1 The Binary Codebook
Binary codebooks contain codevectors with only binary components, i.e., zeros and ones. Clearly, the filtering operation requires no multiplications. Filtering a codevector involves only additions, and then only where a component of the codevector is a one. The number of ops/sample is computed using a method similar to that of the previous section.
For the LPC filter, the number of multiplications is zero, while the number of additions is half the number of additions in the previous case (the binary codebook contains 50% zeros and 50% ones). The long-term filter will still involve 2k operations per codevector. Thus the filtering part alone will require

op_b / codevector = k(k-1)/4 + 2k
Since we have M codevectors and k samples/codevector, the filtering will require

M(k-1)/4 + 2M