
Fundamentals of

Vector Quantization
Robert M. Gray
Information Systems Laboratory
Department of Electrical Engineering
http://www-isl.stanford.edu/~gray/compression.html

The Shannon compression model:


codes, rate, and distortion
Optimality properties and quantizer design
Structured vector quantizers
Optimal achievable performance:
quantization and source coding theory

Part I:
codes, rate, and distortion

Figure 1: Input Signal → Encoder → Channel → Decoder → Reconstructed Signal

Classic Shannon model of a point-to-point
communication system
General Goal: Given the signal and the
channel, find an encoder and decoder which give
the best possible reconstruction.
To formulate a precise problem, we need:
probabilistic descriptions of signal and channel
(parametric model or sample training data)
possible structural constraints on the form of
codes (block, sliding block, recursive)
a quantifiable notion of what good or bad
reconstruction is (MSE, P_e)
Mathematical: Quantify what the best or optimal
achievable performance is.
Practical: How do you build systems that are
good, if not optimal?
How to measure performance?
SNR, MSE, P_e, bit rates, complexity, cost

Assumptions:
Signal is discrete time or space (e.g., already
sampled).
Do not separate out signal decompositions,
i.e., assume either done already or to be done
as part of the code.
Consider a code structure that maps blocks or
vectors of input data into possibly variable
length binary strings.
Later consider recursive code structures
Discrete-time random process input signal: {X_n}
X_n ∈ A ⊂ ℜ^k ; distribution P_X
X^k = (X_0, X_1, . . . , X_{k−1}), X_n ∈ ℜ
Usually assume some form of stationarity (strict,
asymptotic, etc.)
If not stationary: handle with universal or
adaptive codes.
4

Encoder
An encoder (or source encoder) is a mapping α
of the input vectors into a collection W of finite
length binary sequences:
α : A → W ⊂ {0, 1}*.
W = channel codebook; its members are the channel
codewords,
the set of binary sequences that will be stored or
transmitted.
Assume that W satisfies the prefix condition so
that it is uniquely decodable.
Given an i ∈ {0, 1}*, define
l(i) = length of the binary vector i
instantaneous rate r(i) = l(i)/k bits/input
symbol.
Average rate R(α, W) = E[r(α(X))].

An encoder is said to be fixed length or fixed
rate if all channel codewords have the same
length, i.e., if l(i) = Rk for all i ∈ W.
Variable rate codes may require data
buffering, which is expensive and can overflow and
underflow.
Harder to synchronize variable-rate codes.
Channel bit errors can have catastrophic
effects.
But variable rate codes can provide superior
rate/distortion tradeoffs.
E.g., in image compression can use more bits for
edges, fewer for flat areas. In voice compression,
more bits for plosives, fewer for vowels.

Define a source decoder β : W → Â
(usually Â = A)
Decoder is a table lookup.
Define the reproduction codebook
C ≡ {β(i); i ∈ W}
members of C are called reproduction codewords or
templates.
Convenient to reindex the codebook using integers as
C ≡ {ŷ_l ; l = 0, 1, . . . , M − 1}
where M = ||W|| = number of reproduction
codewords.
A source code or compression code for the
source {X_n} consists of a triple (α, W, β):

A →(α) W →(β) C

or, equivalently,

X → i = α(X) → X̂ = Q(X) = β(α(X)).

ENCODER: X_n → i_n = α(X_n)
DECODER: i_n → X̂_n = β(i_n)

General block memoryless source code.
Later consider codes with memory, but a general
block code might operate in a locally nonmemoryless
fashion.
A code is invertible or noiseless or lossless if β(α(x)) = x,
lossy if it is not lossless.
In the lossy case, require a measure of
distortion d to quantify how lossy.

Quality and Cost


Distortion
Distortion measure d(x, x̂) measures the
distortion or loss resulting if an original input x
is reproduced as x̂.
Mathematically: A distortion measure satisfies
d(x, x̂) ≥ 0
To be useful, d should be
easy to compute
tractable
meaningful for perception or application.
No single distortion measure accomplishes all of
these goals. Most common is MSE:
d(x, y) = ||x − y||² = Σ_{l=0}^{k−1} |x_l − y_l|²

Weighted or transform/weighted versions are
used for perceptual coding. In particular:
Input-weighted squared error:
d(X, X̂) = (X − X̂)^t B_X (X − X̂),
B_X positive definite.
Most common: B_X = I,
d(X, X̂) = ||X − X̂||² (MSE)
Other measures: B_x = x² I
Performance of a compression system is measured
by the expected values of the distortion and rate:
D(α, W, β) = D(α, β) = E[d(X, β(α(X)))]
R(α, W, β) = R(α, W) = E[r(X)] = (1/k) E[l(α(X))]

10
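Not part of the original slides: a minimal NumPy sketch of the squared-error and input-weighted squared-error distortion measures defined above. The function names and test vectors are made up for illustration.

    import numpy as np

    def mse(x, y):
        """Squared-error distortion d(x, y) = ||x - y||^2 summed over components."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        return float(np.sum((x - y) ** 2))

    def weighted_se(x, xhat, B):
        """Input-weighted squared error d(x, xhat) = (x - xhat)^T B (x - xhat),
        with B positive definite (B = I recovers ordinary squared error)."""
        e = np.asarray(x, float) - np.asarray(xhat, float)
        return float(e @ np.asarray(B, float) @ e)

    # Example: for B = I the two measures agree.
    x = np.array([1.0, 2.0, 3.0])
    xhat = np.array([1.1, 1.8, 3.2])
    assert np.isclose(mse(x, xhat), weighted_se(x, xhat, np.eye(3)))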

Every code yields a point in the distortion/rate
plane: (R, D).

[Figure: scatter of (rate, distortion) points achievable by different codes;
horizontal axis Average Rate (0.5 to 3.0), vertical axis Average Distortion.]
D(α, β) and R(α, W) measure costs; we want to
minimize both ⇒ tradeoff.
Given all else equal, one code is better than
another if it has lower rate or lower distortion.
11

Interested in undominated points in D-R plane:


For given rate (distortion) want smallest possible
distortion (rate) optimal codes
Optimization problem:
Given R, what is smallest possible D?
Given D, what is smallest possible R?
Lagrangian formulation: What is the smallest
possible D + λR?

12

Make more precise and carefully describe the
various optimization problems.
Rate-distortion approach: Constrain
D(α, β) ≤ D. Then an optimal code (α, W, β)
minimizes R(α, W) over all allowed codes.
⇒ operational rate-distortion function
R̂(D) = inf_{α,W,β: D(α,β) ≤ D} R(α, W).

Distortion-rate approach: Constrain
R(α, W) ≤ R. Then an optimal code (α, W, β)
minimizes D(α, β) over all allowed codes.
⇒ operational distortion-rate function
D̂(R) = inf_{α,W,β: R(α,W) ≤ R} D(α, β).

13

Lagrangian approach: Fix a Lagrangian
multiplier λ ≥ 0.
An optimal code (α, W, β) minimizes
J_λ(α, W, β) = D(α, β) + λ R(α, W)
over all allowed codes.
⇒ operational Lagrangian distortion function
Ĵ_λ = inf_{α,W,β} E[ρ_λ(X, β(α(X)))]
    = inf_{α,W,β} [D(α, β) + λ R(α, W)].

The first two problems are duals; all three are
equivalent. E.g., the Lagrangian approach yields R̂(D)
for some D or D̂(R) for some R.

14

Lagrangian approach is effectively an unconstrained
minimization of a modified distortion
J_λ = E[ρ_λ(X, α(X))] where
ρ_λ(X, α(X)) = d(X, β(α(X))) + λ l(α(X))
Extreme points:
As λ → 0: distortion → 0, rate → R̂(0) = ??
As λ → ∞: rate → 0, distortion → D̂(0) = ??
Easy: D̂(R) and R̂(D) are monotonically
nonincreasing in their arguments.
Note: usually wish to optimize over a constrained
subset of computationally reasonable codes, or
implementable codes.
Examples: Fixed rate codes, tree-structured
codes, product codes
Introduce structured codes later, but mention
fixed rate codes now.

15
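Illustration only (the candidate (R, D) points below are invented): a tiny sketch of the Lagrangian selection rule, picking the code that minimizes J = D + λR; sweeping λ traces out undominated points in the D-R plane.

    # Hypothetical (rate, distortion) operating points for a family of codes.
    candidates = [(0.5, 9.1), (1.0, 5.2), (1.5, 3.0), (2.0, 1.8), (3.0, 0.7)]

    def best_lagrangian(points, lam):
        """Return the (R, D) pair minimizing the Lagrangian cost J = D + lam * R."""
        return min(points, key=lambda rd: rd[1] + lam * rd[0])

    # Sweeping lam picks out points on the lower convex hull of the D-R cloud:
    for lam in (0.1, 1.0, 5.0):
        print(lam, best_lagrangian(candidates, lam))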

Fixed Rate Codes


Important special case: Fixed rate (length)
codes.
Require all words in W = range space of α to
have equal length.
Then r(X) is constant and the problem is simplified.
Eases buffering requirements when using fixed-rate
transmission media
Eases effects of channel errors
In the fixed rate case, minimizing the modified distortion
is equivalent to minimizing the ordinary distortion:
Lagrangian = distortion-rate if fixed-rate

16

Useful to consider an alternative description of a


source code:
Idea: use a neutral index set: set of integers
I = {0, 1, . . . , M 1}
and refer everything to it.
Now:
Encoder ᾱ : A → I
Decoder β̄ : I → Â
Index decoder γ : I → {0, 1}*

17

E.g., given (α, W, β), index W = {w_i; i ∈ I}.
Then γ(i) ≡ w_i.
Representations (α, W, β) and (ᾱ, γ, β̄) are
equivalent:
β(α(X)) = β̄(ᾱ(X));  α(X) = γ(ᾱ(X))

A → I → W (encoder ᾱ, γ);  W → I → C (decoder γ⁻¹, β̄)

Storage cost: instantaneous rate
r(x) = (1/k) l(α(x)) = (1/k) l(γ(ᾱ(x)))
Average rate:
R(ᾱ, γ) = R(α, W) = (1/k) E[l(γ(ᾱ(X)))],
distortion: d(x, β̄(ᾱ(x))) = d(x, β(α(x)))
average distortion:
D(α, W, β) = D(ᾱ, β̄) = E[d(X, β̄(ᾱ(X)))]

18

Encoder partition S = {S_i; i ∈ I}
S_i = {x : ᾱ(x) = i}; i ∈ I.
Partitions the input space into a disjoint, exhaustive
collection of cells.
Define
1_F(x) = 1 if x ∈ F, 0 otherwise.
Then
α(x) = Σ_{i∈I} γ(i) 1_{S_i}(x);  ᾱ(x) = Σ_{i∈I} i 1_{S_i}(x)
and
β(α(x)) = β̄(ᾱ(x)) = Σ_{i∈I} β̄(i) 1_{S_i}(x)

19

Part II:
Optimality Properties and Quantizer Design

Extreme Points
λ → 0: all of the emphasis on distortion ⇒
force 0 average distortion;
minimize rate only as an afterthought.
(0, R): 0 distortion, corresponds to lossless codes.
What is the smallest R giving a lossless code? R̂(0)
Shannon's noiseless coding theorem:
(1/k) H(X) ≤ R̂(0) < (1/k) H(X) + 1/k
where H(X) = −Σ_x Pr(X = x) log₂ Pr(X = x)
if X is discrete, and H(X) = sup_{quantizers Q} H(Q(X))
otherwise.
Huffman code achieves the minimum.
Note: H = ∞ if X continuous.
20
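A small Python sketch, not from the slides, checking the noiseless coding bound numerically for the scalar (k = 1) case: it builds a binary Huffman code for a made-up pmf and compares the average length with the entropy H(X), verifying H(X) ≤ average length < H(X) + 1.

    import heapq
    from collections import Counter
    from math import log2

    def entropy(pmf):
        """H(X) = -sum p log2 p, in bits per symbol."""
        return -sum(p * log2(p) for p in pmf.values() if p > 0)

    def huffman_lengths(pmf):
        """Codeword lengths of a binary Huffman code for the given pmf."""
        if len(pmf) == 1:                        # degenerate single-symbol source
            return {next(iter(pmf)): 1}
        heap = [(p, [sym]) for sym, p in pmf.items()]
        heapq.heapify(heap)
        lengths = Counter()
        while len(heap) > 1:
            p1, s1 = heapq.heappop(heap)
            p2, s2 = heapq.heappop(heap)
            for sym in s1 + s2:                  # every merge adds one bit to these symbols
                lengths[sym] += 1
            heapq.heappush(heap, (p1 + p2, s1 + s2))
        return dict(lengths)

    # Hypothetical pmf of a discrete (already quantized) source:
    pmf = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
    L = huffman_lengths(pmf)
    avg_len = sum(pmf[s] * L[s] for s in pmf)
    print(entropy(pmf), avg_len)   # H(X) <= average length < H(X) + 1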


λ → ∞: all emphasis on rate ⇒ force 0 average rate;
minimize distortion as an afterthought.
(D, 0): 0 rate, no bits communicated.
What is the best possible 0 rate code? D̂(0)
(Useless in practice, but provides key ideas for
the general case in a very simple setting.)
Given 0 rate, the channel codeword is the empty
sequence, length 0.
Only parameter: decoder β
Optimal performance: D̂(0) = inf_{y∈Â} E[d(X, y)],
achieved by the codebook with the single word
argmin_{y∈Â} E[d(X, y)]
Follows from simple inequalities

21

Example: A = Â = ℜ^k, squared error distortion,
i.e.,
d(x, y) = ||x − y||² = (x − y)^t (x − y)
Then D̂(0) = inf_y E[||X − y||²], achieved by
argmin_y E[||X − y||²] = E(X)
= center of mass or centroid.
E.g., for the input-weighted squared error
d(X, X̂) = (X − X̂)^t B_X (X − X̂),
the centroid is E[B_X]⁻¹ E[B_X X].

22

Empirical Distributions
Training set or learning set
L = {x_i; i = 0, 1, . . . , L − 1}
Empirical distribution is
P_L(F) = (1/L) Σ_{i=0}^{L−1} 1_F(x_i) = #{i : x_i ∈ F}/L
Find the vector y that minimizes
E[||X − y||²] = (1/L) Σ_{i=0}^{L−1} ||x_i − y||².
Answer = expectation
y = (1/L) Σ_{i=0}^{L−1} x_i,
the Euclidean center of gravity of the collection
of vectors in L,
the sample mean or empirical mean.

23
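A minimal sketch (synthetic data, not from the slides) of the zero-rate result on an empirical distribution: under squared error the best single codeword is the sample mean, and the resulting average distortion estimates D̂(0).

    import numpy as np

    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 2))        # hypothetical training set of 2-D vectors

    # Best zero-rate code under squared error: the single codeword is the sample mean.
    centroid = train.mean(axis=0)

    # Resulting average distortion, an estimate of D_hat(0) on the training set:
    D0 = np.mean(np.sum((train - centroid) ** 2, axis=1))
    print(centroid, D0)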

Problem: Find the optimal code for non-extreme λ.
Approach: find a code improvement algorithm.
Suppose that (ᾱ, γ, β̄) or (α, W, β) is a source
code for a stationary source {X(n)} with
distribution P_X.
Fact: If we fix two of the three components of the code
(ᾱ, γ, β̄), then we can describe the optimal third
component.
Yields necessary conditions for an optimal code
and a code improvement algorithm.
Will yield a descent design algorithm for the complete
code.

24

Optimal Decoder for given encoder and
index decoder
Given ᾱ & γ, the optimal decoder is
β̄(i) = argmin_{y∈Â} E[d(X, y) | ᾱ(X) = i]; i ∈ I
⇒ Lloyd centroids of the S_i w.r.t. P_X.
Proof: Apply the 0 rate result to the conditional
distribution:
J_λ(ᾱ, γ, β̄) = D(ᾱ, β̄) + λR(ᾱ, γ) = E[ρ_λ(X, β̄(ᾱ(X)))]
 = Σ_{i∈I} E[d(X, β̄(i)) | ᾱ(X) = i] P_X(S_i) + λR(ᾱ, γ)
 ≥ Σ_{i∈I} P_X(S_i) min_y E[d(X, y) | ᾱ(X) = i] + λR(ᾱ, γ)
If squared error distortion ⇒ conditional mean
E[X | ᾱ(X) = i].
If empirical distribution ⇒ conditional sample
average.
For the more general input-weighted squared error:
x̂_i = E[B_X | ᾱ(X) = i]⁻¹ E[B_X X | ᾱ(X) = i].
25

Optimal Index Decoder for given
Encoder and Decoder
Given ᾱ & β̄, the optimal γ is an optimal lossless code
for ᾱ(X), e.g., a Huffman code.
Proof:
E[ρ_λ(X, β̄(ᾱ(X)))] = D(ᾱ, β̄) + λR(ᾱ, γ)
 = D(ᾱ, β̄) + λ (1/k) E[l(γ(W))]
 ≥ D(ᾱ, β̄) + λ (1/k) min E[l(γ(ᾱ(X)))]
where W = ᾱ(X) & the minimum is over all
prefix-free lossless codes γ.
Can assume that a Huffman code approximately
achieves the Shannon bound, i.e., that
l(γ(ᾱ(X))) ≈ −log₂ Pr(ᾱ(X))
⇒ Replace the average length by the entropy H(ᾱ(X))
⇒ entropy-constrained VQ.

26

Optimal Encoder given Decoder and
Index Decoder
The encoder is the only component depending on all
the other components.
Given β̄ & γ, the optimal encoder is the
minimum (modified) distortion encoder
(generalized nearest neighbor)
ᾱ(x) = argmin_{i∈I} [d(x, β̄(i)) + λ l(γ(i))].
Proof:
E[ρ_λ(X, β̄(ᾱ(X)))] = ∫ dP_X(x) [d(x, β̄(ᾱ(x))) + λ l(γ(ᾱ(x)))]
 ≥ ∫ dP_X(x) min_{i∈I} [d(x, β̄(i)) + λ l(γ(i))]
If the code is fixed rate, then l(γ(i)) = Rk for all i and this
reduces to the usual minimum MSE rule
(Euclidean nearest neighbor)
ᾱ(x) = argmin_{i∈I} d(x, β̄(i)).

27

Optimality properties ⇒ iterative design
algorithm
Given a code (ᾱ, γ, β̄), can improve it by applying the
three properties:
optimize the encoder for the given decoders
optimize the decoder for the given encoder
optimize the index decoder for the given
encoder
Distortion is nonnegative and nonincreasing ⇒
descent algorithm.
In general converges only to a stationary point;
no guarantee of global optimality.

28
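A minimal NumPy sketch of the resulting descent procedure for the fixed-rate case (the generalized Lloyd / LBG iteration). The data, random initialization, and fixed iteration count are simplifications for illustration, not the slides' prescription.

    import numpy as np

    def lloyd_vq(train, num_codewords, iters=50, seed=0):
        """Generalized Lloyd (LBG) design of a fixed-rate VQ codebook under squared error.

        Alternates the two necessary conditions from the slides:
          1. minimum-distortion (nearest-neighbor) encoding of the training set,
          2. centroid (sample-mean) update of each encoder cell.
        A bare-bones sketch; practical designs add codeword splitting,
        empty-cell handling, and a convergence test."""
        rng = np.random.default_rng(seed)
        codebook = train[rng.choice(len(train), num_codewords, replace=False)].copy()
        for _ in range(iters):
            d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            idx = d2.argmin(axis=1)                      # encoder step
            for i in range(num_codewords):               # decoder (centroid) step
                cell = train[idx == i]
                if len(cell) > 0:
                    codebook[i] = cell.mean(axis=0)
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return codebook, d2.min(axis=1).mean()

    rng = np.random.default_rng(1)
    train = rng.normal(size=(2000, 2))
    codebook, D = lloyd_vq(train, num_codewords=16)
    print(D)   # average squared error per training vector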

Variations
Problem with general codes satisfying Lloyd
conditions is complexity:
NN encoder requires computation of distortion
between input and output(s) for all codewords.
Exponentially growing computations and
memory.
Partial solutions:
Allow suboptimal, but good, components.
Decrease in performance may be compensated by
decrease in complexity or memory. May actually
yield a better code in practice.
E.g., by simplifying search and decreasing
storage, might be able to implement an otherwise
unimplementable dimension for a fixed rate R.

29

Suboptimal channel codebook: Constrain γ.
Use any good, if suboptimal, lossless code
(arithmetic, Lempel-Ziv)
Fixed rate codes.
Suboptimal encoder: fast searches which
might not yield the NN, e.g., greedy tree
structures, or search only near the previous
selection.
Constrained code structure: Insist that
codebooks have structure that eases searching
or lessens memory. E.g., lattice codes, product
codes (scalar, gain/shape, mean/shape),
transform codes (transform a big vector then
use simple quantizers on the outputs), and
tree-structured codes

30

Fixed Rate Channel Codebooks


Constrain all channel codewords to have same
length R bits.
In the Lagrangian formulation, the codeword
lengths or instantaneous rate do not enter into
consideration, so one can simply use the
distortion-rate formulation, i.e., try to minimize
the average distortion for the given common rate
R of all codewords.
Here optimal encoder becomes simply Euclidean
nearest neighbor selection.
traditional vanilla VQ
For fixed rate codes, the algorithm is Lloyd's
optimal PCM design algorithm extended to
vectors:
Forgy's algorithm, k-means, Isodata, principal
points
(classical clustering algorithms)
31

Other Clustering Algorithms


Most fixed-rate.
Pairwise nearest neighbor (PNN) (Ward,
Equitz)
Classical k-means: incremental update
variation of Lloyd.
k-means Codebook update:
recompute the centroid for codeword i:
y_i(n) = ((n − 1)/n) y_i(n − 1) + (1/n) x(n)
       = y_i(n − 1) + (1/n)(x(n) − y_i(n − 1)).
Alternative form:
define a(n) = 1/n
k-means update rule is
y_i(n) = y_i(n − 1) + a(n)(x(n) − y_i(n − 1))
if d(x(n), y_i(n − 1)) ≤ d(x(n), y_l(n − 1)), all l.
This is the idea behind Kohonen's self-organizing
feature map (SOFM) applied to clustering.
32
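A sketch of the incremental (online) k-means update just described, assuming squared-error distortion; the stream and initial codebook below are made up.

    import numpy as np

    def incremental_kmeans(stream, codebook):
        """One pass of incremental (online) k-means over a stream of vectors.

        For each input x(n): find the nearest codeword, then move it toward x(n)
        with gain a(n) = 1/n, where n counts the vectors assigned to that cell."""
        codebook = np.array(codebook, dtype=float)
        counts = np.zeros(len(codebook), dtype=int)
        for x in stream:
            i = np.argmin(((codebook - x) ** 2).sum(axis=1))   # nearest-neighbor selection
            counts[i] += 1
            codebook[i] += (x - codebook[i]) / counts[i]        # running-mean (centroid) update
        return codebook

    rng = np.random.default_rng(2)
    data = rng.normal(size=(500, 2))
    cb = incremental_kmeans(data, data[:4])
    print(cb)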

Neural net approaches: (all incremental)


back propagation
competitive learning
SOFM
simulated annealing, stochastic relaxation
Randomize and reduce randomization. E.g.,
put in random jumps out of local optima or
add white noise then reduce
(some proven global optima, but enormously
complex)
deterministic annealing: Rose, Gersho et al.
Randomize quantizer, replace by maximum
entropy distribution, then reduce entropy to
zero (to a deterministic quantizer)

33

Basic idea: Input x is mapped into a channel
codeword i randomly according to a
conditional probability mass function:
p_i(x) = Pr(channel codeword = i | X = x)
       = e^{−d(x,y_i)/T} / Σ_l e^{−d(x,y_l)/T}
Note this is not Lloyd optimal!
The centroid condition is then replaced by
β̄(i) = Σ_n x_n p_i(x_n) / Σ_n p_i(x_n)
For fixed T, iterate the pmf/centroid
constructions to convergence.
Iterate as T → 0 to a hard quantizer.

34
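A rough sketch of the soft (Gibbs) assignment and centroid update described above, with a hand-picked temperature schedule; this illustrates the iteration only, not Rose/Gersho's full deterministic-annealing design.

    import numpy as np

    def soft_assignments(X, codebook, T):
        """Gibbs assignment p_i(x) = exp(-d(x,y_i)/T) / sum_l exp(-d(x,y_l)/T)."""
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)   # shift for numerical stability
        return w / w.sum(axis=1, keepdims=True)

    def da_step(X, codebook, T):
        """One pmf/centroid iteration at fixed temperature T."""
        p = soft_assignments(X, codebook, T)
        return (p.T @ X) / p.sum(axis=0)[:, None]    # probability-weighted centroids

    # Example: reduce the temperature toward a hard quantizer.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 2))
    cb = X[:4].copy()
    for T in (4.0, 1.0, 0.25, 0.05):
        for _ in range(20):
            cb = da_step(X, cb, T)
    print(cb)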

Local vs. global optimality


No guarantee globally optimal. Conditions are
necessary not sufficient. No fully successful
approach to global optimization, but several
avenues:
Exhaustive search
Stochastic relaxation
Simulated Annealing
Deterministic Annealing
Multiple initial guesses
Most evidence anecdotal, very few theorems.

35

There are also non-clustering methods for


constructing fixed rate VQ codebooks.
Random selection or choose from training set.
Pruned random selection: only take words
when no existing words are good enough.
fractal codes: Take the codebook from the image and
use affine transformations of codewords to
get other codewords. The problem is to find a good
codebook.
As a VQ, decoding simple. Encoding hard.
Great for ferns and fjords.

36

Suboptimal Encoders
Several NN search speedups and shortcuts
reported in quantization and pattern recognition
literature. Most ad hoc and help to some degree.
Tree-structured codes
Every channel codebook is assumed to be a prefix
code and hence can be depicted as a binary tree.
Binary prefix-free codes can be depicted as a
binary tree:

37

[Figure: a binary prefix-free code depicted as a binary tree, with a root node,
branches labeled 0 and 1, parent/child/sibling relationships, and terminal
nodes (leaves) corresponding to the channel codewords, e.g., 10, 110, 011,
1110, 1111, 0000, 0001, 0010, 0011, 0100, 0101.]

Can view each channel codeword as pathmap


through the tree.
Can make code progressive and embedded by
producing reproductions with each bit.
Optimally done by centroid of node, conditional
expectation of input given pathmap to that node.
38

Suggests a suboptimal greedy encoder: Instead
of finding the best terminal node in the tree (full search),
advance through the tree one node at a time,
performing at each node a binary search
involving only the two child nodes.
Both choices increase the length by one ⇒ pairwise
Euclidean NN.
Provides an approximate but fast search of the
reproduction codebook: tree-structured vector
quantization (TSVQ)

39
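A toy sketch of the greedy TSVQ search: the tree is represented as a dictionary from binary path maps to node test vectors, a representation chosen here for brevity and not taken from the slides; the labels are made up.

    import numpy as np

    def tsvq_encode(x, tree):
        """Greedy TSVQ search: at each node pick the nearer of the two child labels.

        `tree` maps a binary path string (e.g. '', '0', '01') to that node's test
        vector; a path with no children in the dict is a leaf. Returns the path
        map (channel codeword) as a string of bits. Search is linear in the rate."""
        path = ''
        while path + '0' in tree and path + '1' in tree:
            d0 = np.sum((x - tree[path + '0']) ** 2)
            d1 = np.sum((x - tree[path + '1']) ** 2)
            path += '0' if d0 <= d1 else '1'
        return path

    # Toy depth-2 balanced tree (node labels are made up):
    tree = {
        '0': np.array([-1.0, 0.0]), '1': np.array([1.0, 0.0]),
        '00': np.array([-1.5, -0.5]), '01': np.array([-0.5, 0.5]),
        '10': np.array([0.5, -0.5]), '11': np.array([1.5, 0.5]),
    }
    print(tsvq_encode(np.array([0.4, 0.3]), tree))   # prints '10' for this input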

Can either
1. Begin with codebook and design a tree search
for the codebook.
2. Modify Lloyd to design the tree structured
codebook from scratch.
One way to design TSVQ from scratch:
Step 0 Begin with optimal 0 rate tree.
Step 1 Split node to form rate 1 bit per
vector tree.
Step 2 Run Lloyd.
Step 3 If desired rate achieved, stop. Else either
Split all terminal nodes (balanced tree), or
Split worst terminal node.
Step 4 Run Lloyd.
Step 5 Go to Step 3.

40

Effect is that once a vector is encoded into a node,
it will only be tested against descendants of that
node.

[Figure: (a) 0 bit resolution; (b) 1 bit resolution, training set split between
Codeword 0 and Codeword 1; (c) 2 bit resolution, where only the training
vectors mapped into codeword 0 (resp. codeword 1) are used to design its
children.]
42

TSVQ Summary: balanced vs. unbalanced,
fixed-rate vs. variable rate

[Figure: balanced and unbalanced binary TSVQ trees, each encoding a
sequence of binary decisions along a path of 0/1 branches.]

Sequence of binary decisions. (Each node
labeled; do pairwise NN.)
Search is linear in bit rate (not exponential).
Increased storage, possible performance loss
Code is successive approximation (progressive,
embedded)
Table lookup decoder (simple, cheap,
software)
Fixed or variable rate
43

BFOS TSVQ Design


Basic idea is taken from methods in statistics for
designing decision trees: (See, e.g.,
Classification and Regression Trees by
Breiman, Friedman, Olshen, & Stone)
first grow a tree,
then prune it.
trade off average distortion and average rate.
By first growing and then pruning back can get
optimal subtrees of good trees.
Can get benefits of lookahead without the full
computational load.

44

Growing Trees
Balanced Tree Growing
Split all nodes in a level; cluster labels for the child
nodes using the distribution conditioned on the parent.
Problem: Can get empty nodes.
Unbalanced Tree Growing
Split one node at a time.
Split the worst node, the one with the largest conditional
distortion or partial distortion. (Makhoul,
Roucos, Gish)
Split in a greedy tradeoff fashion: maximize
λ(t) = |change in distortion if split node t| / |change in rate if split node t|.

45

[Figure: growing an unbalanced TSVQ tree by splitting one node at a time.]

46

Pruning trees
Prune trees by finding the subtree T that minimizes
λ(T) = |change in distortion if prune subtree T| / |change in rate if prune subtree T|.
These optimal subtrees are nested.
(Generalized BFOS Algorithm, related to the CART™ algorithm)

[Figure: a grown TSVQ tree and its successively pruned subtrees.]

47

[Figure: Average Distortion vs. Average Rate (0.5 to 3.0) scatter plot.]

Points: distortion-rate pairs of all possible
subtrees of a large tree.

48

Part III: Structured VQ


Lattice VQ
Codebook is a subset of a regular lattice.
A lattice L in ℜ^k is the set of all vectors of the
form Σ_{i=1}^{n} m_i u_i, where the m_i are integers and
the u_i are linearly independent (usually
nondegenerate, i.e., n = k).
E.g., a uniform scalar quantizer (all bin widths
equal) is a lattice quantizer. A product of k
uniform quantizers is a lattice quantizer.
E.g., hexagonal lattice in 2D, E8 in 8D.
Multidimensional generalization of uniform
quantization. (Rectangular lattice corresponds to
scalar uniform quantization.)

49
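A minimal sketch of lattice quantization for the simplest case, the scaled cubic (product) lattice δZ^k, where the fast nearest-neighbor rule is just per-component rounding; the scale δ and test vector are arbitrary.

    import numpy as np

    def cubic_lattice_quantize(x, delta):
        """Nearest-neighbor quantization onto the scaled cubic lattice delta * Z^k.

        For a rectangular (product) lattice the NN search is per-component
        rounding, i.e., k independent uniform scalar quantizers."""
        x = np.asarray(x, dtype=float)
        return delta * np.round(x / delta)

    x = np.array([0.37, -1.62, 2.10])
    print(cubic_lattice_quantize(x, delta=0.5))   # [ 0.5 -1.5  2. ]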

Advantages:
Parameters: scale and support region.
Fast nearest neighbor algorithms known.
Good theoretical approximations for
performance.
Approximately optimal if used with entropy
coding: For high rate uniform quantization
approximately minimizes average distortion
subject to an entropy constraint.

50

Classified VQ
Switched VQ: Separate codebook for each input
class (e.g., active, inactive, textured, background,
edge with orientation)

[Figure: classified (switched) VQ encoder. A classifier maps the input x_n to a
codebook index i_n selecting one of the codebooks C_1, ..., C_m; the selected VQ
produces the codeword index u_n, and i_n is sent as side information.]

Requires on-line classification and side


information.

51

Transform VQ

[Figure: transform VQ. The input vector X is transformed by T into
coefficients X_1, X_2, ..., X_k, which are vector quantized to X̂_1, ..., X̂_k; the
decoder applies the inverse transform T⁻¹ to form X̂.]

52

Multistage VQ
(Multistep or Cascade or Residual VQ)

[Figure: 2-stage encoder: X is quantized by Q1 to X̂_1 (index I_1); the residual
E_2 = X − X̂_1 is quantized by Q2 to Ê_2 (index I_2), giving X̂ = X̂_1 + Ê_2.
A multistage encoder continues with Q3 applied to E_3 = E_2 − Ê_2, and so on.
The 3-stage decoder forms X̂ = X̂_1 + Ê_2 + Ê_3 from I_1, I_2, I_3.]

53
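A sketch of a two-stage (residual) VQ encoder as in the figure above, under squared error; the toy codebooks are made up.

    import numpy as np

    def nearest(x, codebook):
        """Index of the nearest codeword to x under squared error."""
        return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))

    def two_stage_encode(x, cb1, cb2):
        """Residual (multistage) VQ: quantize x with stage 1, then quantize the
        residual with stage 2. Returns the two indices and the reconstruction."""
        i1 = nearest(x, cb1)
        e2 = x - cb1[i1]                 # residual after stage 1
        i2 = nearest(e2, cb2)
        xhat = cb1[i1] + cb2[i2]         # decoder adds the stage outputs
        return i1, i2, xhat

    # Toy codebooks (made up):
    cb1 = np.array([[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
    cb2 = np.array([[0.0, 0.0], [0.25, 0.0], [-0.25, 0.0], [0.0, 0.25], [0.0, -0.25]])
    print(two_stage_encode(np.array([0.8, 1.3]), cb1, cb2))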

Product Codes
Mean-Removed VQ

[Figure: mean-removed VQ. The sample mean of the input vector is scalar
quantized with the mean codebook C_m = {m̂_i; i = 1, ..., N_m} and subtracted;
the remaining shape is vector quantized with the shape codebook
C_r = {Û_j; j = 1, ..., N_r}.]

54

Shape/Gain VQ

[Figure: shape/gain VQ. Shape codebook C_s = {Ŝ_j; j = 1, ..., N_s}, gain
codebook C_g = {ĝ_i; i = 1, ..., N_g}. The encoder first maximizes X^t Ŝ_j over
the shape codebook, then minimizes [ĝ − X^t Ŝ_j]² over the gain codebook;
the decoder looks up the index pair (i, j) in a ROM and outputs ĝ_i Ŝ_j.]

55
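A sketch of the shape/gain product-code search described above, assuming unit-norm shape codewords; the codebooks are made up.

    import numpy as np

    def shape_gain_encode(x, shape_cb, gain_cb):
        """Shape/gain VQ encoder: pick the shape maximizing the inner product
        x^T s_j, then the gain closest to that inner product."""
        corr = shape_cb @ x                      # x^T s_j for every shape codeword
        j = int(np.argmax(corr))
        i = int(np.argmin((gain_cb - corr[j]) ** 2))
        return i, j, gain_cb[i] * shape_cb[j]    # reconstruction ghat_i * shat_j

    # Toy codebooks (made up): unit-norm shapes and a few gains.
    shape_cb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
    gain_cb = np.array([0.5, 1.0, 2.0, 4.0])
    print(shape_gain_encode(np.array([1.8, 1.6]), shape_cb, gain_cb))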


Hierarchical VQ

[Figure: hierarchical VQ. The 8 bpp input samples are coded pairwise by a
2-D VQ at 4 bpp, pairs of those indexes by a 4-D VQ at 2 bpp, and pairs of
those by an 8-D VQ at 1 bpp.]

Each arrow is implemented as a table lookup


having 65536 possible input indexes and 256
possible output indexes. The table is populated
by an off-line minimum distortion search.

56

Recursive VQ
Predictive VQ

[Figure: predictive VQ with residual codebook C = {ê_i; i = 1, ..., 2^R}. The
encoder forms the prediction error e_n = X_n − X̃_n from a vector predictor,
vector quantizes it to ê_n (index i_n), and updates the predictor from
X̂_n = X̃_n + ê_n. The decoder applies VQ⁻¹ to i_n and forms X̂_n = X̃_n + ê_n
using the same vector predictor.]

57
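A scalar toy sketch of the predictive VQ loop (the vector case replaces the scalar predictor and codebook with vector versions); the predictor coefficient, codebook, and signal below are made up.

    import numpy as np

    def predictive_vq(signal, residual_cb, a=0.9):
        """Predictive VQ sketch with a first-order predictor Xtilde_n = a * Xhat_{n-1}.

        Quantizes the prediction error with `residual_cb` (a scalar codebook here)
        and tracks the decoder's reconstruction so encoder and decoder stay in sync."""
        xhat_prev = 0.0
        indices, recon = [], []
        for x in signal:
            xtilde = a * xhat_prev                          # prediction
            e = x - xtilde                                  # prediction error
            i = int(np.argmin((residual_cb - e) ** 2))      # quantize the error
            xhat = xtilde + residual_cb[i]                  # reconstruction (decoder does the same)
            indices.append(i)
            recon.append(xhat)
            xhat_prev = xhat
        return indices, np.array(recon)

    cb = np.array([-0.5, -0.1, 0.0, 0.1, 0.5])
    sig = np.sin(np.linspace(0, 3, 30))
    idx, rec = predictive_vq(sig, cb)
    print(np.mean((sig - rec) ** 2))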

Finite State VQ
Switched Vector Quantizer: Different codebook
for each state + Next State Rule

[Figure: FSVQ encoder and decoder. The current state S_n selects codebook
C_{S_n} from C_1, ..., C_K; the encoder produces index i_n, the decoder outputs
x̂_{i_n} ∈ C_{S_n}, and a next-state function with a unit delay computes S_{n+1}
at both ends.]

59

Part IV:
Quantization and Source Coding
Theory
X = X^k, k allowed to vary.
Can define optima for increasing dimensions:
D̂_k(R), R̂_k(D), and Ĵ_{λ,k}
Quantities are subadditive & can define the
asymptotic optimal performance:

D̂(R) = inf_k D̂_k(R) = lim_{k→∞} D̂_k(R)    (1)
R̂(D) = inf_k R̂_k(D) = lim_{k→∞} R̂_k(D)    (2)
Ĵ_λ = inf_k Ĵ_{λ,k} = lim_{k→∞} Ĵ_{λ,k}.    (3)

Shannon coding theorems relate these
operationally optimal performances (impossible
to compute) to information theoretic
minimizations.
60

D̂(R) = D(R),   R̂(D) = R(D)
i.e., operational DRF (RDF) = Shannon DRF
(RDF)
To define these Shannon quantities need some
definitions from information theory:
Average mutual information between two
discrete random variables X and Y is
I(X; Y) = H(X) + H(Y) − H(X, Y)
        = Σ_{x,y} Pr(X = x, Y = y) log₂ [ Pr(X = x, Y = y) / (Pr(X = x) Pr(Y = y)) ]
The definition extends to continuous alphabets by
maximizing over all quantized versions:
I(X; Y) = sup_{q₁,q₂} I(q₁(X); q₂(Y))

61

Shannon channel capacity: Channel described by a
family of conditional probability distributions
P_{Y^k|X^k}, k = 1, 2, . . .
C = lim_{k→∞} sup_{P_{X^k}} (1/k) I(X^k; Y^k)
Shannon distortion-rate function: Source
described by a family of source probability
distributions P_{X^k}, k = 1, 2, . . .
D(R) = lim_{k→∞} inf_{P_{Y^k|X^k}: I(X^k;Y^k) ≤ kR} (1/k) E[d_k(X^k, Y^k)]
The Lagrangian is a bit more complicated: it equals
D + λR, where D = D(R) at the point where λ is
the magnitude of the slope of the DRF.
Shannon's distortion-rate theory is asymptotic:
fixed rate R and asymptotically large block size k

62

Another approach: Fixed k, asymptotically large R.
X, f_X(x), MSE
VQ Q or (ᾱ, γ, β̄): C = {y_i; i = 1, . . . , N},
S = {S_i; i = 1, . . . , N}
Average distortion:
D(Q) = (1/k) E[||X − Q(X)||²]
     = (1/k) Σ_{i=1}^{N} ∫_{S_i} ||x − y_i||² f_X(x) dx
Average rate: If fixed rate code:
R(Q) = k⁻¹ log N
If variable rate code:
R(Q) = H_k(X̂) = −(1/k) Σ_{i=1}^{N} P_X(S_i) log P_X(S_i)

63

Operational distortion-rate functions:
D̂_{k,f}(R) = inf_{Q∈Q_f: R(Q)≤R} D(Q)
D̂_{k,v}(R) = inf_{Q∈Q_v: R(Q)≤R} D(Q)
where Q_f and Q_v are the sets of all fixed and
variable length quantizers respectively.
Since
H_k(X̂) = −(1/k) Σ_{i=1}^{N} P_X(S_i) log P_X(S_i) ≤ k⁻¹ log N,
must have
D̂_{k,f}(R) ≥ D̂_{k,v}(R)
i.e., the collection of codes satisfying the
variable-rate constraint is bigger and hence the
infimum is smaller.

64

Shannon described the asymptotic behavior of the
operational distortion-rate functions if R is fixed
and k → ∞.
High rate quantization theory considers the case k
fixed and R → ∞:
Bennett (1948), Zador (1963), Gersho (1979),
Neuhoff and Na (1995)
Define the volume V(S_i) of cell S_i:
V(S_i) = ∫_{S_i} dx
If finite, say S_i is in the granular region; else it is in the
overload region.
kD(Q) = Σ_{i: V(S_i)<∞} ∫_{S_i} ||x − y_i||² f_X(x) dx
      + Σ_{i: V(S_i)=∞} ∫_{S_i} ||x − y_i||² f_X(x) dx
= granular distortion + overload distortion

65

Bennett assumptions:
N is very large
fX is smooth (so that Riemann sums
approach Riemann integrals and mean value
theorem of calculus applies)
The total overload distortion is negligible.
E.g., all the probability is on a bounded set.
The volumes V (Si) of all bounded cells are
tiny.
The reproduction codewords are the Lloyd
centroids of their cell.

66

Since f_X is smooth, the cells are small, and
y_i ∈ S_i,
f_X(x) ≈ f_X(y_i);  x ∈ S_i
From the mean value theorem of calculus
P_X(S_i) = ∫_{S_i} f_X(x) dx ≈ V(S_i) f_X(y_i)
hence
f_X(y_i) ≈ P_X(S_i) / V(S_i).

D ≈ (1/k) Σ_{i=1}^{N} P_X(S_i) ∫_{S_i} (||x − y_i||² / V(S_i)) dx
67

For each i, y_i is the centroid of S_i and hence
∫_{S_i} (||x − y_i||² / V(S_i)) dx
= minimum MSE of a 0 bit code for a uniformly
distributed random variable on S_i = moment of
inertia of the region S_i about its centroid.
Convenient to use normalized moments of
inertia so that they are invariant to scale.
Define
M(S) = (1/(k V(S)^{2/k})) ∫_S (||x − y(S)||² / V(S)) dx
where y(S) denotes the centroid of S.

68

Then if c > 0 and cS = {cx : x ∈ S}, then
M(S) = M(cS)
⇒ M depends only on shape and not upon scale.
Now:
D ≈ Σ_{i=1}^{N} P_X(S_i) M(S_i) V(S_i)^{2/k}
  ≈ Σ_{i=1}^{N} f_X(y_i) M(S_i) V(S_i)^{1+2/k}
Assume that as N → ∞, the reproduction vectors
C = {y_i; i = 1, . . . , N} have a smooth point
density λ(x) in the sense that
(1/N)(# reproduction vectors in a set S) ≈ ∫_S λ(x) dx; all S.
∫_{ℜ^k} λ(x) dx = 1
Then
V(S_i) ≈ 1/(N λ(y_i))
69

Define B^{(N)}(x) as the cell of the codebook C_N which
contains x. Assume that as N → ∞ and
B^{(N)}(x) shrinks to x,
M(B^{(N)}(x)) → m(x),
the inertial profile of the sequence of codebooks
(assumed smooth).
For large N
D ≈ E[m(X)/λ(X)^{2/k}] 2^{−2R},
where
R = (1/k) log N

70

This is Bennett's integral or the Bennett
approximation.
Bennett: m = 1/12, k = 1
Gersho: k ≥ 1
Na & Neuhoff: general m(x)
Can use it to evaluate scalar and vector quantizers,
transform coders, tree-structured quantizers.
Can provide bounds.
Used to derive a variety of approximations in
compression, e.g., bit allocation and optimal
transform codes.

71
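A quick numerical check, not from the slides, of the k = 1 Bennett approximation: for a fine uniform scalar quantizer on a Gaussian source the MSE approaches Δ²/12 (m = 1/12).

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=2_000_000)

    for delta in (0.5, 0.1, 0.02):
        q = delta * np.round(x / delta)            # uniform scalar quantizer (no overload:
        mse = np.mean((x - q) ** 2)                # the lattice covers the whole real line)
        print(delta, mse, delta**2 / 12)           # empirical MSE vs. the Δ²/12 approximation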

Optimizing m(x)
Gersho's conjecture:
If f_X(x) is smooth & R is large, the
minimum distortion quantizer has cells S_i
that are (approximately) scaled, rotated, &
translated copies of S*, the convex
polytope that tessellates ℜ^k with minimum
normalized moment of inertia M(S), i.e.,
m(x) = C_k = min_S M(S) = M(S*)
The optimal C_k is known only for k = 1 (interval,
C_1 = 1/12 = 0.08333 . . .)
and k = 2 (regular hexagon,
C_2 = 5/(36√3) = 0.08019 . . .)
Many guesses and bounds exist for higher k.
(Conway and Sloane)
Sphere bound (S = unit-radius sphere):
C_k ≥ M(sphere) = (∫_S ||x||² dx) / (k V(S)^{1+2/k}) = V_k^{−2/k}/(k + 2)

72

where V_k is the volume of a sphere in k
dimensions with unit radius:
V_k = π^{k/2}/Γ(k/2 + 1) = 2π^{k/2}/(k Γ(k/2))
where
Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx

73

High Rate Entropy Approximation
The entropy of the quantized vector is given by
H = H(Q(X)) = −Σ_{i=1}^{N} P_X(S_i) log P_X(S_i),
where P_X(S_i) = ∫_{S_i} f_X(x) dx.
Again make the approximation that
P_X(S_i) ≈ f_X(y_i)/(N λ(y_i)) = f_X(y_i) V(S_i)
⇒
H ≈ h(X) − E[log(1/(N λ(X)))],
where h(X) is the differential entropy.
Thus for large N
H(Q(X)) ≈ h(X) − E[log(1/(N λ(X)))].
Connection between differential entropy and
entropy!
74
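A companion numerical check of this high-rate entropy approximation in the scalar uniform case, where λ(x) is constant and V(S_i) = Δ, so H(Q(X)) ≈ h(X) − log₂ Δ.

    import numpy as np
    from math import log2, pi, e

    rng = np.random.default_rng(5)
    x = rng.normal(size=2_000_000)
    h = 0.5 * log2(2 * pi * e)                    # differential entropy of N(0,1), in bits

    for delta in (0.5, 0.1, 0.02):
        idx = np.round(x / delta).astype(int)     # uniform scalar quantizer indices
        _, counts = np.unique(idx, return_counts=True)
        p = counts / counts.sum()
        H = -np.sum(p * np.log2(p))               # entropy of the quantizer output
        print(delta, H, h - log2(delta))          # empirical vs. h(X) - log2(Δ)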

Optimum Quantizer Point Densities
Assume that Gersho's conjecture holds. (Or use the
bound.)
(Otherwise can at least use the same ideas to obtain
lower bounds to the operational DRF and describe
point densities which result in lower bounds.)
Fixed rate:
D ≈ C_k E[1/λ(X)^{2/k}] 2^{−2R}
Find the optimum point density λ(x), i.e., find the λ
which minimizes
E(λ(X)^{−2/k}) = ∫ (f_X(x)/λ(x)^{2/k}) dx
subject to
∫ λ(x) dx = 1

75

Hölder's inequality implies
E(λ(X)^{−2/k}) = ∫ f(x) λ(x)^{−2/k} dx ≥ ||f_X||_{k/(k+2)}
with equality if and only if λ(x) is proportional
to f(x)^{k/(k+2)}.
Thus using the minimizing point density and Gersho's
conjecture, for N large
D̂_{k,f} ≈ C_k ||f_X||_{k/(k+2)} 2^{−2R}
These provide an approximation and a bound to
the operational DRF for fixed rate codes.
Variable rate: constrain H(Q(X)) instead of
k⁻¹ log N.
First: derive an interesting property for the fixed
rate case.

76

Partial Distortion Property
Assume Gersho's conjecture holds. Then
V(S_i) λ(y_i) ≈ 1/N
and
∫_{S_i} ||x − y_i||² f_X(x) dx ≈ C_k N^{−(k+2)/k} ||f_X||_{k/(k+2)}
for all i!
In the fixed rate case, for asymptotically large N the
optimum quantizer has the property that the
cells contribute approximately equally to the
overall average distortion.

77

Variable rate codes
Recall that for large N
H(Q(X)) ≈ h(X) − E[log(1/(N λ(X)))].
Find the optimum point density when entropy
constrained instead of N:
Jensen's inequality:
H ≈ h(X) − k E[log (1/(N λ(X)))^{1/k}]
  ≥ h(X) − k log E[(1/(N λ(X)))^{1/k}].
⇒
D(Q) ≳ C_k 2^{−(2/k)(kR − h(X))} = C_k 2^{2h(X)/k} 2^{−2R},
with equality iff λ(x) = constant, e.g.,
λ(x) = 1/V(A); x ∈ A
Thus under the Bennett conditions
D̂_{k,v}(R) ≈ C_k 2^{2h(X)/k} 2^{−2R}
with equality iff λ(x) = constant.
78

If N very large, then lattice codes chosen so that


Voronoi cell has minimum normalized moment of
inertia are nearly optimal in the ECVQ problem.
Tradeoff: Lattice codes mean simple NN selection
and nearly optimal variable rate performance,
but need to entropy code to attain performance.
ECVQ design can do better, and sometimes
much better, if the rate is not asymptotically big.
Note:
For the fixed rate case, cells should have
roughly equal partial distortion.
For the variable rate case, cells should have
roughly equal volume.
In neither case should you try to make cells have
equal probability (maximum entropy).

79

Summary
D(Q) ≈ E[m(X)/λ(X)^{2/k}] N^{−2/k}
D̂_{k,f}(R) ≈ C_k ||f_X||_{k/(k+2)} 2^{−2R}
D̂_{k,v}(R) ≈ C_k 2^{2h(X)/k} 2^{−2R}
Can use high rate quantization theory to
quantify the gains of vector quantization over
scalar quantization as a function of dimension k,
and to separate out the gains as those due to
memory, to space filling (moment of inertia), and
to shape (of the density function).

80

Gish & Pierce
Asymptotic theory implies that for iid sources and
high rates (low distortion)
D̂_{1,v}(R)/D̂_{k,v}(R) ≈ C_1/C_k → 1.533 dB (as k → ∞)
Or, equivalently, for low distortion
R̂_{1,v}(D) − R̂_{k,v}(D) ≈ 0.254 bits
⇒ the famous quarter bit result.
Suggests at high rates there may be little to be
gained by using vector quantization (but still
need to use vectors to do entropy coding!)
Ziv (1985) showed that for all distortions
R̂_{1,v}(D) − R̂_{k,v}(D) ≤ 0.754 bits
using a dithering argument: X̂ = Q(X + Z) − Z
with Z uniform and independent of X
(subtractive dither).
81

Comment: If we fix R and let k → ∞, then
D̂_{k,v}(R) → D(R)
D̂_{k,f}(R) → D(R)
the Shannon DRF.
Bennett theory has recently been extended to
input-weighted squared error measures.

82

Recent Extensions
Can generalize to input-dependent squared error
and to non-difference distortion measures that
behave locally in this way (have a well-behaved
Taylor series expansion):
d(X, X̂) = (X − X̂)^t B_X (X − X̂)
Fixed rate:
D ≈ D_L(Q_opt) = C_k N^{−2/k} { ∫ [f(x)(det(B(x)))^{1/k}]^{k/(k+2)} dx }^{1+2/k}
Optimal point density:
λ(x) ∝ (f(x)(det(B(x)))^{1/k})^{k/(k+2)}
Variable rate:
D ≈ D_L(Q_opt) = C_k e^{−(2/k)( H_Q − h(f) − (1/2) ∫ log(det(B(x))) f(x) dx )}

83

λ_opt(x) = (det(B(x)))^{1/2} / ∫_{x∈G} (det(B(x)))^{1/2} dx     (4)

84

Final Comments

The Shannon compression model:


codes, rate, and distortion
Mathematical model of compression system.
Successfully applied to asymptotic
performance bounds and design algorithms.
Optimality properties and quantizer
design
Lloyd optimal codes. For high rate, lattice
codes + lossless codes nearly optimal (variable
rate)
Rate-distortion ideas can be used to improve
standards-compliant and wavelet codes. (e.g.,
Ramchandran, Vetterli, Orchard)
Multiple distortion measures (include other
signal processing)
MPEG 4 style image structure?
85

Compression on networks, joint source and


channel coding
Structured vector quantizers
Transform/pyramid/subband/wavelet are
currently best (rate, distortion, complexity)
for image compression, especially embedded
zerotrees
Many variations and alternative structures
come and go, some stay.
Optimal achievable performance:
quantization and source coding theory
Very general Shannon theorems exist. Recent
flurry of work on universal and adaptive codes.
Multiple distortion measures, non-difference
(perceptual) distortion measures.
Current work on convergence and rate of
convergence (Vapnik-Chervonenkis theory), use
in classification and regression (Devroye,
Gyorfi, Gabor, Nobel, Olshen)
86

Open problems: Gersho's conjecture, correct


combination of wavelet theory + Bennett
(Goldberg did for traditional bit allocation
approach, not yet done for embedded zerotree
codes and related), other improved
combinations of wavelet and compression
theory and design (Mallat, Orchard).
Another issue: Perceived quality of compressed
audio and images.
Different applications have different fundamental
quality requirements:
Entertainment, browsing, screening, diagnostic,
legal, scientific
Quantitative as predictors for subjective and
diagnostic.

87
