Fundamentals of Vector Quantization
Vector Quantization
Robert M. Gray
Information Systems Laboratory
Department of Electrical Engineering
http://www-isl.stanford.edu/~gray/compression.html
Part I:
codes, rate, and distortion
Figure 1: Input Signal → Encoder → Channel → Decoder → Reconstructed Signal
Assumptions:
Signal is discrete time or space (e.g., already sampled).
Do not separate out signal decompositions; i.e., assume they are either done already or are to be done as part of the code.
Consider a code structure that maps blocks or vectors of input data into possibly variable-length binary strings.
Later consider recursive code structures.
Discrete-time random process input signal $\{X_n\}$, $X_n \in \Re$.
Vectors $X^k = (X_0, X_1, \ldots, X_{k-1}) \in A \subset \Re^k$, with distribution $P_X$.
Usually assume some form of stationarity (strict, asymptotic, etc.).
If not stationary: handle with universal or adaptive codes.
Encoder
An encoder (or source encoder) is a mapping $\alpha$ of the input vectors into a collection $\mathcal{W}$ of finite-length binary sequences:
$$\alpha : A \to \mathcal{W} \subset \{0,1\}^*.$$
$\mathcal{W}$ = channel codebook; its members are channel codewords: the set of binary sequences that will be stored or transmitted.
Assume that $\mathcal{W}$ satisfies the prefix condition so that it is uniquely decodable.
Given an $i \in \{0,1\}^*$, define
$l(i)$ = length of binary vector $i$,
instantaneous rate $r(i) = l(i)/k$ bits/input symbol.
Average rate $R(\alpha, \mathcal{W}) = E[r(\alpha(X))]$.
$$A \stackrel{\alpha}{\longrightarrow} \mathcal{W} \stackrel{\beta}{\longrightarrow} \mathcal{C}$$
or, equivalently,
Encoder: $i_n = \alpha(X_n)$.  Decoder: $\hat{X}_n = \beta(i_n)$.
General block memoryless source code.
Later consider codes with memory, but a general block code might operate in a locally nonmemoryless fashion.
A code is invertible or noiseless or lossless if $\beta(\alpha(x)) = x$; it is lossy if it is not lossless.
If lossy, require a measure of distortion $d$ to quantify how lossy.
$$d(X, \hat{X}) = (X - \hat{X})^t B_X (X - \hat{X}),$$
with $B_X$ positive definite. Most common: $B_X = I$, giving
$$d(X, \hat{X}) = \|X - \hat{X}\|^2 \quad \text{(MSE)}.$$
Other measures use input-dependent weights $B_x$.
Performance of a compression system is measured by the expected values of the distortion and rate:
$$D(\alpha, \mathcal{W}, \beta) = D(\alpha, \beta) = E[d(X, \beta(\alpha(X)))]$$
$$R(\alpha, \mathcal{W}, \beta) = R(\alpha, \mathcal{W}) = E[r(\alpha(X))] = \frac{1}{k} E[l(\alpha(X))]$$
The operational rate-distortion and distortion-rate functions:
$$\hat{R}(D) = \inf_{\alpha, \mathcal{W}, \beta:\, D(\alpha,\beta) \le D} R(\alpha, \mathcal{W})$$
$$\hat{D}(R) = \inf_{\alpha, \mathcal{W}, \beta:\, R(\alpha,\mathcal{W}) \le R} D(\alpha, \beta)$$
Easy: $\hat{D}(R)$ and $\hat{R}(D)$ are monotonically nonincreasing in their arguments.
Note: usually wish to optimize over constrained
subset of computationally reasonable codes, or
implementable codes.
Examples: Fixed rate codes, tree-structured
codes, product codes
Introduce structured codes later, but mention
fixed rate codes now.
A quantizer is the composition $Q(X) = \beta(\alpha(X))$:
$$A \stackrel{\alpha}{\longrightarrow} \mathcal{I} \leftrightarrow \mathcal{W} \stackrel{\beta}{\longrightarrow} \mathcal{C},$$
where $\mathcal{I}$ is the index set, $\mathcal{W}$ the channel codebook, and $\mathcal{C}$ the reproduction codebook.
$$R(\alpha, \mathcal{W}, \beta) = \frac{1}{k}\, E[l(\alpha(X))]$$
distortion: $d(x, Q(x)) = d(x, \beta(\alpha(x)))$
average distortion: $D(\alpha, \mathcal{W}, \beta) = D(\alpha, \beta) = E[d(X, \beta(\alpha(X)))]$
$$S_i = \{x : \alpha(x) = i\}, \qquad i \in \mathcal{I}.$$
This partitions the input space into a disjoint, exhaustive collection of cells.
Define
$$1_F(x) = \begin{cases} 1 & \text{if } x \in F \\ 0 & \text{otherwise.} \end{cases}$$
Then
$$\alpha(x) = \sum_{i \in \mathcal{I}} i\, 1_{S_i}(x)$$
and
$$\beta(\alpha(x)) = \sum_{i \in \mathcal{I}} \beta(i)\, 1_{S_i}(x).$$
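To make the mapping concrete, here is a minimal sketch (Python/NumPy, illustrative names not taken from the text) of a fixed-rate VQ viewed exactly this way: the encoder picks the index of the nearest codeword, the decoder is a table lookup into the reproduction codebook, and rate and average MSE are measured as defined above.

```python
# Minimal fixed-rate VQ sketch: encoder = nearest-neighbor index, decoder = table lookup.
import numpy as np

def encode(x, codebook):
    """alpha: return the index i of the cell S_i containing x (nearest codeword)."""
    return int(((codebook - x) ** 2).sum(axis=1).argmin())

def decode(i, codebook):
    """beta: map the index back to the reproduction codeword."""
    return codebook[i]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 2))           # N = 8 codewords, dimension k = 2
X = rng.normal(size=(1000, 2))               # test vectors

N, k = codebook.shape
rate = np.log2(N) / k                        # fixed rate: R = (1/k) log2 N bits/sample
indices = np.array([encode(x, codebook) for x in X])
Xhat = codebook[indices]
mse = float(np.mean(np.sum((X - Xhat) ** 2, axis=1)) / k)   # per-sample MSE
```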
Optimality Properties
and Quantizer Design

Extreme Points
All of the emphasis on distortion: force 0 average distortion and minimize rate only as an afterthought. This gives the point $(0, R)$: 0 distortion, corresponding to lossless codes.
All of the emphasis on rate: force 0 average rate and minimize distortion as an afterthought. This gives the point $(D, 0)$: 0 rate, no bits communicated.
Then $\hat{D}(0) = \inf_y E[\|X - y\|^2]$, achieved by the centroid $y = E[X]$.
E.g., for the input-weighted squared error
$$d(X, \hat{X}) = (X - \hat{X})^t B_X (X - \hat{X}),$$
the centroid is $E[B_X]^{-1} E[B_X X]$.
Empirical Distributions
Training set or learning set
$$\mathcal{L} = \{x_i;\ i = 0, 1, \ldots, L-1\}$$
The empirical distribution is
$$P_L(F) = \frac{\#\{i : x_i \in F\}}{L} = \frac{1}{L}\sum_{i=0}^{L-1} 1_F(x_i).$$
Find the vector $y$ that minimizes
$$E[\|X - y\|^2] = \frac{1}{L}\sum_{n=0}^{L-1} \|x_n - y\|^2.$$
Answer = expectation:
$$y = \frac{1}{L}\sum_{n=0}^{L-1} x_n,$$
the Euclidean center of gravity of the collection of vectors in $\mathcal{L}$: the sample mean or empirical mean.
The optimal decoder for a given encoder $\alpha$ puts $\beta(i)$ at the centroid of cell $S_i$:
$$\beta(i) = \min_{y \in \hat{A}}{}^{-1}\, E[d(X, y) \mid \alpha(X) = i], \qquad i \in \mathcal{I},$$
since
$$D(\alpha, \beta) + R(\alpha, \beta) = E[d(X, \beta(\alpha(X)))] + R(\alpha, \beta)
= \sum_{i \in \mathcal{I}} E[d(X, \beta(i)) \mid \alpha(X) = i]\, P_X(S_i) + R(\alpha, \beta)$$
$$\ge \sum_{i \in \mathcal{I}} P_X(S_i)\, \min_y E[d(X, y) \mid \alpha(X) = i] + R(\alpha, \beta).$$
If the distribution is empirical, the centroid is the conditional sample average.
For the more general input-weighted squared error:
$$\hat{x}_i = E[B_X \mid \alpha(X) = i]^{-1}\, E[B_X X \mid \alpha(X) = i].$$
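A sketch of the centroid condition on a training set (illustrative names, assuming NumPy): for each cell the optimal reproduction is the conditional sample average, and for the input-weighted squared error it becomes the weighted average $E[B_X]^{-1}E[B_X X]$ over the cell.

```python
# Centroids from an empirical (training-set) distribution. labels[n] is the encoder
# output alpha(x_n); weights (optional) are the matrices B_x for the input-weighted
# squared error. Names are illustrative.
import numpy as np

def centroids(train, labels, num_cells, weights=None):
    cents = np.zeros((num_cells, train.shape[1]))
    for i in range(num_cells):
        cell = labels == i
        if not np.any(cell):
            continue                                   # empty cell: leave codeword at 0
        if weights is None:
            cents[i] = train[cell].mean(axis=0)        # conditional sample mean
        else:
            Bsum = weights[cell].sum(axis=0)           # sum of B_x over the cell
            Bxsum = np.einsum('nij,nj->i', weights[cell], train[cell])  # sum of B_x x
            cents[i] = np.linalg.solve(Bsum, Bxsum)    # E[B]^{-1} E[B x] over the cell
    return cents
```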
For a given encoder $\alpha$ and decoder $\beta$, optimize over the lossless index code $\gamma$:
$$E[d(X, \beta(\alpha(X)))] + E[l(\gamma(W))] = D(\alpha, \beta) + E[l(\gamma(W))], \qquad W = \alpha(X),$$
and the best lossless code achieves $\min_\gamma E[l(\gamma(\alpha(X)))]$. Since
$$E[l(\gamma(\alpha(X)))] \ge E[-\log_2 \Pr(\alpha(X))],$$
replace the average length by the entropy $H(\alpha(X))$: entropy-constrained VQ.
For a given decoder $\beta$ and length function, the optimal encoder is
$$\alpha(x) = \min_{i \in \mathcal{I}}{}^{-1}\, [\,d(x, \beta(i)) + l(\gamma(i))\,].$$
Proof:
$$E[d(X, \beta(\alpha(X))) + l(\gamma(\alpha(X)))] = \int dP_X(x)\, [\,d(x, \beta(\alpha(x))) + l(\gamma(\alpha(x)))\,]
\ge \int dP_X(x)\, \min_{i \in \mathcal{I}} [\,d(x, \beta(i)) + l(\gamma(i))\,].$$
In the fixed-rate case (all lengths equal) this reduces to the minimum-distortion (nearest-neighbor) encoder
$$\alpha(x) = \min_{i \in \mathcal{I}}{}^{-1}\, d(x, \beta(i)).$$
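Putting the two optimality conditions together gives the (generalized) Lloyd algorithm on a training set: alternately re-encode with the minimum-distortion rule and replace each codeword by the centroid of its cell. A minimal fixed-rate sketch (illustrative names; no length/entropy term, i.e., the special case where all codeword lengths are equal):

```python
# Generalized Lloyd (k-means style) design of a fixed-rate VQ codebook.
import numpy as np

def lloyd(train, N, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), N, replace=False)].copy()  # initial codewords
    for _ in range(iters):
        # Optimal encoder for this decoder: nearest neighbor (minimum distortion).
        d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Optimal decoder for this encoder: centroid (conditional sample mean).
        for i in range(N):
            if np.any(labels == i):
                codebook[i] = train[labels == i].mean(axis=0)
        # Empty cells are left unchanged; practical designs re-seed or split them.
    return codebook

train = np.random.default_rng(1).normal(size=(5000, 2))
codebook = lloyd(train, N=16)
```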
Variations
Problem with general codes satisfying Lloyd
conditions is complexity:
A nearest-neighbor encoder requires computing the distortion between the input and every codeword, so computation and memory grow exponentially with the number of bits $kR$.
Partial solutions:
Allow suboptimal, but good, components.
Decrease in performance may be compensated by
decrease in complexity or memory. May actually
yield a better code in practice.
E.g., by simplifying the search and decreasing storage, one might be able to implement an otherwise unimplementable dimension for a fixed rate R.
The running centroid can be updated incrementally:
$$\bar{y}_i(n) = \frac{(n-1)\,\bar{y}_i(n-1) + x(n)}{n} = \bar{y}_i(n-1) + \frac{1}{n}\,\bigl(x(n) - \bar{y}_i(n-1)\bigr).$$
Alternative form: define $a(n) = 1/n$; the k-means update rule is
$$\bar{y}_i(n) = \bar{y}_i(n-1) + a(n)\,\bigl(x(n) - \bar{y}_i(n-1)\bigr)$$
if $d(x(n), \bar{y}_i(n-1)) \le d(x(n), \bar{y}_l(n-1))$ for all $l$.
This is the idea behind Kohonen's self-organizing feature map (SOFM) applied to clustering.
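A sketch of the incremental update (illustrative names): each new training vector moves only the nearest codeword toward it with gain $a(n)$. Here a per-codeword count is used so each codeword tracks the running mean of the vectors assigned to it, a slight variant of the $a(n) = 1/n$ rule above.

```python
# Incremental (online) k-means update: the winning codeword moves toward each new vector.
import numpy as np

def online_kmeans_step(x, codebook, counts):
    i = int(((codebook - x) ** 2).sum(axis=1).argmin())   # winner: nearest codeword
    counts[i] += 1
    a = 1.0 / counts[i]                                   # gain a(n) = 1/n_i
    codebook[i] += a * (x - codebook[i])                  # move winner toward x
    return i

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 2))
counts = np.zeros(8, dtype=int)
for x in rng.normal(size=(10_000, 2)):
    online_kmeans_step(x, codebook, counts)
```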
A soft (probabilistic) assignment of $x$ to codeword $y_i$ with temperature $T$:
$$\frac{e^{-\frac{1}{T} d(x, y_i)}}{\sum_l e^{-\frac{1}{T} d(x, y_l)}}$$
Suboptimal Encoders
Several NN search speedups and shortcuts
reported in quantization and pattern recognition
literature. Most ad hoc and help to some degree.
Tree-structured codes
Every channel codebook is assumed to be a prefix code and hence can be depicted as a binary tree:
[Figure: a binary prefix code drawn as a binary tree, showing the root node, branches labeled 0 and 1, parent/child and sibling nodes, and terminal nodes (leaves) carrying codewords such as 10, 110, 011, 1110, 1111, and 0000–0101.]
Can either
1. Begin with codebook and design a tree search
for the codebook.
2. Modify Lloyd to design the tree structured
codebook from scratch.
One way to design TSVQ from scratch:
Step 0 Begin with optimal 0 rate tree.
Step 1 Split node to form rate 1 bit per
vector tree.
Step 2 Run Lloyd.
Step 3 If desired rate achieved, stop. Else either
Split all terminal nodes (balanced tree), or
Split worst terminal node.
Step 4 Run Lloyd.
Step 5 Go to Step 3.
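A compact sketch of growing a balanced TSVQ from a training set (illustrative names, assuming NumPy). It simplifies the steps above by running a local two-codeword Lloyd at each node on the training vectors routed to that node, rather than a full Lloyd pass over the entire tree after each split; encoding then descends the tree one bit at a time.

```python
# Hypothetical sketch of balanced TSVQ design by recursive node splitting.
import numpy as np

def two_means(data, iters=20):
    """Local Lloyd with two codewords, used to split a node's training data."""
    c = data.mean(axis=0)
    eps = 1e-3 * (data.std(axis=0) + 1e-12)
    y = np.stack([c - eps, c + eps])                 # perturb the node centroid
    for _ in range(iters):
        d = ((data[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
        label = d.argmin(axis=1)                     # nearest of the two children
        for j in range(2):
            if np.any(label == j):
                y[j] = data[label == j].mean(axis=0) # centroid update
    label = ((data[:, None, :] - y[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return y, label

def grow_balanced_tsvq(data, depth):
    """Return a dict mapping binary node paths to codewords."""
    tree = {"": data.mean(axis=0)}
    nodes = {"": data}
    for _ in range(depth):
        new_nodes = {}
        for path, pts in nodes.items():
            if len(pts) < 2:                         # cannot split an (almost) empty node
                continue
            y, label = two_means(pts)
            for j in (0, 1):
                child = path + str(j)
                tree[child] = y[j]
                new_nodes[child] = pts[label == j]
        nodes = new_nodes
    return tree

def tsvq_encode(x, tree, depth):
    """Encode x by greedy descent: follow the nearer child at each level."""
    path = ""
    for _ in range(depth):
        cands = [c for c in (path + "0", path + "1") if c in tree]
        if not cands:
            break
        path = min(cands, key=lambda c: ((x - tree[c]) ** 2).sum())
    return path, tree[path]

# Example: depth-3 (3 bits/vector) TSVQ on 2-D Gaussian training data.
train = np.random.default_rng(1).normal(size=(2000, 2))
tree = grow_balanced_tsvq(train, depth=3)
bits, xhat = tsvq_encode(np.array([0.4, -1.2]), tree, depth=3)
```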
[Figure: splitting a node of a TSVQ: codewords 0 and 1 are formed, and only the training vectors mapped into codeword 0 are used to design that node's children.]
Growing Trees

Balanced Tree Growing
Split all nodes in a level; cluster labels for the child nodes using the distribution conditioned on the parent.
Problem: can get empty nodes.

Unbalanced Tree Growing
Split one node at a time.
Split the worst node, the one with the largest conditional distortion or partial distortion (Makhoul, Roucos, Gish).
Split in a greedy tradeoff fashion: maximize
$$\lambda(t) = \frac{|\text{change in distortion if node } t \text{ is split}|}{\text{change in rate if node } t \text{ is split}}.$$
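In code, the greedy rule is just an argmax of that ratio over the current terminal nodes. A hypothetical helper, assuming the $(\Delta D, \Delta R)$ values were measured by tentatively splitting each node on the training set:

```python
# Greedy tree growing: pick the terminal node whose split gives the largest
# distortion decrease per bit of rate increase. `candidates` is assumed to be a
# list of (node_id, delta_distortion, delta_rate) tuples with delta_rate > 0.
def best_node_to_split(candidates):
    return max(candidates, key=lambda c: abs(c[1]) / c[2])[0]
```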
Pruning trees
Prune trees by finding the subtree $T$ that minimizes
$$\lambda(T) = \frac{|\text{change in distortion if subtree } T \text{ is pruned}|}{|\text{change in rate if subtree } T \text{ is pruned}|}.$$
These optimal subtrees are nested.
(Generalized BFOS algorithm, related to the CART™ algorithm.)
[Figure: average distortion versus average rate (roughly 0.5 to 3.0 bits) for the nested pruned subtrees.]
Advantages (of uniform/lattice quantizers):
Parameters: scale and support region.
Fast nearest neighbor algorithms known.
Good theoretical approximations for
performance.
Approximately optimal if used with entropy
coding: For high rate uniform quantization
approximately minimizes average distortion
subject to an entropy constraint.
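A minimal sketch of the last point (assumed names, with a unit-variance Gaussian source purely for illustration): quantize uniformly with step $\Delta$ and estimate the rate by the empirical entropy of the indices; at high rate the distortion is close to $\Delta^2/12$ and the entropy close to $h(X) - \log_2 \Delta$.

```python
# High-rate uniform quantization with entropy coding, sketched numerically.
import numpy as np

def uniform_quantize(x, delta):
    """Midtread uniform quantizer: bin index and reproduction value."""
    idx = np.round(x / delta).astype(int)
    return idx, idx * delta

def empirical_entropy_bits(indices):
    """Empirical entropy of the index stream, in bits per sample."""
    _, counts = np.unique(indices, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

x = np.random.default_rng(0).normal(size=100_000)    # unit-variance Gaussian source
delta = 0.25
idx, xhat = uniform_quantize(x, delta)
mse = float(np.mean((x - xhat) ** 2))                 # ~ delta**2 / 12 at high rate
rate = empirical_entropy_bits(idx)                    # ~ h(X) - log2(delta)
```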
Classified VQ
Switched VQ: Separate codebook for each input
class (e.g., active, inactive, textured, background,
edge with orientation)
[Figure: classified VQ encoder: a classifier examines $x_n$ and selects a codebook index $i_n$; a VQ using codebook $C_{i_n}$ (one of $C_1, \ldots, C_m$) produces the codeword index $u_n$; both indices are transmitted.]
Transform VQ
[Figure: transform VQ: the input vector $(X_1, \ldots, X_k)$ is transformed by $T$, the transform coefficients are vector quantized, and the decoder applies the inverse transform $T^{-1}$.]
Multistage VQ
(Multistep or Cascade or Residual VQ)
[Figure: 2-stage encoder: $Q_1$ quantizes $X$ to $\hat{X}_1$; the residual $E_2 = X - \hat{X}_1$ is quantized by $Q_2$ to $\hat{E}_2$; the reconstruction is $\hat{X} = \hat{X}_1 + \hat{E}_2$ and the indices $I_1, I_2$ are sent. A multistage encoder repeats this on successive residuals ($E_3 = E_2 - \hat{E}_2$ into $Q_3$, etc.). The 3-stage decoder maps $I_1, I_2, I_3$ through decoders $D_1, D_2, D_3$ and sums the outputs.]
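A minimal sketch of multistage (residual) VQ encoding and decoding, assuming the stage codebooks were designed beforehand (e.g., the second trained on first-stage residuals); the names and toy codebooks are illustrative.

```python
# Two-stage (residual) VQ sketch.
import numpy as np

def nn_encode(x, codebook):
    """Nearest-neighbor (minimum squared error) index for x."""
    return int(((codebook - x) ** 2).sum(axis=1).argmin())

def multistage_encode(x, codebooks):
    """Quantize x stage by stage, each stage coding the remaining residual."""
    indices, residual = [], x.astype(float)
    for cb in codebooks:
        i = nn_encode(residual, cb)
        indices.append(i)
        residual = residual - cb[i]
    return indices

def multistage_decode(indices, codebooks):
    """Reconstruction is the sum of the selected stage codewords."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Toy usage with random 4-D codebooks of size 16 (assumed, for illustration only).
rng = np.random.default_rng(0)
cb1, cb2 = rng.normal(size=(16, 4)), 0.3 * rng.normal(size=(16, 4))
x = rng.normal(size=4)
idx = multistage_encode(x, [cb1, cb2])
xhat = multistage_decode(idx, [cb1, cb2])
```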
Product Codes
Mean-Removed VQ
Shape codebook $C_r = \{\tilde{U}_j;\ j = 1, \ldots, N_r\}$; mean codebook $C_m = \{\hat{m}_i;\ i = 1, \ldots, N_m\}$.
[Figure: mean-removed VQ encoder: the sample mean of the input vector is scalar quantized using $C_m$ and subtracted from the input; the mean-removed residual (shape) is vector quantized using $C_r$.]
Shape/Gain VQ
Shape codebook $C_s = \{\tilde{S}_j;\ j = 1, \ldots, N_s\}$; gain codebook $C_g = \{\hat{g}_i;\ i = 1, \ldots, N_g\}$.
Encoder: first choose the shape maximizing the correlation $X^t \tilde{S}_j$, then choose the gain minimizing $[\hat{g}_i - X^t \tilde{S}_j]^2$.
[Figure: shape/gain decoder: the index pair $(i, j)$ addresses a ROM and the output is $\hat{g}_i \tilde{S}_j$.]
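A sketch of this sequential shape/gain encoder (illustrative names and codebooks, assuming unit-norm shape codewords): choose the shape with the largest correlation $X^t \tilde{S}_j$, then the gain closest to that correlation; the decoder outputs $\hat{g}_i \tilde{S}_j$.

```python
# Sequential shape/gain VQ encoding sketch.
import numpy as np

def shape_gain_encode(x, shapes, gains):
    """Pick the shape maximizing x^T S_j, then the gain minimizing (g_i - x^T S_j)^2."""
    corr = shapes @ x                          # inner products x^T S_j
    j = int(corr.argmax())
    i = int(((gains - corr[j]) ** 2).argmin())
    return i, j

def shape_gain_decode(i, j, shapes, gains):
    return gains[i] * shapes[j]

rng = np.random.default_rng(0)
shapes = rng.normal(size=(32, 8))
shapes /= np.linalg.norm(shapes, axis=1, keepdims=True)   # normalize the shapes
gains = np.linspace(0.1, 5.0, 16)                         # assumed gain codebook
x = rng.normal(size=8)
i, j = shape_gain_encode(x, shapes, gains)
xhat = shape_gain_decode(i, j, shapes, gains)
```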
Hierarchical VQ
[Figure: hierarchical VQ on successively larger blocks: the original at 8 bpp, 2-D VQ at 4 bpp, 4-D VQ at 2 bpp, 8-D VQ at 1 bpp.]
Recursive VQ
Predictive VQ
Codebook $C = \{\tilde{e}_i;\ i = 1, \ldots, 2^R\}$
[Figure: predictive VQ encoder: the prediction error $e_n = X_n - \tilde{X}_n$ is vector quantized, producing index $i_n$ and reproduction $\hat{e}_n$; the reconstruction $\hat{X}_n = \tilde{X}_n + \hat{e}_n$ drives the vector predictor. Decoder: $\hat{e}_n = \mathrm{VQ}^{-1}(i_n)$ and $\hat{X}_n = \tilde{X}_n + \hat{e}_n$, using the same vector predictor.]
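A minimal sketch of the predictive VQ loop with a simple first-order vector predictor $\tilde{X}_n = a\,\hat{X}_{n-1}$ (the predictor and names are illustrative assumptions): the encoder quantizes the prediction error and tracks the decoder's reconstruction so both stay in lockstep.

```python
# Predictive VQ sketch with a first-order predictor.
import numpy as np

def nn_encode(e, codebook):
    return int(((codebook - e) ** 2).sum(axis=1).argmin())

def predictive_vq_encode(X, codebook, a=0.9):
    """X: (n, k) sequence of vectors; quantize prediction errors e_n = X_n - a*Xhat_{n-1}."""
    xhat_prev = np.zeros(X.shape[1])
    indices, recon = [], []
    for x in X:
        pred = a * xhat_prev                  # vector predictor output
        i = nn_encode(x - pred, codebook)     # quantize the prediction error e_n
        xhat = pred + codebook[i]             # reconstruction Xhat_n
        indices.append(i)
        recon.append(xhat)
        xhat_prev = xhat                      # encoder tracks the decoder's state
    return indices, np.array(recon)

def predictive_vq_decode(indices, codebook, k, a=0.9):
    xhat_prev = np.zeros(k)
    out = []
    for i in indices:
        xhat = a * xhat_prev + codebook[i]
        out.append(xhat)
        xhat_prev = xhat
    return np.array(out)
```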
Finite State VQ
Switched vector quantizer: a different codebook for each state + a next-state rule.
[Figure: encoder and decoder each hold codebooks $C_1, \ldots, C_K$; the current state $S_n$ (held in a unit delay) selects codebook $C_{S_n}$, which maps the input to index $i_n$ and the index to the reproduction $\hat{x}_n = C_{S_n}(i_n)$; the next-state function produces $S_{n+1}$, so encoder and decoder track the state identically.]
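A small sketch of the finite-state VQ decoder (hypothetical names and next-state rule): the state selects which codebook decodes the current index, and the next state is computed from decoded data only, so the encoder can track it.

```python
# Finite-state VQ decoding sketch. codebooks[s] is the codebook for state s.
import numpy as np

def fsvq_decode(indices, codebooks, next_state, s0=0):
    s, out = s0, []
    for i in indices:
        xhat = codebooks[s][i]      # decode with the current state's codebook
        out.append(xhat)
        s = next_state(s, xhat)     # next-state rule uses only decoded data
    return np.array(out)

# Illustrative next-state rule: switch state on the sign of the decoded mean.
def next_state(s, xhat):
    return 0 if xhat.mean() < 0 else 1
```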
Part IV:
Quantization and Source Coding
Theory
$X = X^k$, with $k$ allowed to vary.
Can define optima for increasing dimensions: $\hat{D}_k(R)$, $\hat{R}_k(D)$, and $\hat{J}_{\lambda,k}$.
The quantities are subadditive & one can define the asymptotic optimal performance
$$\hat{D}(R) = \inf_k \hat{D}_k(R) = \lim_{k \to \infty} \hat{D}_k(R) \qquad (1)$$
$$\hat{R}(D) = \inf_k \hat{R}_k(D) = \lim_{k \to \infty} \hat{R}_k(D) \qquad (2)$$
It turns out that
$$\hat{D}(R) = D(R), \qquad \hat{R}(D) = R(D),$$
i.e., operational DRF (RDF) = Shannon DRF (RDF).
To define these Shannon quantities we need some definitions from information theory:
Average mutual information between two discrete random variables $X$ and $Y$:
$$I(X;Y) = H(X) + H(Y) - H(X,Y) = \sum_{x,y} \Pr(X=x, Y=y)\, \log_2 \frac{\Pr(X=x, Y=y)}{\Pr(X=x)\Pr(Y=y)}$$
$D + \lambda R$, where $D = D(R)$ at the point where $\lambda$ is the magnitude of the slope of the DRF.
Shannon's distortion-rate theory is asymptotic: fixed rate $R$ and asymptotically large block size $k$.
Average distortion:
$$D(Q) = \frac{1}{k} \sum_{i=1}^{N} \int_{S_i} \|x - y_i\|^2 f_X(x)\, dx$$
Average rate: for a fixed-rate code,
$$R(Q) = \frac{1}{k} \log N;$$
for a variable-rate code,
$$R(Q) = H_k(Q(X)) = -\frac{1}{k} \sum_{i=1}^{N} P_X(S_i) \log P_X(S_i).$$
Operational optima over fixed-rate ($\mathcal{Q}_f$) and variable-rate ($\mathcal{Q}_v$) quantizers:
$$\hat{D}_{k,f}(R) = \inf_{Q \in \mathcal{Q}_f:\, R(Q) \le R} D(Q), \qquad \hat{D}_{k,v}(R) = \inf_{Q \in \mathcal{Q}_v:\, R(Q) \le R} D(Q).$$
Split the average distortion into the contributions of the bounded (granular) and unbounded (overload) cells:
$$\sum_{i: V(S_i) < \infty} \int_{S_i} \|x - y_i\|^2 f(x)\, dx \;+\; \sum_{i: V(S_i) = \infty} \int_{S_i} \|x - y_i\|^2 f(x)\, dx.$$
Bennett assumptions:
N is very large
fX is smooth (so that Riemann sums
approach Riemann integrals and mean value
theorem of calculus applies)
The total overload distortion is negligible.
E.g., all the probability is on a bounded set.
The volumes V (Si) of all bounded cells are
tiny.
The reproduction codewords are the Lloyd
centroids of their cell.
Under these assumptions,
$$D(Q) \approx \frac{1}{k} \sum_{i=1}^{N} P_X(S_i)\, \frac{1}{V(S_i)} \int_{S_i} \|x - y_i\|^2\, dx.$$
Define the quantizer point density
$$\lambda(x) = \sum_{i=1}^{N} \frac{1}{N\, V(S_i)}\, 1_{S_i}(x), \qquad \int_{\Re^k} \lambda(x)\, dx = 1.$$
Then
$$\frac{1}{N\, V(S_i)} = \lambda(y_i).$$
For a fixed-rate code, $R = \frac{1}{k} \log_2 N$, i.e., $N = 2^{kR}$.
Optimizing m(x)
Gersho's conjecture:
If $f_X(x)$ is smooth & $R$ is large, the minimum distortion quantizer has cells $S_i$ that are (approximately) scaled, rotated, & translated copies of $S^*$, the convex polytope that tessellates $\Re^k$ with minimum normalized moment of inertia $M(S)$, i.e.,
$$m(x) = C_k = \min_S M(S) = M(S^*),$$
where
$$M(S) = \frac{\int_S \|x\|^2\, dx}{k\, V(S)^{1+2/k}}.$$
An upper bound comes from the sphere:
$$C_k \le M(\text{sphere}) = \frac{V_k^{-2/k}}{k+2},$$
where $V_k$ is the volume of the unit sphere in $\Re^k$ (expressible via $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$).
For the index entropy,
$$H(Q(X)) = -\sum_{i=1}^{N} P_X(S_i) \log P_X(S_i) \approx h(X) - E\!\left[\log \frac{1}{N \lambda(X)}\right],$$
where $h(X)$ is the differential entropy. Thus for large $N$
$$H(Q(X)) \approx h(X) - E\!\left[\log \frac{1}{N \lambda(X)}\right].$$
Connection between differential entropy and entropy!
Choose the point density to minimize
$$\int \frac{f_X(x)}{\lambda(x)^{2/k}}\, dx \qquad \text{subject to} \qquad \int \lambda(x)\, dx = 1.$$
With the optimal point density,
$$\int_{S_i} \|x - y_i\|^2 f_X(x)\, dx \approx C_k\, N^{-\frac{k+2}{k}}\, \|f_X\|_{k/(k+2)} \qquad \text{for all } i!$$
In the fixed-rate case, for asymptotically large N the optimum quantizer has the property that the cells contribute approximately equally to the overall average distortion.
For the entropy-constrained (variable-rate) case,
$$D(Q) \approx C_k\, 2^{\frac{2}{k} h(X)}\, 2^{-2R}.$$

Summary
$$D(Q) \approx E\!\left[\frac{m(X)}{\lambda(X)^{2/k}}\right] N^{-2/k}$$
$$\hat{D}_{k,f}(R) \approx C_k\, \|f_X\|_{k/(k+2)}\, 2^{-2R}$$
$$\hat{D}_{k,v}(R) \approx C_k\, 2^{\frac{2}{k} h(X)}\, 2^{-2R}$$
At high rates, entropy-constrained scalar quantization ($k = 1$) comes within about 1.533 dB of $\hat{D}_{k,v}(R)$. Or, equivalently, for low distortion
$$\hat{R}_{1,v}(D) - \hat{R}_{k,v}(D) \lesssim 0.254 \text{ bits},$$
the famous quarter bit result.
Suggests at high rates there may be little to be gained by using vector quantization (but still need to use vectors to do entropy coding!).
Ziv (1985) showed that for all distortions
$$\hat{R}_{1,v}(D) - \hat{R}_{k,v}(D) \le 0.754 \text{ bits},$$
using a dithering argument: $\hat{X} = Q(X + Z) - Z$ with $Z$ uniform and independent of $X$ (subtractive dither).
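As a numerical check, assuming these constants come from the classical high-rate factor $\pi e/6$ relating entropy-coded scalar quantization to the $k \to \infty$ (Shannon) performance:
$$\frac{\pi e}{6} \approx 1.423, \qquad 10 \log_{10} \frac{\pi e}{6} \approx 1.533\ \text{dB}, \qquad \frac{1}{2} \log_2 \frac{\pi e}{6} \approx 0.2546\ \text{bits} \approx \frac{1}{4}\ \text{bit}.$$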
Note that $D(R) \le \hat{D}_{k,f}(R)$ for every $k$, and $\hat{D}_{k,f}(R) \to D(R)$ as $k \to \infty$.
Recent Extensions
Can generalize to input-dependent squared error
and to non-difference distortion measures that
behave locally in this way (have a well-behaved
Taylor series expansion)
$$d(X, \hat{X}) = (X - \hat{X})^t B_X (X - \hat{X}).$$
Fixed rate:
$$D \approx D_L(Q_{\mathrm{opt}}) = C_k\, N^{-2/k}\, \left\{ \int \left[ f(x)\, (\det B(x))^{1/k} \right]^{\frac{k}{k+2}} dx \right\}^{\frac{k+2}{k}},$$
with optimal point density proportional to $\left( f(x)\, (\det B(x))^{1/k} \right)^{\frac{k}{k+2}}$.
Variable rate:
$$D \approx D_L(Q_{\mathrm{opt}}) = \frac{k\, C_k}{k+2}\, e^{-\frac{2}{k}\left( H_Q - h(p) - \frac{1}{2} \int \log(\det B(x))\, f(x)\, dx \right)},$$
with optimal point density
$$\lambda_{\mathrm{opt}}(x) = \frac{(\det B(x))^{1/2}}{\int_{x \in G} (\det B(x))^{1/2}\, dx}.$$
Final Comments