C. A. Bouman: Digital Image Processing - April 17, 2013
Types of Coding
Source Coding - Code data to represent the information more efficiently
Reduces the size of the data
Analog - Encode analog source data into a binary format
Digital - Reduce the size of digital source data
Channel Coding - Code data for transmission over a noisy communication channel
Increases the size of the data
Digital - Add redundancy to identify and correct errors
Analog - Represent digital values by analog signals
The complete theory for both, Information Theory, was developed by Claude Shannon
Digital Image Coding
Images from a 6 MPixel digital camera are 18 MBytes each
Input and output images are digital
Output image must be smaller (e.g. 500 kBytes)
This is a digital source coding problem
Two Types of Source (Image) Coding
Lossless coding (entropy coding)
Data can be decoded to form exactly the same bits
Used in zip
Can only achieve moderate compression (e.g. 2:1 to 3:1) for natural images
Can be important in certain applications such as medical imaging
Lossy source coding
Decompressed image is visually similar, but has been changed
Used in JPEG and MPEG
Can achieve much greater compression (e.g. 20:1 to 40:1) for natural images
Uses entropy coding
Entropy
Let X be a random variable taking values in the set {0, ..., M−1} such that
p_i = P{X = i}
Then we define the entropy of X as
H(X) = -\sum_{i=0}^{M-1} p_i \log_2 p_i = -E[\log_2 p_X]
H(X) has units of bits
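For illustration (not part of the original slides), a minimal Python sketch that evaluates H(X) for a given pmf; the probabilities below are made up for the example.

import numpy as np

def entropy(p):
    """Entropy in bits of a discrete pmf p (zero entries are skipped since 0*log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Example pmf (values chosen only for illustration)
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))   # 1.75 bits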
Conditional Entropy and Mutual Information
Let (X, Y) be a pair of random variables taking values in the set {0, ..., M−1}^2 such that
p(i, j) = P{X = i, Y = j}
p(i|j) = \frac{p(i, j)}{\sum_{k=0}^{M-1} p(k, j)}
Then we define the conditional entropy of X given Y as
H(X|Y) = -\sum_{i=0}^{M-1} \sum_{j=0}^{M-1} p(i, j) \log_2 p(i|j) = -E[\log_2 p(X|Y)]
The mutual information between X and Y is given by
I(X; Y) = H(X) - H(X|Y)
The mutual information is the reduction in uncertainty of
X given Y .
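For concreteness (added here, not in the original), a small Python sketch that computes I(X;Y) = H(X) - H(X|Y) from a joint pmf; the joint pmf below is invented for the example.

import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = H(X) - H(X|Y) in bits, given a joint pmf p_xy[i, j] = P{X=i, Y=j}."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                      # marginal of X
    p_y = p_xy.sum(axis=0)                      # marginal of Y
    hx = -np.sum(p_x[p_x > 0] * np.log2(p_x[p_x > 0]))
    # H(X|Y) = -sum_{i,j} p(i,j) log2 p(i|j), where p(i|j) = p(i,j)/p_y[j]
    hxy = 0.0
    for i in range(p_xy.shape[0]):
        for j in range(p_xy.shape[1]):
            if p_xy[i, j] > 0:
                hxy -= p_xy[i, j] * np.log2(p_xy[i, j] / p_y[j])
    return hx - hxy

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])                   # illustrative joint pmf
print(mutual_information(p_xy))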
Entropy (Lossless) Coding of a Sequence
Let X_n be an i.i.d. sequence of random variables taking values in the set {0, ..., M−1} such that
P{X_n = m} = p_m
Xn for each n is known as a symbol
How do we represent Xn with a minimum number of bits
per symbol?
A Code
Definition: A code is a mapping from the discrete set of symbols {0, ..., M−1} to finite binary sequences
For each symbol m, there is a corresponding finite binary sequence (codeword) c_m
|c_m| is the length of the binary sequence c_m
Expected number of bits per symbol (bit rate):
\bar{n} = E[|c_{X_n}|] = \sum_{m=0}^{M-1} |c_m|\, p_m
Example for M = 4:

m    c_m       |c_m|
0    01        2
1    10        2
2    0         1
3    100100    6

Encoded bit stream:
(0, 2, 1, 3, 2) → (01|0|10|100100|0)
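A tiny Python sketch (added for illustration) that encodes a symbol sequence with the table above:

codebook = {0: "01", 1: "10", 2: "0", 3: "100100"}   # code from the example above

def encode(symbols, codebook):
    """Concatenate the codewords for a sequence of symbols."""
    return "".join(codebook[m] for m in symbols)

print(encode([0, 2, 1, 3, 2], codebook))   # '010101001000' (= 01|0|10|100100|0)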
Fixed versus Variable Length Codes
Fixed Length Code - |c_m| is constant for all m
Variable Length Code - |c_m| varies with m
Problem
Variable length codes may not be uniquely decodable
Example: Using the code from the previous page
(3) → (100100)
(1, 0, 2, 2) → (10|01|0|0)
Different symbol sequences can yield the same coded bit stream
Definition: A code is Uniquely Decodable if there exists
only a single unique decoding of each coded sequence.
Definition: A Prefix Code is a specific type of uniquely decodable code in which no codeword is a prefix of another codeword.
Lower Bound on Bit Rate
Theorem: Let C be a uniquely decodable code for the i.i.d. symbol sequence X_n. Then the mean code length is greater than or equal to H(X_n).
\bar{n} = E[|c_{X_n}|] = \sum_{m=0}^{M-1} |c_m|\, p_m \geq H(X_n)
Question: Can we achieve this bound?
Answer: Yes! Constructive proof using Huffman codes
Huffman Codes
Variable-length prefix code ⇒ uniquely decodable
Basic idea:
Low probability symbols → long codewords
High probability symbols → short codewords
Basic algorithm: recursively merge the two least probable symbols (see the next slide)
Huffman Coding Algorithm
1. Initialize the list of probabilities with the probability of each symbol.
2. Search the list of probabilities for the two smallest probabilities, p_k and p_l.
3. Add the two smallest probabilities to form a new probability, p_m = p_k + p_l.
4. Remove p_k and p_l from the list.
5. Add p_m to the list.
6. Repeat from step 2 until the list contains only one entry.
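A compact Python sketch of this algorithm (added for illustration; not part of the original slides). It builds the merge tree with a heap and reads the codewords back off the merges; the exact bit assignments may differ from the tree drawn on a later slide, but the code lengths match.

import heapq
import itertools

def huffman_code(probs):
    """Return a dict {symbol: codeword} for a list of symbol probabilities."""
    counter = itertools.count()          # tie-breaker so the heap never compares lists
    # Each heap entry: (probability, tie_breaker, list_of_symbols_in_this_subtree)
    heap = [(p, next(counter), [m]) for m, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = {m: "" for m in range(len(probs))}
    while len(heap) > 1:
        p_k, _, sym_k = heapq.heappop(heap)      # two smallest probabilities
        p_l, _, sym_l = heapq.heappop(heap)
        for m in sym_k:                          # prepend one bit to every symbol in each merged subtree
            codes[m] = "0" + codes[m]
        for m in sym_l:
            codes[m] = "1" + codes[m]
        heapq.heappush(heap, (p_k + p_l, next(counter), sym_k + sym_l))
    return codes

# Probabilities from the M = 8 example on the next slide
probs = [0.4, 0.08, 0.08, 0.2, 0.12, 0.07, 0.04, 0.01]
codes = huffman_code(probs)
avg_len = sum(probs[m] * len(codes[m]) for m in range(len(probs)))
print(codes, avg_len)     # average length is about 2.53 bits/symbol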
Recursive Merging for Huffman Code
Example for an M = 8 code with symbol probabilities
p_0 = 0.4, p_1 = 0.08, p_2 = 0.08, p_3 = 0.2, p_4 = 0.12, p_5 = 0.07, p_6 = 0.04, p_7 = 0.01

[Figure: recursive merging of the two smallest probabilities at each stage:
0.04 + 0.01 = 0.05, 0.05 + 0.07 = 0.12, 0.08 + 0.08 = 0.16, 0.12 + 0.12 = 0.24,
0.16 + 0.2 = 0.36, 0.24 + 0.36 = 0.6, 0.4 + 0.6 = 1.0]
Resulting Huffman Code
Binary codewords are given by the path through the tree from the root (0/1 branch labels):

p_0 → 1
p_1 → 0111
p_2 → 0110
p_3 → 010
p_4 → 001
p_5 → 0001
p_6 → 00001
p_7 → 00000

[Figure: Huffman code tree with branches labeled 0 and 1 from the root]
Upper Bound on Bit Rate of Huffman Code
Theorem: For a Huffman code, \bar{n} has the property that
H(X_n) \leq \bar{n} < H(X_n) + 1
A Huffman code is within 1 bit of optimal efficiency
Can we do better?
Coding in Blocks
We can code blocks of symbols to achieve a bit rate that
approaches the entropy of the source symbols.
\underbrace{X_0, \ldots, X_{m-1}}_{Y_0},\ \underbrace{X_m, \ldots, X_{2m-1}}_{Y_1},\ \ldots
So we have that
Y_n = (X_{nm}, \ldots, X_{(n+1)m-1})
where
Y_n \in \{0, \ldots, M^m - 1\}
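A small Python sketch (illustrative, not from the slides) that maps each block of m symbols to a single super-symbol index in {0, ..., M^m − 1}; the base-M weighting convention is an assumption made for the example.

import numpy as np

def block_symbols(x, m, M):
    """Group the symbol stream x into blocks of m and index each block in {0, ..., M**m - 1}."""
    x = np.asarray(x)
    x = x[: (len(x) // m) * m].reshape(-1, m)   # drop any partial block at the end
    weights = M ** np.arange(m)                 # base-M positional weights
    return x @ weights                          # Y_n = sum_i X_{nm+i} * M**i

x = np.random.randint(0, 3, size=12)            # M = 3 source symbols (illustrative)
print(x, block_symbols(x, m=3, M=3))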
Bit Rate Bounds for Coding in Blocks
It is easily shown that H(Y_n) = m H(X_n), and the number of bits per symbol X_n is given by \bar{n}_x = \bar{n}_y / m, where \bar{n}_y is the number of bits per symbol for a Huffman code of Y_n.
Then we have that
H(Y_n) \leq \bar{n}_y < H(Y_n) + 1
\frac{1}{m} H(Y_n) \leq \frac{\bar{n}_y}{m} < \frac{1}{m} H(Y_n) + \frac{1}{m}
H(X_n) \leq \bar{n}_x < H(X_n) + \frac{1}{m}
As the block size grows, we have
\lim_{m \to \infty} H(X_n) \leq \lim_{m \to \infty} \bar{n}_x \leq H(X_n) + \lim_{m \to \infty} \frac{1}{m}
H(X_n) \leq \lim_{m \to \infty} \bar{n}_x \leq H(X_n)
So we see that for a Huffman code of blocks with length m,
\lim_{m \to \infty} \bar{n}_x = H(X_n)
Comments on Entropy Coding
As the block size goes to infinity, the bit rate approaches the entropy of the source:
\lim_{m \to \infty} \bar{n}_x = H(X_n)
A Huffman coder can achieve this performance, but it requires a large block size.
As m becomes large, M^m becomes very large, so large blocks are not practical.
This assumes that Xn are i.i.d., but a similar result holds
for stationary and ergodic sources.
Arithmetic coders can be used to achieve this bitrate in
practical situations.
Run Length Coding
In some cases, long runs of symbols may occur. In this
case, run length coding can be effective as a preprocessor
to an entropy coder.
A typical run-length coder represents the data as pairs
..., (value, # of repetitions), (value, # of repetitions), ...
where 2^b is the maximum number of repetitions that can be represented.
Example: Let X_n ∈ {0, 1, 2}
0000000 111 222222 → (0, 7)(1, 3)(2, 6)
If more than 2^b repetitions occur, then the run is broken into segments. For example, with 2^b = 8:
0000000000 111 → (0, 8)(0, 2)(1, 3)
Many other variations are possible.
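An illustrative Python sketch of this kind of run-length encoder (the cap of 2^b repetitions per pair is the parameter b; everything here is added for illustration):

def run_length_encode(x, b=3):
    """Encode x as (value, run_length) pairs with run_length limited to 2**b."""
    max_run = 2 ** b
    pairs = []
    i = 0
    while i < len(x):
        value = x[i]
        run = 1
        while i + run < len(x) and x[i + run] == value and run < max_run:
            run += 1
        pairs.append((value, run))   # runs longer than 2**b are split into several pairs
        i += run
    return pairs

x = [0]*10 + [1]*3
print(run_length_encode(x, b=3))     # [(0, 8), (0, 2), (1, 3)]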
Predictive Entropy Coder for Binary Images
Used in transmission of fax images (CCITT G4 standard)
Framework
Let X_s be a binary image on a rectangular lattice, with s = (s_1, s_2) ∈ S
Let W be a causal window in raster order
Determine a model for p(x_s | x_{s+r}, r ∈ W)
Algorithm
1. For each pixel in raster order:
(a) Predict
\hat{X}_s = \begin{cases} 1 & \text{if } p(1 \mid X_{s+r}, r \in W) > p(0 \mid X_{s+r}, r \in W) \\ 0 & \text{otherwise} \end{cases}
(b) If X_s = \hat{X}_s, send 0; otherwise send 1
2. Run-length code the result
3. Entropy code the result
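A minimal Python sketch of steps 1(a)-(b) (everything here, including the window offsets and the predictor table name, is an illustrative assumption): the coder produces the binary prediction-error image, which is then run-length and entropy coded.

import numpy as np

def prediction_errors(x, p1_given_z):
    """Binary error image: 0 where the causal prediction is correct, 1 where it fails.
    x is a 0/1 image; p1_given_z[j] is P(x_s = 1 | causal neighborhood index j)."""
    H, W = x.shape
    e = np.zeros_like(x)
    offsets = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]        # illustrative causal window
    for s1 in range(H):
        for s2 in range(W):
            # Index j of the causal neighborhood z_s (pixels outside the image treated as 0)
            j = 0
            for bit, (r1, r2) in enumerate(offsets):
                t1, t2 = s1 + r1, s2 + r2
                v = x[t1, t2] if (0 <= t1 < H and 0 <= t2 < W) else 0
                j |= v << bit
            x_hat = 1 if p1_given_z[j] > 0.5 else 0        # step 1(a): predict the more likely value
            e[s1, s2] = 0 if x[s1, s2] == x_hat else 1     # step 1(b): send 0 if correct, else 1
    return e

x = (np.random.rand(8, 8) > 0.7).astype(int)               # stand-in for a binary image
p1 = np.full(16, 0.3)                                      # illustrative predictor table
print(prediction_errors(x, p1).mean())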
Predictive Entropy Coder Flow Diagram
[Encoder block diagram: X_s is XORed with the output of a Causal Predictor; the prediction errors pass through Run Length Encoding and then Huffman Coding.]
[Decoder block diagram: Huffman Decoding, then Run Length Decoding, then XOR with the output of a Causal Predictor to reconstruct X_s.]
How to Choose p(x_s | x_{s+r}, r ∈ W)?
Non-adaptive method
Select typical set of training images
Design predictor based on training images
Adaptive method
Allow predictor to adapt to images being coded
Design decoder so it adapts in same manner as encoder
Non-Adaptive Predictive Coder
Method for estimating predictor
1. Select typical set of training images
2. For each pixel in each image, form z_s = (x_{s+r_0}, ..., x_{s+r_{p-1}}), where {r_0, ..., r_{p-1}} = W.
3. Index the values of z_s from j = 0 to j = 2^p − 1.
4. For each pixel in each image, compute
h_s(i, j) = \delta(x_s = i)\,\delta(z_s = j)
and the histogram
h(i, j) = \sum_{s \in S} h_s(i, j)
5. Estimate p(x_s | x_{s+r}, r ∈ W) = p(x_s | z_s) as
p(x_s = i \mid z_s = j) = \frac{h(i, j)}{\sum_{k=0}^{1} h(k, j)}
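A small Python sketch of this training procedure (illustrative; the window offsets and variable names are assumptions):

import numpy as np

def train_predictor(images, offsets):
    """Estimate p(x_s = i | z_s = j) from binary training images by histogramming."""
    p = len(offsets)
    h = np.zeros((2, 2 ** p))                          # h(i, j)
    for x in images:
        H, W = x.shape
        for s1 in range(H):
            for s2 in range(W):
                j = 0
                for bit, (r1, r2) in enumerate(offsets):
                    t1, t2 = s1 + r1, s2 + r2
                    v = x[t1, t2] if (0 <= t1 < H and 0 <= t2 < W) else 0
                    j |= v << bit
                h[x[s1, s2], j] += 1                   # accumulate h(i, j)
    return h / np.maximum(h.sum(axis=0, keepdims=True), 1)   # p(i|j) = h(i, j) / sum_k h(k, j)

offsets = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]        # illustrative causal window
images = [(np.random.rand(16, 16) > 0.5).astype(int)]  # stand-in for real training images
print(train_predictor(images, offsets)[1])             # estimated p(x_s = 1 | z_s = j)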
Adaptive Predictive Coder
Adapt predictor at each pixel
Update the value of h(i, j) at each pixel using the equations
h(i, j) \leftarrow h(i, j) + \delta(x_s = i)\,\delta(z_s = j)
N(j) \leftarrow N(j) + 1
Use the updated values of h(i, j) to compute a new predictor at each pixel:
p(i|j) \leftarrow \frac{h(i, j)}{N(j)}
Design decoder to track encoder
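An illustrative Python sketch of the adaptive update for a single pixel (names and the window size are assumptions); the decoder runs the identical update on its decoded pixels so that it stays synchronized with the encoder.

import numpy as np

p_bits = 4                              # window size (illustrative)
h = np.ones((2, 2 ** p_bits))           # start from a flat histogram so p(i|j) is defined
N = h.sum(axis=0)

def adaptive_update(x_s, j):
    """Update the histogram with the pixel (x_s, z_s = j) and return the new p(1 | j)."""
    h[x_s, j] += 1                      # h(i, j) <- h(i, j) + delta(x_s = i) delta(z_s = j)
    N[j] += 1                           # N(j) <- N(j) + 1
    return h[1, j] / N[j]               # p(1 | j) = h(1, j) / N(j)

print(adaptive_update(x_s=1, j=5))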
Adaptive Predictive Entropy Coder Flow Diagram
[Encoder block diagram: X_s is XORed with the output of a Causal Predictor; the errors pass through Run Length Encoding and Huffman Coding. A Causal Histogram Estimation block updates the predictor from previously coded pixels.]
[Decoder block diagram: Huffman Decoding, Run Length Decoding, then XOR with the Causal Predictor output to reconstruct X_s; an identical Causal Histogram Estimation block tracks the encoder.]
Lossy Source Coding
Method for representing discrete-space signals with minimum distortion and bit-rate
Outline
Rate-distortion theory
Karhunen-Loève decorrelating transform
Practical coder structures
Distortion
Let X and Z be random vectors in R^M. Intuitively, X is the original image/data and Z is the decoded image/data.
Assume we use the squared error distortion measure given by
d(X, Z) = \|X - Z\|^2
Then the distortion is given by
D = E[d(X, Z)] = E\left[\|X - Z\|^2\right]
This actually applies to any quadratic norm error distortion measure, since we can define
\tilde{X} = AX \quad \text{and} \quad \tilde{Z} = AZ
so that
\tilde{D} = E\left[\|\tilde{X} - \tilde{Z}\|^2\right] = E\left[\|X - Z\|_B^2\right]
where B = A^t A.
Lossy Source Coding: Theoretical Framework
Notation for source coding:
X_n ∈ R^M for 0 ≤ n < N - a sequence of i.i.d. random vectors
Y ∈ {0, 1}^K - a K-bit random binary vector
Z_n ∈ R^M for 0 ≤ n < N - the decoded sequence of random vectors
X^{(N)} = (X_0, ..., X_{N−1})
Z^{(N)} = (Z_0, ..., Z_{N−1})
Encoder function: Y = Q(X_0, ..., X_{N−1})
Decoder function: (Z_0, ..., Z_{N−1}) = f(Y)
Resulting quantities:
\text{Bit-rate} = \frac{K}{N}
\text{Distortion} = \frac{1}{N} \sum_{n=0}^{N-1} E\left[\|X_n - Z_n\|^2\right]
How do we choose Q(·) to minimize the bit-rate and distortion?
Differential Entropy
Notice that the information contained in a Gaussian random variable is infinite, so the conventional entropy H(X)
is not defined.
Let X be a random vector taking values in R^M with density function p(x). Then we define the differential entropy of X as
h(X) = -\int_{x \in \mathbb{R}^M} p(x) \log_2 p(x)\, dx = -E[\log_2 p(X)]
h(X) has units of bits
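An illustrative Python check (added here): a Monte Carlo estimate of h(X) = −E[log2 p(X)] for a scalar Gaussian, compared with the known closed form (1/2) log2(2πeσ²).

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=200_000)

# Monte Carlo estimate of h(X) = -E[log2 p(X)] for the Gaussian density
log2_p = -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2) / (2 * sigma**2) / np.log(2)
h_mc = -np.mean(log2_p)

h_exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of N(0, sigma^2)
print(h_mc, h_exact)                                   # the two values should nearly agree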
Conditional Entropy and Mutual Information
Let X and Y be random vectors taking values in R^M with joint density function p(x, y) and conditional density p(x|y). Then we define the differential conditional entropy of X given Y as
h(X|Y) = -\int_{x \in \mathbb{R}^M} \int_{y \in \mathbb{R}^M} p(x, y) \log_2 p(x|y)\, dy\, dx = -E[\log_2 p(X|Y)]
The mutual information between X and Y is given by
I(X; Y ) = h(X) h(X|Y ) = I(Y ; X)
Important: The mutual information is well defined for
both continuous and discrete random variables, and it represents the reduction in uncertainty of X given Y .
The Rate-Distortion Function
Define/Remember:
Let X0 be the first element of the i.i.d. sequence.
Let D ≥ 0 be the allowed distortion.
For a specific distortion D, the rate is given by
R(D) = \inf_{Z} \left\{ I(X_0; Z) : E\left[\|X_0 - Z\|^2\right] \leq D \right\}
where the infimum (i.e. minimum) over Z is taken over all random vectors Z.
Later, we will show that for a given distortion we can find
a code that gets arbitrarily close to this optimum bit-rate.
Properties of the Rate-Distortion Function
Properties of R(D)
R(D) is a monotone decreasing function of D.
If D \geq E\left[\|X_0\|^2\right], then R(D) = 0
R(D) is a convex function of D
Shannon's Source-Coding Theorem
Shannon's Source-Coding Theorem:
For any R > R(D) and \tilde{D} > D, there exists a sufficiently large N such that there is an encoder
Y = Q(X_0, \ldots, X_{N-1})
which achieves
\text{Rate} = \frac{K}{N} \leq R
and
\text{Distortion} = \frac{1}{N} \sum_{n=0}^{N-1} E\left[\|X_n - Z_n\|^2\right] \leq \tilde{D}
Comments:
One can achieve a bit rate arbitrarily close to R(D) at
a distortion D.
The proof is constructive (but not practical), and uses codes that are randomly distributed in the space R^{MN} of source symbols.
Example 1: Coding of Gaussian Random
Variables
Let X ~ N(0, σ²) with distortion function E[|X − Z|²]. Then it can be shown that the rate-distortion function has the form
R(\Delta) = \max\left\{ \frac{1}{2} \log_2\left(\frac{\sigma^2}{\Delta}\right),\ 0 \right\}
D(\Delta) = \min\left\{ \sigma^2,\ \Delta \right\}
Intuition:
Δ is a parameter which represents the quantization step.
\frac{1}{2} \log_2\left(\frac{\sigma^2}{\Delta}\right) represents the number of bits required to encode the quantized scalar value.
The minimum number of bits must be ≥ 0.
The maximum distortion must be ≤ σ².
[Figure: distortion versus rate curve for the scalar Gaussian source]
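An illustrative Python sketch (added here) that sweeps the parameter Δ to trace out the rate and distortion of this parametric solution:

import numpy as np

sigma2 = 1.0                                  # source variance (illustrative)
delta = np.logspace(-3, 1, 9)                 # sweep of the quantization-step parameter

rate = np.maximum(0.5 * np.log2(sigma2 / delta), 0.0)   # R(delta), in bits
dist = np.minimum(sigma2, delta)                         # D(delta)

for d, r, dd in zip(delta, rate, dist):
    print(f"delta={d:8.4f}  rate={r:6.3f} bits  distortion={dd:6.4f}")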
Example 2: Coding of N Independent Gaussian
Random Variables
Let X = [X_0, ..., X_{N−1}]^t with independent components such that X_n ~ N(0, σ_n²), and define the distortion function to be
\text{Distortion} = E\left[\|X - Z\|^2\right]
Then it can be shown that the rate-distortion function has the form
R(\Delta) = \sum_{n=0}^{N-1} \max\left\{ \frac{1}{2} \log_2\left(\frac{\sigma_n^2}{\Delta}\right),\ 0 \right\}
D(\Delta) = \sum_{n=0}^{N-1} \min\left\{ \sigma_n^2,\ \Delta \right\}
Intuition:
In an optimal coder, the quantization step should be approximately equal for each random variable being quantized.
The bit rate and distortion both add across the components.
It can be proved that this solution is optimal.
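A short Python sketch of these sums (illustrative), sometimes called reverse water-filling: each component with σ_n² > Δ gets bits, and the rest are set to zero.

import numpy as np

def rate_distortion(sigma2, delta):
    """R(delta) and D(delta) for independent Gaussian components with variances sigma2."""
    sigma2 = np.asarray(sigma2, dtype=float)
    rate = np.sum(np.maximum(0.5 * np.log2(sigma2 / delta), 0.0))
    dist = np.sum(np.minimum(sigma2, delta))
    return rate, dist

sigma2 = [4.0, 1.0, 0.25, 0.01]        # illustrative component variances
for delta in [0.005, 0.1, 0.5, 2.0]:
    print(delta, rate_distortion(sigma2, delta))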
Example 3: Coding of Gaussian Random
Vector
Let X ~ N(0, R) be an N-dimensional Gaussian random vector, and define the distortion function to be
\text{Distortion} = E\left[\|X - Z\|^2\right]
Analysis: We know that we can always represent the covariance in the form
R = T \Lambda T^t
where the columns of T are the eigenvectors of R, and \Lambda = \text{diag}\{\sigma_0^2, \ldots, \sigma_{N-1}^2\} is a diagonal matrix of eigenvalues. We can then decorrelate the Gaussian random vector with the following transformation:
\tilde{X} = T^t X
From this we can see that \tilde{X} has the covariance matrix given by
E\left[\tilde{X}\tilde{X}^t\right] = E\left[T^t X X^t T\right] = T^t E\left[X X^t\right] T = T^t R T = \Lambda
So therefore, \tilde{X} meets the conditions of Example 2. Also, we see that
E\left[\|\tilde{X} - \tilde{Z}\|^2\right] = E\left[\|X - Z\|^2\right]
where \tilde{X} = T^t X and \tilde{Z} = T^t Z, because T is an orthonormal transform.
Example 3: Coding of Gaussian Random
Vector (Result)
Let X ~ N(0, R) be an N-dimensional Gaussian random vector, and define the distortion function to be
\text{Distortion} = E\left[\|X - Z\|^2\right]
Then it can be shown that the rate-distortion function has the form
R(\Delta) = \sum_{n=0}^{N-1} \max\left\{ \frac{1}{2} \log_2\left(\frac{\sigma_n^2}{\Delta}\right),\ 0 \right\}
D(\Delta) = \sum_{n=0}^{N-1} \min\left\{ \sigma_n^2,\ \Delta \right\}
where \sigma_0^2, \ldots, \sigma_{N-1}^2 are the eigenvalues of R.
Intuition:
An optimal code requires that the components of a vector be decorrelated before source coding.
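An illustrative Python sketch (added here): take the eigenvalues of a covariance R as the σ_n² and evaluate the same sums as in Example 2; the transform T^t X is the Karhunen-Loève (decorrelating) transform mentioned in the outline. The covariance matrix below is made up for the example.

import numpy as np

R = np.array([[2.0, 0.9, 0.3],
              [0.9, 1.0, 0.2],
              [0.3, 0.2, 0.5]])                     # illustrative covariance matrix

eigvals, T = np.linalg.eigh(R)                      # R = T diag(eigvals) T^t
sigma2 = eigvals                                    # variances of the decorrelated components

delta = 0.2
rate = np.sum(np.maximum(0.5 * np.log2(sigma2 / delta), 0.0))
dist = np.sum(np.minimum(sigma2, delta))
print(rate, dist)

# Decorrelating (KLT) transform of a sample x: x_tilde = T^t x
x = np.random.default_rng(1).multivariate_normal(np.zeros(3), R)
print(T.T @ x)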
Example 4: Coding of Stationary Gaussian
Random Process
Let X_n be a stationary Gaussian random process with power spectrum S_x(ω), and define the distortion function to be
\text{Distortion} = E\left[|X_n - Z_n|^2\right]
Then it can be shown that the rate-distortion function has the form
R(\Delta) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \max\left\{ \frac{1}{2} \log_2\left(\frac{S_x(\omega)}{\Delta}\right),\ 0 \right\} d\omega
D(\Delta) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \min\left\{ S_x(\omega),\ \Delta \right\} d\omega
Intuition:
The Fourier transform decorrelates a stationary Gaussian random process.
Frequencies with power spectrum amplitude below Δ are clipped to zero.
[Figure: distortion versus rate curve for the stationary Gaussian process]
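An illustrative Python sketch (added here) that evaluates these two integrals numerically for an assumed AR(1)-like power spectrum:

import numpy as np

omega = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
rho = 0.9
Sx = (1 - rho**2) / (1 - 2 * rho * np.cos(omega) + rho**2)   # illustrative AR(1) power spectrum

def rate_distortion(delta):
    """Approximate the (1/2pi) integrals by an average over a uniform frequency grid."""
    rate = np.mean(np.maximum(0.5 * np.log2(Sx / delta), 0.0))
    dist = np.mean(np.minimum(Sx, delta))
    return rate, dist

for delta in [0.01, 0.1, 1.0]:
    print(delta, rate_distortion(delta))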
The Discrete Cosine Transform (DCT)
DCT (There is more than one version)
F(k) = \frac{c(k)}{\sqrt{N}} \sum_{n=0}^{N-1} f(n) \cos\left(\frac{\pi(2n+1)k}{2N}\right)
where
c(k) = \begin{cases} 1 & k = 0 \\ \sqrt{2} & k = 1, \ldots, N-1 \end{cases}
Inverse DCT (IDCT)
f(n) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} F(k)\, c(k) \cos\left(\frac{\pi(2n+1)k}{2N}\right)
Comments:
In this form, the DCT is an orthonormal transform. So if we define the matrix F such that
F_{n,k} = \frac{c(k)}{\sqrt{N}} \cos\left(\frac{\pi(2n+1)k}{2N}\right),
then F^{-1} = F^H, where F^* = F and therefore F^t = F^H (since F is real).
Takes an N-point real-valued signal to an N-point real-valued signal.
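An illustrative numpy sketch (added here) that builds this orthonormal DCT matrix and checks that its inverse is its transpose (the matrix below is indexed [k, n] for convenience):

import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix A with A[k, n] = c(k)/sqrt(N) * cos(pi*(2n+1)*k/(2N))."""
    n = np.arange(N)
    k = np.arange(N).reshape(-1, 1)
    c = np.where(k == 0, 1.0, np.sqrt(2.0))
    return c / np.sqrt(N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))

N = 8
A = dct_matrix(N)
print(np.allclose(A @ A.T, np.eye(N)))        # orthonormal: inverse equals transpose

f = np.random.default_rng(0).normal(size=N)   # N-point real-valued signal
F = A @ f                                     # forward DCT
print(np.allclose(A.T @ F, f))                # IDCT recovers f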
Relationship Between DCT and DFT
Let us define the padded version of f(n) as
f_p(n) = \begin{cases} f(n) & 0 \leq n \leq N-1 \\ 0 & N \leq n \leq 2N-1 \end{cases}
and let F_p(k) denote its 2N-point DFT. Then the DCT can be written as
F(k) = \frac{c(k)}{\sqrt{N}} \sum_{n=0}^{N-1} f(n) \cos\left(\frac{\pi(2n+1)k}{2N}\right)
= \frac{c(k)}{\sqrt{N}} \sum_{n=0}^{N-1} f(n)\, \mathrm{Re}\left\{ e^{-j\frac{\pi k}{2N}}\, e^{-j\frac{2\pi n k}{2N}} \right\}
= \mathrm{Re}\left\{ e^{-j\frac{\pi k}{2N}}\, \frac{c(k)}{\sqrt{N}} \sum_{n=0}^{2N-1} f_p(n)\, e^{-j\frac{2\pi n k}{2N}} \right\}
= \frac{c(k)}{\sqrt{N}}\, \mathrm{Re}\left\{ e^{-j\frac{\pi k}{2N}}\, F_p(k) \right\}
= \frac{c(k)}{2\sqrt{N}} \left( F_p(k)\, e^{-j\frac{\pi k}{2N}} + F_p(-k)\, e^{+j\frac{\pi k}{2N}} \right) \qquad \text{(since } f_p \text{ is real, } F_p^*(k) = F_p(-k)\text{)}
= \frac{c(k)\, e^{-j\frac{\pi k}{2N}}}{2\sqrt{N}} \left( F_p(k) + F_p(-k)\, e^{j\frac{2\pi k}{2N}} \right)
Interpretation of DFT Terms
Consider the 2N-point DFT pair corresponding to each of the two terms:
f_p(n) \leftrightarrow F_p(k)
f_p(2N-1-n) \leftrightarrow F_p(-k)\, e^{j\frac{2\pi k}{2N}}
Simple example for N = 4:
f(n) = [f(0), f(1), f(2), f(3)]
f_p(n) = [f(0), f(1), f(2), f(3), 0, 0, 0, 0]
f_p(2N-1-n) = [0, 0, 0, 0, f(3), f(2), f(1), f(0)]
f_p(n) + f_p(2N-1-n) = [f(0), f(1), f(2), f(3), f(3), f(2), f(1), f(0)]
Relationship Between DCT and DFT
(Continued)
So the DCT can be computed from the 2N-point DFT of the symmetric extension f_p(n) + f_p(2N-1-n):
F(k) = \frac{c(k)\, e^{-j\frac{\pi k}{2N}}}{2\sqrt{N}}\, \mathrm{DFT}_{2N}\left\{ f_p(n) + f_p(2N-1-n) \right\}, \qquad k = 0, \ldots, N-1
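A short numpy check (added here; the factor 1/(2 sqrt(N)) and the symmetric-extension index 2N−1−n follow the reconstruction above) that this DFT-based expression reproduces the DCT computed directly from its definition:

import numpy as np

N = 8
rng = np.random.default_rng(0)
f = rng.normal(size=N)
k = np.arange(N)

# Direct orthonormal DCT-II from the definition
c = np.where(k == 0, 1.0, np.sqrt(2.0))
F_direct = (c / np.sqrt(N)) * np.array(
    [np.sum(f * np.cos(np.pi * (2 * np.arange(N) + 1) * kk / (2 * N))) for kk in k])

# DCT via the 2N-point DFT of the symmetric extension f_p(n) + f_p(2N-1-n)
ext = np.concatenate([f, f[::-1]])                 # [f(0),...,f(N-1), f(N-1),...,f(0)]
Fp_ext = np.fft.fft(ext)                           # 2N-point DFT
F_dft = (c * np.exp(-1j * np.pi * k / (2 * N)) / (2 * np.sqrt(N))) * Fp_ext[:N]

print(np.allclose(F_direct, F_dft.real), np.max(np.abs(F_dft.imag)))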