Lecture 3
1 Overview
This handout covers the basics of image and video compression as follows:
2. The Need for Compression
3. Towards Compression
4. Run Length Encoding
5. Signal Transforms
6. Video Compression
2 The Need for Compression
[Figure: colour subsampling patterns 4:2:2, 4:1:1 and 4:2:0.]
In one frame there are 720 × 576 luminance pixels plus the colour samples, roughly 830,000 samples in all. As each sample is represented by one byte, that is roughly 830,000 bytes per frame. At 25 frames/sec this means a bandwidth of about 20 MB/sec.
The available bandwidth for a single Digital television channel is at best 6 Mbits/sec. This is about 30 times smaller than the 20 MB/sec needed. A DVD can store at most 4 GB, so how does one fit 2 hours of movie on a DVD?
Your digital mobile phone can handle maybe 1 Mbit/sec absolute tops. That is about 180 times smaller than required for video.
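As a quick sanity check on these numbers, here is the back-of-the-envelope arithmetic in Python (a sketch: the 720 × 576 frame size and 2 bytes per pixel for 4:2:2 colour are my assumptions, chosen to match the 20 MB/sec figure above).

```python
# Rough arithmetic behind the numbers above (assumed: 720x576 SD frames,
# 4:2:2 colour sampling => ~2 bytes per pixel on average, 25 frames/sec).
width, height = 720, 576
bytes_per_pixel = 2                 # 1 byte luma + 1 byte (shared) chroma
fps = 25

bytes_per_frame = width * height * bytes_per_pixel   # ~830,000 bytes
raw_rate_MBps = bytes_per_frame * fps / 1e6           # ~20.7 MB/sec
raw_rate_Mbps = raw_rate_MBps * 8                     # ~166 Mbit/sec

print(f"raw rate: {raw_rate_MBps:.1f} MB/s = {raw_rate_Mbps:.0f} Mbit/s")
print(f"compression needed for a 6 Mbit/s TV channel: {raw_rate_Mbps / 6:.0f}x")
print(f"compression needed for a 1 Mbit/s mobile link: {raw_rate_Mbps / 1:.0f}x")
```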
Imagine you are a film and TV archive (like www.ina.fr or the BBC or rte). You need to keep
a record of 24 hours of programming on 100s of channels daily for up to 50 years (in the case
of the BBC). Hmm... there is not enough space in a town to stack up the CDs needed to store
that!
So a mechanism is needed to represent images with fewer bytes than the raw data.
3 Towards Compression
I don't really need 720 × 576 pixels for my 1 inch mobile screen do I? So I can keep only every 4th pixel and every 4th line (subsampling) for instance, and yield a much smaller picture instead. So now I can show the same picture for 1/16 the storage. Not good enough. Besides, such small pictures look really crap on a TV set.
[Figure: the same frame at full resolution and at the subsampled (CIF) resolution.]
What if I start to think about mathematical models for pictures . . . ? Then I can send/store the parameters of my model instead of the actual pictures, and if my model is simple, I can store fewer parameters than pixels and get some compression. Hmmm. But pictures look pretty complicated. In fact most interesting pictures tend to be different from other pictures. Otherwise why look?
It turns out that you can make some generic statements about images and image sequences.
1. In small, local regions, pixel intensity and colour tend to be the same or at least slowly varying.
2. Consecutive frames in an image sequence tend to look very much like the frames nearby.
4 Run Length Encoding
Consider that you want to transmit a fax as an image. There are just 2 colours: 0 = black and 1 = white. Let's say your image is as below (the letter H in a binary image).
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 0
0 1 1 0 1 1 0 0
0 1 1 1 1 1 0 0
0 1 1 1 1 1 0 0
0 1 1 0 1 1 0 0
0 0 0 0 0 0 0 0
Instead of sending every single pixel, since there tend to be long lengths of consecutive repeated pixels (i.e. long runs) we could send a pixel value (a 0 for instance) followed by the number of times it is repeated.
So instead of sending or storing 0 0 0 0 0 0 0 0 for instance, you would store [0 8], the first number being the colour, and the second being the number of times that colour occurred consecutively. Instead of storing 8 bytes, we have stored just 2. We have encoded some raw data of 8 zeros as just 2 bytes. We have achieved a compression factor of 4!
In typical RLE schemes, you do not account for all possible runs. Instead you only allow for
runs of length say 0 to 32 for instance. Then a run of length 64 would need to be encoded as
2 runs of length 32.
Let's say for our RLE scheme we allow a maximum run length of 8, and the data is either 0 or 1. The image example then can be represented by . . .
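To make the scheme concrete, here is a minimal run-length encoder and decoder in Python (an illustrative sketch, not the format of any real fax or image standard), applied to rows of the binary H image above.

```python
def rle_encode(row, max_run=8):
    """Encode a sequence of symbols as (symbol, run_length) pairs.
    Runs longer than max_run are split into several runs."""
    runs = []
    i = 0
    while i < len(row):
        j = i
        while j < len(row) and row[j] == row[i] and (j - i) < max_run:
            j += 1
        runs.append((row[i], j - i))
        i = j
    return runs

def rle_decode(runs):
    out = []
    for symbol, length in runs:
        out.extend([symbol] * length)
    return out

print(rle_encode([0] * 8))                  # [(0, 8)]: 8 pixels stored as one pair
row = [0, 1, 1, 0, 1, 1, 0, 0]              # third row of the H image
print(rle_encode(row))                      # [(0, 1), (1, 2), (0, 1), (1, 2), (0, 2)]
assert rle_decode(rle_encode(row)) == row   # the process is exactly reversible
```

Note how the all-zero rows compress to a single pair while the rows containing the H need five pairs each; RLE only wins when the data really does contain long runs.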
But what about a real/grayscale image? Hmm. RLE might get inefficient if the data is not mostly flat! Consider this small block of grayscale values, for instance:
10 32 22 12
10 20 20 10
10 30 20 10
8 31 20 15
5 Signal Transforms
What if it were possible to change the image in some reversible process, so that we created a result
that was easier to compress? In other words we take our data and transform it in some clever way
to make RLE work better.
This is related to another idea.
Suppose I had a photoalbum/dictionary of all the possible images in the world ever made in the
past and ever will be possible in the future. And suppose I gave you a copy of this dictionary in
which each image was assigned a number.
Then instead of having to send you the raw data, I would just send you the number of the image
in the dictionary, and you could look it up and you’d have the picture!
This dictionary would be very large since pictures come in many flavours. To make a smaller
dictionary, you can instead choose images which when added together make up the picture you
want to send or store.
So now to send a picture, the transmitting end has to work out which set of images could be
added together to give the picture. Then the transmitter sends the indexes of those images to the
receiver. The receiver then looks up the pictures and then adds them together to give the received
image.
About 200 years ago[1], a guy called Fourier spotted that you could actually do this with any signal. He was working on 1D signals but the same applies to 2D ones.
[1] No electricity, no computers, no cinema, no television, no hot baths, no baths, no showers. Lice in your hair all the time, no soap, no nylon, no jeans, no flushing toilets, no sewage system ...
The brilliant discovery of Fourier, was that any 1D signal can be represented by a weighted sum of
sines and cosines. So to make a triangle wave for instance, all you need to do is to add a bunch of
sines and cosines together of different frequencies and different amplitudes.
[Figure: sine waves of different frequencies and amplitudes plotted against time (seconds), together with the signal obtained by summing them; the resulting amplitude-versus-frequency summary has components of size 2/π, −1/π, . . . at frequencies 1, 2, . . .]
And he came up with a mathematical formula that says which frequencies and which amplitudes
were needed to synthesise a particular signal.
Since we all know what sines and cosines look like, we can summarise this signal decomposition with a graph of Amplitude versus Frequency. That graph will tell us how much of each frequency is present in the signal.
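To see the recipe in action, here is a small numpy sketch that sums sine waves with amplitudes 2/π, −1/π, 2/3π, . . . at frequencies 1, 2, 3, . . . (the values that appear in the figure above); the number of terms and the time grid are my choices.

```python
import numpy as np

t = np.linspace(0, 5, 1000)       # 5 "seconds" of signal

terms = 8
signal = np.zeros_like(t)
spectrum = []                     # the Amplitude-versus-Frequency summary
for k in range(1, terms + 1):
    amplitude = (2 / (np.pi * k)) * (-1) ** (k + 1)   # 2/pi, -1/pi, 2/(3 pi), ...
    signal += amplitude * np.sin(2 * np.pi * k * t)   # add one sine wave
    spectrum.append((k, amplitude))

# The first two entries are the 2/pi and -1/pi seen in the figure.
print(spectrum[:2])               # [(1, 0.6366...), (2, -0.3183...)]
```

Adding more terms makes the synthesised waveform sharper; the list of (frequency, amplitude) pairs is the whole description of the signal.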
With 2D signals things are a bit trickier. 2D sines and cosines look a bit like a wave in a wave tank,
or a wave in your bath, or a wave in the sea. Except the wave is a wave in intensity or brightness.
The equation for working out how much of each wave you need to make a picture is also a bit tricky.
Furthermore, each wave is represented by a complex number. Urgh?
[Figure: a 2D cosine wave in image intensity, shown as an image and as a surface. The wave is directed at an angle off the horizontal, with a certain frequency in cycles per pel in that direction and a certain phase lag.]
Instead electrical/signal processing engineers have come up with a simpler[4] Transform that uses only Cosine waves. This transform, known as the Discrete Cosine Transform, results in only real numbers. It is the basis of JPEG.
[4] Not really.
JPEG is based on Transforming 8 × 8 blocks of pixels using the 2D DCT. For a signal of 8 samples, the 8 possible DCT basis functions (the dictionary) are as below.
[Figure: the eight 8-point DCT basis functions, plotted as rows 1 to 4 (left panel) and rows 5 to 8 (right panel), each offset vertically for clarity.]
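The 8-point DCT "dictionary" can be written down directly. Here is a sketch that builds the orthonormal DCT-II basis with numpy; its eight rows are the waveforms plotted above, and because the basis is orthonormal the transform is exactly reversible.

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis function."""
    C = np.zeros((N, N))
    for k in range(N):
        for n in range(N):
            C[k, n] = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
        C[k, :] *= np.sqrt(1.0 / N) if k == 0 else np.sqrt(2.0 / N)
    return C

C = dct_matrix(8)
x = np.arange(8, dtype=float)     # any 8-sample signal will do
X = C @ x                         # forward DCT: weight of each basis function
x_back = C.T @ X                  # inverse DCT (C is orthonormal, so C^-1 = C^T)
assert np.allclose(x, x_back)
```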
The 64 2D DCT basis functions and the 2D DCT of an 8 × 8 block in Lenna are shown below.
[Figure: the 64 2D DCT basis functions (left) and the 2D DCT of an 8 × 8 block from Lenna (right).]
Now we can see that the effect of Transforming a block of pixels is to concentrate its energy into just a few coefficients: the block is much flatter in the DCT space, with most coefficients close to zero. This means that we have less information to transmit. Here is what happens if we take every 8 × 8 block in Lenna and transform it with the 2D DCT.
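Here is a sketch of what "transform every 8 × 8 block" means in practice (a smooth synthetic image stands in for Lenna; the point is only the mechanics and the energy compaction count).

```python
import numpy as np
from scipy.fft import dctn        # n-dimensional DCT-II

# A smooth synthetic test image; real images are locally smooth in the same way.
h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
image = 128 + 50 * np.sin(xx / 20.0) + 30 * np.cos(yy / 15.0)

coeffs = np.zeros_like(image)
for r in range(0, h, 8):
    for c in range(0, w, 8):
        block = image[r:r+8, c:c+8]
        coeffs[r:r+8, c:c+8] = dctn(block, norm='ortho')   # 2D DCT of one block

# Energy compaction: only a handful of coefficients per block are significant,
# so most of the transformed image could be coded with very few bits.
significant = np.mean(np.abs(coeffs) > 1.0)
print(f"coefficients with magnitude > 1: {significant:.1%}")
```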
6 Video Compression
All the best codecs for media are based on transforming the data in some way. JPEG2000 is
based on a new kind of transform, the Wavelet Transform, discovered only in the late 1980s.
Compression of audio .mp3 is based on 1D DCT. MPEG (Motion Picture Experts Group) is
used for compression of video for DVD or DTV [MPEG1,2,4]. Ireland was a major player in
establishing the MPEG 4 standard.
Intel Indeo, Apple Quicktime, Divx are all based on MPEGGy ideas.
MPEG is based again on the 8 point DCT just like JPEG except....
In video most consecutive pictures look the same. So if I knew what one picture looked
like, then in theory I could build all the others by slightly adjusting that one. This is called
prediction.
But things move around in video, so we have to estimate that motion to work out how to shift
the pixels around in order to create the next image.
To understand how prediction can help with video compression, consider the top row of figure 2, which shows a sequence of frames from the Suzie sequence. It is QCIF (176 × 144) resolution at a frame rate of 30 frames/sec.
We have already seen that Transform coding of images yields significant levels of compression, e.g. JPEG. Therefore a first step at compressing a sequence of data is to consider each picture separately, using the 2D DCT of 8 × 8 blocks. The DCT coefficients for each frame of Suzie are shown in the second row of figure 2. The use of the DCT on the raw image data yields a compression of the original 8 bits/pel data to about 0.8 bits/pel on each frame. Note that, for demonstration purposes, the DCT coefficients have NOT been quantised using the standard JPEG Quantisation matrix.
We know that most images in a sequence are mostly the same as the frames nearby except with different
object locations. Thus we can propose that the image sequence obeys a simple predictive model (discussed
in previous lectures) as follows:
I_n(x) = I_{n-1}(x + d_{n,n-1}(x)) + e(x)     (1)

where e(x) is some small prediction error that is due to a combination of noise and "model mismatch". Thus we can measure the prediction error at each pixel x in a frame as

DFD(x) = I_n(x) - I_{n-1}(x + d_{n,n-1}(x))     (2)

This is the motion compensated prediction error, sometimes referred to as the Displaced Frame Difference (DFD). The only model parameter required to be estimated is the motion vector d_{n,n-1}(x). Assume for the moment that we use some process to estimate these vectors. We will look at that later.
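To make equations (1) and (2) concrete, here is a tiny sketch that computes the DFD of one block for a given integer motion vector (the frame data and the vector values are made up for the example).

```python
import numpy as np

def dfd_block(frame_n, frame_prev, top, left, d, N=8):
    """DFD (equation 2) for the N x N block at (top, left) in frame_n,
    predicted from frame_prev displaced by the integer vector d = (dy, dx)."""
    dy, dx = d
    target     = frame_n[top:top+N, left:left+N]
    prediction = frame_prev[top+dy:top+dy+N, left+dx:left+dx+N]
    return target - prediction

rng = np.random.default_rng(0)
prev = rng.integers(0, 256, size=(64, 64)).astype(float)
curr = np.roll(prev, shift=(-2, -3), axis=(0, 1))   # each pixel's match in prev lies 2 down, 3 right

err = dfd_block(curr, prev, top=16, left=16, d=(2, 3))
print(np.abs(err).mean())   # 0.0: the vector matches the true motion exactly
```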
Figure 1 illustrates how motion compensation can be applied to predict any frame from any previous
frame using motion estimation. The figure shows block based motion vectors being used to match every block in frame n with the block that is most similar in frame n−1. The difference between the corresponding blocks is then the DFD that has to be coded.

[Figure 1: Block based motion compensation. A motion vector maps a block in frame n onto the best matching block in frame n−1, following the movement of the object between the two frames.]
In MPEG, the situation shown in figure 1 (where frame n is predicted by a motion compensated version of frame n−1) is called Forward Prediction. The block that is to be constructed (i.e. the block in frame n) is called the Target Block. The frame that is supplying the prediction is called the Reference Picture, and the resulting data used for the motion compensation (i.e. the displaced block in frame n−1) is the Prediction Block.
The fourth row of Figure 2 shows the prediction error of each frame of the Suzie sequence, starting from the first frame as a reference. A three level Block Matcher was used, with a threshold for motion detection applied at the highest resolution level. Each DFD frame is the difference between frame n and a motion compensated prediction built from the original previous frame, n−1.
Again, we can compress this sequence of 'transformed' images (including the first I frame) using the DCT of 8 × 8 blocks. Now the amount of data needed is about 0.4 bits/pel. Substantial compression has been achieved over attempting to compress each image separately. Of course, you will have deduced that this was going to be the case because there is much less information content in the DFD frames than in the original picture data.
Figure 2: Frames 50-53 of the Suzie sequence processed by various means. From Top to Bottom row: Original Frames; DCT of Top Row; Non-motion compensated DFD; Motion Compensated DFD with backward prediction; DCT of previous row.

To confirm that it is indeed motion compensated prediction that is contributing most of the benefit, the 3rd row of figure 2 shows the non-motion compensated frame difference (FD), I_n(x) − I_{n−1}(x), between the frames of Suzie. There is substantially more energy in these FD frames than in the DFD frames, hence the higher bit rate.
A closer look at the DFD frame sequence in row 4 of Figure 2 shows that in frames 52 and 53 (in particular) there are some areas that show very high DFD. This is explained by observing the behaviour of Suzie in the top row. In those frames her head moves such that she uncovers or occludes some area of the background. The phone handset also uncovers a portion of her swinging hair. In the situation of uncovering, the data in some parts of frame n simply does not exist in frame n−1. Thus the DFD must be high. However, the data that is uncovered in frame n typically is also exposed in frame n+1. Therefore, if we could look into the next frame as well as the previous frame we probably would be able to find a good match for any block, whether it is occluded or uncovered.
Using such Bi-directional prediction gives much better image fidelity. This idea is used in MPEG-2. It
uses both backward prediction for some frames (P frames) and bidirectional prediction for others (B frames).
The sequencing is shown below. Typically MPEG2 encodes images in the following order IBBPBBPBBPBBPI. . . .
I-frames (Intra-coded frames) are encoded just like JPEG i.e. without any motion compensation. This allows
the codec to cope with varying image content...think what would happen if you tried to predict every image
in a movie from the first frame. It's not going to work, is it? So I-frames are slipped in every 12 frames or so
to give a new reference frame for prediction of the next 12 frames.
6.1 Block Matching
The most popular and to some extent the most robust technique to date for motion estimation is Block Matching (BM). It makes the following assumptions.
1. Constant translational motion over small blocks (say 16 × 16 or 8 × 8) in the image. This is the same as saying that there is a minimum object size that is larger than the chosen block size.
2. There is a maximum (pre-determined) range for the horizontal and vertical components of the motion
vector at each pixel site. This is the same as assuming a maximum velocity for the objects in the
sequence. This restricts the range of vectors to be considered and thus reduces the cost of the algorithm.
The image in frame n is divided into blocks, usually of the same size, N × N. Each block is considered in turn and a motion vector is assigned to each. The motion vector is chosen by matching the block in frame n with a set of blocks of the same size at locations defined by some search pattern in the previous frame, n−1.
Define the Mean Absolute Error of the DFD between the block in the current frame and that in the previous frame as

MAE(d) = (1/N²) Σ_{x ∈ Block} | I_n(x) − I_{n−1}(x + d) |     (4)
We can use Mean Squared Error (MSE) as well, but MAE is more robust to noise.
The block matching algorithm then proceeds as follows at each image block.
1. Pre-determine a set of candidate vectors to be tested as the motion vector for the current block.
2. For each candidate vector, calculate the MAE.
3. Choose the motion vector for the block as the candidate which yields the minimum MAE.
The set of candidate vectors in effect yields a set of candidate motion compensated blocks in the previous frame
for evaluation. The separation of the candidate blocks in the search space determines the smallest vector
that can be estimated. For integer accurate motion estimation the position of each block coincides with the
image grid. For fractional accuracy, blocks need to be extracted between locations on the image grid. This
requires some interpolation. In most cases Bilinear interpolation is sufficient.
Figure 4 shows the search space used in a full motion search technique. The current block is compared
to every block of the same size in an area of size (N + 2w) × (N + 2w). The search[5] space is chosen by deciding on the maximum displacement allowed: in Figure 4 the maximum displacement estimated is ±w pels for both horizontal and vertical components.
The technique arises from a direct solution of equation 1. The BM solution can be seen to minimize
the Mean Absolute DFD (or Mean Square DFD) with respect to the displacement d, over the block. The chosen displacement therefore satisfies the model equation 1 in some ‘average’ sense.
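Here is a minimal full-search block matcher following equation 4: every integer displacement within ±w is tried and the candidate with the smallest MAE wins. This is an illustrative sketch; the function and variable names and the test data are my own.

```python
import numpy as np

def full_search(curr, prev, top, left, N=8, w=4):
    """Return the integer motion vector (dy, dx) within +/-w that minimises
    the Mean Absolute Error (equation 4) for the N x N block at (top, left)."""
    target = curr[top:top+N, left:left+N]
    best_vec, best_mae = (0, 0), np.inf
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + N > prev.shape[0] or c + N > prev.shape[1]:
                continue                        # candidate falls off the frame
            mae = np.abs(target - prev[r:r+N, c:c+N]).mean()
            if mae < best_mae:
                best_mae, best_vec = mae, (dy, dx)
    return best_vec, best_mae

rng = np.random.default_rng(1)
prev = rng.integers(0, 256, size=(64, 64)).astype(float)
curr = np.roll(prev, shift=(-2, -3), axis=(0, 1))   # true motion is (2, 3)

print(full_search(curr, prev, top=24, left=24))     # expect ((2, 3), 0.0)
```

With w = 4 this evaluates all (2w + 1)² = 81 candidate blocks, which is exactly why the reduced searches described below were invented.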
[5] There are (2w + 1)² searched locations.
Figure 4: Motion estimation via Block Matching. The candidate positions within a search area of size (N + 2w) × (N + 2w) in frame n−1 are searched for a match with the N × N block in frame n. One block to be examined, at a particular candidate displacement, is shaded.
6.1.1 Computation
The Full Motion Search is computationally demanding. Given a maximum expected displacement of ±w pels, there are (2w + 1)² searched blocks (assuming integer displacements only). Each one requires a comparison over all N × N pixels of the block.
6.1.2 Three-step search
The simplest mechanism for reducing the computational burden of Full Search BM is to reduce the number of motion vectors that are evaluated. The Three-step search is a hierarchical search strategy that evaluates first 9 then 8 and finally again 8 motion vectors to refine the motion estimate in three successive steps. At each step the distance between the evaluated blocks is reduced. The next search is centred on the position of the best matching block in the previous search. It can be generalised to more steps to refine the motion estimate further. Figure 5 shows the searched blocks in frame n−1 for this process.
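Here is a sketch of the three-step idea: evaluate a 3 × 3 pattern of candidates around the current best match, then halve the spacing and re-centre. The names are mine, boundary checks are omitted for brevity, and a real implementation would not re-evaluate the centre candidate at each step (giving the 9 + 8 + 8 count described above).

```python
import numpy as np

def mae(curr, prev, top, left, dy, dx, N=8):
    """Mean Absolute Error between the block at (top, left) in curr and the
    block displaced by (dy, dx) in prev."""
    target = curr[top:top+N, left:left+N]
    candidate = prev[top+dy:top+dy+N, left+dx:left+dx+N]
    return np.abs(target - candidate).mean()

def three_step_search(curr, prev, top, left, N=8, step=4):
    """Three-step block matching: test a 3 x 3 grid of displacements around the
    best match so far, halving the grid spacing after each step."""
    best = (0, 0)
    while step >= 1:
        cy, cx = best
        candidates = [(cy + sy * step, cx + sx * step)
                      for sy in (-1, 0, 1) for sx in (-1, 0, 1)]
        best = min(candidates, key=lambda d: mae(curr, prev, top, left, *d))
        step //= 2
    return best
```

With an initial step of 4 this reaches displacements up to ±7 using roughly 25 MAE evaluations, instead of the 225 a full search over the same range would need.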
6.1.3 Cross search
The cross search is another variant on the subsampled motion vector visiting strategy. It changes the geometry of the search pattern to a '+' or '×' pattern. Figure 5 shows the searched blocks in frame n−1 for this process. If the best match is found at the centre of the search pattern or at the boundary of the search window, then the search step is reduced.
Figure 5: Illustration of searched locations (central pixel of the searched block is shown) in Three-step BM
(left) and Cross-search BM (right). The search window extent is shown in red for Cross-search. The best
matches at each search level are circled in blue.
6.1.4 Problems
The BM algorithm is noted for being a robust estimator of motion since noise effects tend to be averaged out
over the block operations. However, if there is no textural information in the two blocks compared, then
noise dominates the matching process and causes spurious motion estimates.
This problem can be isolated by comparing the best match found (MAE_min) to the 'no motion' match (MAE_0, the MAE at zero displacement). If these matches are sufficiently different then the motion estimate is accepted, otherwise no motion is assumed. A threshold acts on the ratio of the two; the error measure used is the MAE. If MAE_0 / MAE_min < t, where t is some threshold chosen according to the noise level suspected, then no motion is assumed. This algorithm verifies the validity of the motion estimate once motion is detected.
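A sketch of that validation test (the names and the threshold value are made up; the convention follows the text: if the best match is not sufficiently better than the 'no motion' match, the zero vector is used instead).

```python
def motion_is_valid(mae_no_motion, mae_best, threshold=1.2):
    """Accept the estimated vector only if its match is clearly better than
    simply assuming no motion; otherwise fall back to the zero vector."""
    if mae_best == 0:
        return True                     # perfect match, accept the vector
    return (mae_no_motion / mae_best) >= threshold

print(motion_is_valid(2.1, 2.0))        # False: probably just noise, assume no motion
print(motion_is_valid(10.0, 2.0))       # True: real motion detected
```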
The main disadvantages of Block Matching are the heavy computation involved (although these are byte
wise manipulations) and the motion averaging effect of the blocks. If the blocks chosen are too large then
many differently moving objects may be enclosed by one block and the chosen motion vector is unlikely to
match the motion of any of the objects. The advantages are that it is very simple to implement[6] and it is
robust to noise due to the averaging over the blocks.
There are many more useful motion estimators than this. These others do give you motion better matched
to what is actually going on in the scene. But we will not look at these here.
DVD and DTV both use MPEG-2, and the core is exactly as described here. MPEG-2 became a standard
around 1992, and just 4 years later Digital Television was a reality. This is quite amazing considering that the
advances in research in video compression that made this possible were only really about 15 years old at the
time. Compare that to the 200 years it took Fourier to be really appreciated!
[6] It has been implemented in silicon for video coding applications.
Mobile phone video communications will use MPEG-4 (established around 1998). Unfortunately that is
going through some teething trouble at the moment.
Sadly, the creation of MPEG standards is not as simple as motion estimation, DFD, DCT, quantisation
and transmission. When you actually start to think about putting together codecs the following issues arise.
Compression There are at least three fundamentally different types of multimedia data sources: pictures, audio and
text. Different compression techniques are needed for each data type. Each piece of data has to be
identified with unique codewords for transmission.
Sequencing The compressed data from each source is scanned into a sequence of bits. This sequence is then
packetised for transport. The problem here is to identify each different part of the bitstream uniquely
to the decoder, e.g. header information, DCT coefficient information.
Multiplexing The audio and video data (for instance) has to be decoded at the same time (or approximately the
same time) to create a coherent signal at the receiver. This implies that the transmitted elementary
data streams should be somehow combined so that they arrive at the correct time at the decoder. The
challenge is therefore to allow for identifying the different parts of the multiplexed stream and to insert
information about the timing of each elementary data stream.
Media The compressed and multiplexed data has to be stored on some DSM (Digital Storage Media) and then later (or live) broadcast
to receivers across air or other links. Access to different Media channels (including DSM) is governed
by different constraints and this must somehow be allowed for in the standards description.
Errors Errors in the received bitstream invariably occur. The receiver must cope with errors such that the
system performance is robust to errors or it degrades in some graceful way.
Bandwidth The bandwidth available for the multimedia transmission is limited. The transmission system must
ensure that the bandwidth of the bitstream does not exceed these limits. This problem is called Rate
Control and applies both to the control of the bitrate of the elementary data streams and the multiplexed
stream.
Multiplatform The coded bitstream may need to be decoded on many different types of device with varying proces-
sor speeds and storage resources. It would be interesting if the transmission system could provide a
bitstream which could be decoded to varying extents by different devices. Thus a low capacity device
could receive a lower quality picture than a high capacity device that would receive further features and
higher picture quality. This concept applied to the construction of a suitable bitstream format is called
Scalability.
What we have covered here is the core of the standard used for image and video compression. This just
says how the data itself is compressed. If you open up an .avi or .mpg file, you will not see this data in that
same form. It has to be encoded into symbols, and timing and copyright information embedded at the very
least. This makes the design of codecs a tricky business. But it is certainly true that without standards, there
would be no business in video communications.
Finally, note that none of the compression standards actually describe how you do the things you have
to do. They just describe how to represent bits and package them. So you can use cleverer DCTs or cleverer
motion estimators to get better speed and performance. That is why one manufacturer’s codec could be better
than another’s even though they both create compressed video according to the same standard.