CH: 7 Fundamentals of Video Coding
where i, j are the coordinates of the motion vector of the current block, i ϵ (–Vw/2; Vw/2), j ϵ (–Vh/2; Vh/2), and Vw × Vh is the size of the area within which the upper-left corner of the current block may lie on the reference frame; x, y are the coordinates of the current block B; Nw × Nh is the size of block B; S is the reference area of size Sw × Sh, where Sw = Nw + Vw, Sh = Nh + Vh; B and S are the luminance components of images in the YUV color format. Within the search area of size Sw × Sh, the minimum value of the SSD criterion for the current block B determines the coordinates of its motion vector. The SSD can be calculated with fewer operations by decomposing it into three components:
SSD(i, j) = \sum_{y=0}^{N_h-1} \sum_{x=0}^{N_w-1} B^2(x, y) - 2 \sum_{y=0}^{N_h-1} \sum_{x=0}^{N_w-1} B(x, y)\, S(x+i, y+j) + \sum_{y=0}^{N_h-1} \sum_{x=0}^{N_w-1} S^2(x+i, y+j)
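The following sketch (a minimal illustration assuming NumPy arrays for B and S; the function name ssd_search and the indexing relative to the top-left corner of the search area are choices made for the example) evaluates the SSD through this three-term decomposition during a full search:

import numpy as np

def ssd_search(B, S):
    """Return the displacement (i, j), relative to the top-left of S, minimizing SSD."""
    Nh, Nw = B.shape
    Sh, Sw = S.shape
    B64 = B.astype(np.int64)
    sum_B2 = np.sum(B64 ** 2)                  # first term: constant over the search
    best, best_ij = None, (0, 0)
    for j in range(Sh - Nh + 1):
        for i in range(Sw - Nw + 1):
            cand = S[j:j + Nh, i:i + Nw].astype(np.int64)
            cross = np.sum(B64 * cand)         # second term: sum of B * S
            sum_S2 = np.sum(cand ** 2)         # third term: sum of S^2
            ssd = sum_B2 - 2 * cross + sum_S2
            if best is None or ssd < best:
                best, best_ij = ssd, (i, j)
    return best_ij, best

The first term depends only on B and is computed once; in a fuller implementation the cross term and the sum of squares over S could also be obtained with fast correlation and sliding-window sums rather than the explicit loops used here.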
We propose to replace this algorithm with other fast transforms: the Winograd algorithm and the Fermat number-theoretic transform (NTT).
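As a small, hedged illustration of such a transform, the sketch below implements a naive number-theoretic transform over the Fermat prime 257; the transform length 16 and the choice of primitive root 3 are assumptions made for the example, not parameters taken from the text:

# Naive NTT over the Fermat prime p = 2^8 + 1 = 257 (illustrative only).
p = 257                      # Fermat prime F_3
n = 16                       # transform length; must divide p - 1 = 256
g = pow(3, (p - 1) // n, p)  # primitive n-th root of unity modulo p (3 is a primitive root mod 257)

def ntt(a):
    """Naive O(n^2) forward NTT of a length-n integer sequence modulo p."""
    return [sum(a[x] * pow(g, k * x, p) for x in range(n)) % p for k in range(n)]

def intt(A):
    """Inverse NTT: uses g^-1 and the modular inverse of n."""
    g_inv, n_inv = pow(g, p - 2, p), pow(n, p - 2, p)
    return [(n_inv * sum(A[k] * pow(g_inv, k * x, p) for k in range(n))) % p
            for x in range(n)]

assert intt(ntt(list(range(n)))) == list(range(n))   # exact round trip over the prime field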
Backward motion estimation:
The motion estimation that we have discussed in Sections 20.3 and 20.4 is essentially
backward motion estimation, since the current frame is considered as the candidate frame and the
reference frame on which the motion vectors are searched is a past frame, that is, the search is
backward. Backward motion estimation leads to forward motion prediction.
Backward motion estimation is illustrated in the figure below.
Forward motion estimation
It is just the opposite of backward motion estimation. Here, the search for motion vectors is carried out
on a frame that appears later than the candidate frame in temporal ordering. In other words, the search
is “forward”. Forward motion estimation leads to backward motion prediction.
Forward motion estimation is illustrated in Fig. 20.3.
It may appear that forward motion estimation is unusual, since one requires future frames to predict the candidate frame. However, this is not unusual, since the candidate frame for which the motion vector is being sought is not necessarily the current, that is, the most recent, frame. It is possible to store more than one frame and to use one of the past frames as the candidate frame, with another frame that appears later in the temporal order serving as the reference.
Forward motion estimation (or backward motion compensation) is supported under the MPEG-1 and MPEG-2 standards, in addition to the conventional backward motion estimation. The standards also support bi-directional motion compensation, in which the candidate frame is predicted from a past reference frame as well as a future reference frame.
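The sketch below (with assumed helper names and NumPy arrays) illustrates bi-directional motion compensation as the average of a block fetched from a past reference and a block fetched from a future reference, in the spirit of the bi-directional prediction just described:

import numpy as np

def predict_block(ref, x, y, mv, N=16):
    """Fetch the N x N block pointed to by motion vector mv = (dx, dy)."""
    dx, dy = mv
    return ref[y + dy:y + dy + N, x + dx:x + dx + N].astype(np.int32)

def bidirectional_prediction(past_ref, future_ref, x, y, mv_fwd, mv_bwd, N=16):
    """Average the forward and backward predictions for one block."""
    p_fwd = predict_block(past_ref, x, y, mv_fwd, N)
    p_bwd = predict_block(future_ref, x, y, mv_bwd, N)
    return ((p_fwd + p_bwd + 1) // 2).astype(np.uint8)   # rounded average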
Frame classification:
In video compression, a video frame is compressed using different algorithms. These different algorithms for video frames are called picture types or frame types, and they are I, P and B. The characteristics of the frame types are:
I-frame:
I-frames are the least compressible but do not require other video frames to decode.
P-frame:
It can use data from previous frames to decompress and is more compressible than an I-frame.
B-frame:
It can use both previous and following frames for data reference and achieves the highest amount of data compression.
An I-frame (Intra coded picture) is a complete image, like a JPG image file.
A P-frame (Predicted picture) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P-frame, thus saving space. P-frames are also known as delta frames.
A B-frame (Bidirectional predicted picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.
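As a toy illustration of how these frame types are typically interleaved, the sketch below cycles an assumed GOP pattern over a sequence of frames; the pattern IBBPBBPBB and the function name are only examples, not a requirement of any standard:

def frame_types(num_frames, gop_pattern="IBBPBBPBB"):
    """Assign I/P/B types to frames in display order by cycling a GOP pattern."""
    return [gop_pattern[i % len(gop_pattern)] for i in range(num_frames)]

print(frame_types(12))   # ['I', 'B', 'B', 'P', 'B', 'B', 'P', 'B', 'B', 'I', 'B', 'B']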
Picture/Frame:
The terms picture and frame are used interchangeably. The term picture is the more general notion, as a picture can be either a frame or a field. A frame is a complete image, and a field is the set of odd-numbered or even-numbered scan lines composing a partial image. For example, an HD 1080 picture has 1080 lines of pixels. An odd field consists of pixel information for lines 1, 3, 5, ..., 1079. An even field has pixel information for lines 2, 4, 6, ..., 1080. When video is sent in interlaced scan format, each frame is sent in two fields: the field of odd-numbered lines followed by the field of even-numbered lines. A frame used as a reference for predicting other frames is called a reference frame. Frames encoded without information from other frames are called I-frames. Frames that use prediction from a single preceding reference frame are called P-frames. Frames that use prediction from an average of two reference frames, one preceding and one succeeding, are called B-frames.
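A minimal sketch of the frame-to-field split described above, assuming NumPy and a 1080-line luma frame:

import numpy as np

frame = np.zeros((1080, 1920), dtype=np.uint8)   # a full HD luma frame
odd_field = frame[0::2, :]    # scan lines 1, 3, 5, ..., 1079 (1-based numbering)
even_field = frame[1::2, :]   # scan lines 2, 4, 6, ..., 1080
assert odd_field.shape[0] == even_field.shape[0] == 540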
Slices:
A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame. I-slices, P-slices, and B-slices take the place of I, P and B frames.
Macroblocks:
It is a processing unit in image and video compression formats based on linear block transforms, typically the DCT. It consists of 16x16 samples and is subdivided into transform blocks, and may be further subdivided into prediction blocks.
Partitioning of picture:
Slices:
• A picture is split into one or several slices
• Slices are self-contained
• Slices are a sequence of macroblocks
Macroblocks:
• Basic syntax & processing unit
• Contains 16x16 luma samples and two 8x8 chroma sample blocks
• Macroblocks within a slice depend on each other
• Macroblocks can be further partitioned
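The following sketch (an assumed helper, not part of any standard's reference code) partitions a 4:2:0 picture into macroblocks of 16x16 luma samples plus two 8x8 chroma blocks each:

import numpy as np

def macroblocks(Y, Cb, Cr):
    """Yield (luma 16x16, Cb 8x8, Cr 8x8) tuples in raster-scan order.
    Assumes 4:2:0 chroma and picture dimensions that are multiples of 16."""
    H, W = Y.shape                      # luma size; chroma planes are H/2 x W/2
    for y in range(0, H, 16):
        for x in range(0, W, 16):
            yield (Y[y:y + 16, x:x + 16],
                   Cb[y // 2:y // 2 + 8, x // 2:x // 2 + 8],
                   Cr[y // 2:y // 2 + 8, x // 2:x // 2 + 8])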
Elements of video encoding and decoding:
Discrete Cosine Transform - The DCT decomposes each input block into a series of waveforms, each with a specific spatial frequency, and outputs an 8x8 block of horizontal and vertical frequency coefficients.
Quantization - The quantization block uses psychovisual characteristics to eliminate the unimportant DCT coefficients, mainly the high-frequency coefficients.
Inverse Quantization - IQ reconstructs the DCT coefficients by multiplying the quantized values with the quantization table.
Inverse Discrete Cosine Transform - IDCT reconstructs the original input block; errors are expected due to quantization.
Motion Estimation - ME uses a scheme with fewer search locations and fewer pixels to generate motion vectors indicating the directions of the moving images.
Motion Compensation - The MC block increases the compression ratio by removing the redundancies between frames.
Variable Length Coding (Lossless) - VLC reduces the bit rate by sending shorter codes for common symbol pairs (a run of zeros and the following non-zero value) and longer codes for less common pairs.
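As a rough illustration of the transform and quantization path above, the following sketch (an assumption for illustration, not any particular standard's integer transform or quantization matrix) runs an 8x8 block through a DCT, a flat quantizer with step 16, inverse quantization and the IDCT using SciPy:

import numpy as np
from scipy.fft import dctn, idctn

def encode_block(block, qstep=16):
    """8x8 block -> quantized DCT coefficients (forward path)."""
    coeffs = dctn(block.astype(np.float64) - 128.0, norm="ortho")  # level shift + DCT
    return np.round(coeffs / qstep).astype(np.int32)               # quantization

def decode_block(qcoeffs, qstep=16):
    """Quantized coefficients -> reconstructed 8x8 block (inverse path)."""
    coeffs = qcoeffs.astype(np.float64) * qstep                    # inverse quantization
    return np.clip(np.round(idctn(coeffs, norm="ortho") + 128.0), 0, 255).astype(np.uint8)

block = (np.arange(64).reshape(8, 8) * 3 % 256).astype(np.uint8)
rec = decode_block(encode_block(block))   # rec approximates block; quantization error is expected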
The following is an example of a wireless video codec (encoder and decoder) that includes pre-processing of the captured data to interface with the encoder and post-processing of the data to interface with the LCD panel. The video codec is compliant with the low bit rate codec for multimedia telephony defined by the Third Generation Partnership Project (3GPP).
The baseline CODEC defined by 3GPP is H.263, and MPEG-4 Simple Visual Profile is defined as an option. The video codec implemented supports the following video formats:
1. SQCIF (128 x 96 resolution)
2. QCIF (176 x 144 resolution) at Simple Profile Level 1
3. CIF (352 x 288 resolution) at Simple Profile Level 2
4. 64 kbit/s for Simple Profile Level 1
5. 128 kbit/s for Simple Profile Level 2
Video CODEC Description
The video encoder implemented requires a YUV 4:2:0 non-interlaced video input and, therefore, pre-processing of the video input may be required depending on the application. For the video decoder, post-processing is needed to convert the decoded YUV 4:2:0 data to RGB for display.
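As an illustration of this post-processing step, the sketch below converts YUV (YCbCr) 4:2:0 data to RGB; the BT.601 full-range coefficients and the nearest-neighbour chroma upsampling are assumptions made for the example, not the codec's specified conversion:

import numpy as np

def yuv420_to_rgb(Y, Cb, Cr):
    """Y: HxW, Cb/Cr: (H/2)x(W/2); returns an HxWx3 uint8 RGB image."""
    Cb = Cb.repeat(2, axis=0).repeat(2, axis=1).astype(np.float64) - 128.0
    Cr = Cr.repeat(2, axis=0).repeat(2, axis=1).astype(np.float64) - 128.0
    Y = Y.astype(np.float64)
    R = Y + 1.402 * Cr
    G = Y - 0.344136 * Cb - 0.714136 * Cr
    B = Y + 1.772 * Cb
    return np.clip(np.dstack([R, G, B]), 0, 255).astype(np.uint8)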
Features
1. Pre-processing:
− YUV 4:2:2 interlaced (from a camera, for example) to YUV 4:2:0 non-interlaced, with only decimation and no filtering of the UV components
2. Post-processing:
− YUV 4:2:0 to RGB conversion
− Display formats of 16 bits or 12 bits RGB
− 0 to 90 degrees rotation for landscape and portrait displays
3. MPEG-4 Simple Profile Level 0, Level 1 and Level 2 support
4. H.263 and MPEG-4 decoder and encoder compliant
5. MPEG-4 video decoder options are:
− AC/DC prediction
− Reversible Variable Length Coding (RVLC)
− Resynchronization Marker (RM)
− Data Partitioning (DP)
− Error concealment, proprietary techniques
− 4 Motion Vectors per Macroblock (4MV)
− Unrestricted Motion Compensation
− Decode VOS layers
6. MPEG-4 video encoder options are:
− Reversible Variable Length Coding (RVLC)
− Resynchronization Marker (RM)
− Data Partitioning (DP)
− 4 Motion Vectors per Macroblock (4MV)
− Header Extension Codes
− Bit rate target change during encoding
− Coding frame rate change during encoding
− Insertion or not of Visual Object Sequence start code
7. Insertion of I-frame during the encoding of a sequence support
8. Encoder Adaptive Intra Refresh (AIR) support
9. Multi-codec support, multiple codecs running from the same code
Video Architecture
Pixel Representation
Red, Green and Blue (RGB) are the primary colors for the computer display, and the color depth supported by the OMAP5910 is programmable up to 16 bits per pixel, RGB565 (5 bits for Red, 6 bits for Green and 5 bits for Blue). In consumer video such as DVD, camera, digital TV and others, the common color coding scheme is YCbCr, where Y is the luminance, Cb is the blue chrominance and Cr is the red chrominance. Human eyes are much more sensitive to the Y component of the video, and this allows the chrominance components to be sub-sampled without the difference being detected by the human eye. Depending on the degree of sub-sampling, the format is referred to as YCbCr 4:2:0, YCbCr 4:2:2 or YCbCr 4:4:4.
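As a small illustration of the RGB565 representation mentioned above, the following sketch packs 8-bit RGB components into a 16-bit word; the function name and the truncation-based rounding are assumptions made for brevity:

def pack_rgb565(r, g, b):
    """Pack 8-bit r, g, b components into one 16-bit RGB565 value."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

assert pack_rgb565(255, 255, 255) == 0xFFFF
assert pack_rgb565(255, 0, 0) == 0xF800   # pure red occupies the top 5 bits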
Video coding standards MPEG and H.26X
The Moving Picture Experts Group (MPEG) was established in 1988 in the framework of the Joint
ISO/IEC Technical Committee (JTC 1) on Information Technology with the mandate to
develop standards for coded representation of moving pictures, associated audio and their
combination when used for storage and retrieval on Digital Storage Media at bitrates up to about 1.5 Mbit/s. The standard was nicknamed MPEG-1 and was issued in 1992. The scope of the
group was later extended to provide appropriate MPEG-2 video and associated audio compression
algorithms for a wide range of audio-visual applications at substantially higher bitrates not
successfully covered or envisaged by the MPEG-1 standard. Specifically, MPEG-2 was given the
charter to provide video quality not lower than NTSC/PAL and up to CCIR601 quality with bitrates
targeted between 2 and 10 Mbit/s. Emerging applications, such as digital cable TV distribution,
networked database services via ATM, digital VTR applications, and satellite and terrestrial digital
broadcasting distribution, were seen to benefit from the increased quality expected to result from the
emerging MPEG-2 standard. The MPEG-2 standard was released in 1994. Table I below summarizes the primary applications and quality requirements targeted by the MPEG-1 and MPEG-2 video standards, together with examples of typical video input parameters and the compression ratios achieved.
The MPEG-1 and MPEG-2 video compression techniques developed and standardized by the MPEG group have become important and successful video coding standards worldwide, with an increasing number of MPEG-1 and MPEG-2 VLSI chip-sets and products becoming available on the market. One key factor for this success is the generic structure of the MPEG standards, which supports a wide range of applications and application-specific parameters [schaf, siko1]. To support this wide range of application profiles, a diversity of input parameters, including flexible picture size and frame rate, can be specified by the user. Another important factor is that the MPEG group standardized only the decoder structures and the bitstream formats. This allows a large degree of freedom for manufacturers to optimize the coding efficiency (in other words, the video quality at a given bit rate) by developing innovative encoder algorithms, even after the standards were finalized.
MPEG-1 Standard (1991) (ISO/IEC 11172)
Target bit-rate about 1.5 Mbps
Typical image format CIF, no interlace
Frame rate 24 ... 30 fps
Main application: video storage for multimedia (e.g., on CD-ROM)
MPEG-2 Standard (1994) (ISO/IEC 13818)
Extension for interlace, optimized for TV resolution
(NTSC: 704 x 480 pixels)
Image quality similar to NTSC, PAL, SECAM at 4-8 Mbps
HDTV at 20 Mbps
MPEG-4 Standard (1999) (ISO/IEC 14496)
Object based coding
Wide-range of applications, with choices of interactivity, scalability, error resilience, etc.
H.26X
H.261 Video Compression Standard
CH 8 Video segmentation
Temporal Segmentation:
Segmentation is highly dependent on the model and criteria for grouping pixels into regions. In
motion segmentation, pixels are grouped together based on their similarity in motion. For any given
application, the segmentation algorithm needs to find a balance between model complexity and
analysis stability. An insufficient model will inevitably result in over-segmentation. Complicated
models will introduce more complexity and require more computation and constraints for stability. In
image coding, the objective of segmentation is to exploit the spatial and temporal coherences in the
video data by adequately identifying the coherent motion regions with simple motion models.
Block-based video coders avoid the segmentation problem altogether by artificially imposing a
regular array of blocks and applying motion coherence within these blocks. This model requires very
small overhead in coding, but it does not accurately describe an image and does not fully exploit the
coherences in the video data. Region-based approaches, which exploit the coherence of object motion by grouping similar-motion regions into a single description, have shown improved performance over block-based coders.
In layered representation coding [14, 15], video data is decomposed into a set of overlapping layers.
Each layer consists of: an intensity map describing the intensity profile of a coherent motion region
over many frames; an alpha map describing its relationship with other layers; and a parametric motion
map describing the motion of the region. The layered representation has the potential to achieve greater compression because each layer exploits both the spatial and temporal coherences of the video data. In addition, the representation is similar to those used in computer graphics, and so it provides a
convenient way to manipulate video data. Our goal in spatiotemporal segmentation is to identify the
spatial and temporal coherences in video data and derive the layered representation for the image
sequence.
Temporal coherence
Motion estimation provides the necessary information for locating corresponding regions in different
frames. The new positions for each region can be predicted given the previously estimated motion for
that region. Motion models are estimated within each of these predicted regions and an updated set of
motion hypotheses derived for the image. Alternatively, the motion models estimated from the
previous segmentation can be used by the region classifier to directly determine the corresponding
coherent motion regions. Thus, segmentation based on motion conveniently provides a way to track
coherent motion regions.
In addition, when the analysis is initialized with the segmentation results from the previous frame, computation is reduced and the robustness of estimation is increased.
Temporal segmentation adds structure to the video by partitioning the video into chapters.
This is a first step for video summarization methods, which should also enable fast browsing
and indexing so that a user can quickly discover important activities or objects.
Shot boundary detection, hard cut and soft cuts:
The concept of temporal image sequence (video) segmentation is not a new one, as it dates back to the
first days of motion pictures, well before the introduction of computers. Motion picture specialists
perceptually segment their works into a hierarchy of partitions. A video (or film) is completely and
disjointly segmented into a sequence of scenes, which are subsequently segmented into a sequence of
shots. Scenes (also called story units) are a concept that is much older than motion pictures, ultimately
originating in the theater. Traditionally, a scene is a continuous sequence that is temporally and
spatially cohesive in the real world, but not necessarily cohesive in the projection of the real world on
film. On the other hand, shots originate with the invention of motion cameras and are defined as the
longest continuous sequence that originates from a single camera take, which is what the camera
images in an uninterrupted run. In general, the automatic segmentation of a video into scenes ranges
from very difficult to intractable. On the other hand, video segmentation into shots is both exactly
defined and also characterized by distinctive features of the video stream itself. This is because video
content within a shot tends to be continuous, due to the continuity of both the physical scene and the
parameters (motion, zoom, focus) of the camera that images it.
Therefore, in principle, the detection of a shot change between two adjacent frames simply requires computing an appropriate continuity or similarity metric. However, this simple concept has three major
complications. The first, and most obvious, one is defining a continuity metric for the video in such a way that it is insensitive to gradual changes in camera parameters, lighting, and physical scene content, yet easy to compute and discriminative enough to be useful. The simplest way to do that is to extract one or more scalar or vector features from each frame and to define distance functions on the feature domain. Alternatively, the features themselves can be used either for clustering the frames into shots, or for detecting shot transition patterns. The second complication is deciding which values of
the continuity metric correspond to a shot change and which do not. This is not trivial, since the
feature variation within certain shots can exceed the respective variation across shots. Decision
methods for shot boundary detection include fixed thresholds, adaptive thresholds and statistical
detection methods. The third complication, and the most difficult to handle, is the fact that not all shot
changes are abrupt. Using motion picture terminology, changes between shots can belong to the
following categories:
1. Cut. This is the classic abrupt change case, where one frame belongs to the disappearing shot and
the next one to the appearing shot.
2. Dissolve. In this case, the last few frames of the disappearing shot temporally overlap with the
first few frames of the appearing shot. During the overlap, the intensity of the disappearing shot
decreases from normal to zero (fade out), while that of the appearing shot increases from zero to
normal (fade in).
3. Fade. Here, first the disappearing shot fades out into a blank frame, and then the blank frame fades
in into the appearing shot.
4. Wipe. This is actually a set of shot change techniques, where the appearing and disappearing shots
coexist in different spatial regions of the intermediate video frames, and the region occupied by the
former grows until it entirely replaces the latter.
5. Other transition types. There is a multitude of inventive special effects techniques used in motion
pictures. These are in general very rare and difficult to detect.
Shot-boundary detection is the first step towards scene extraction in videos, which is useful for video
content analysis and indexing. A shot in a video is a sequence of frames taken continuously by one camera. A common approach to detecting shot boundaries consists of computing the similarity between pairs of consecutive frames and marking the occurrence of a boundary where the similarity is lower than some threshold. The similarity is measured globally, for example with histograms, or locally within rectangular blocks. Previously, luminance/color, edges, texture and SIFT features have been used to represent individual frames.
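As an illustration of this thresholding approach, the sketch below flags a hard cut wherever the histogram intersection of consecutive grey-level frames drops below a threshold; the 64-bin histogram, the intersection metric and the threshold value are assumptions made for the example, not a prescribed method:

import numpy as np

def histogram(frame, bins=64):
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()                      # normalized grey-level histogram

def detect_cuts(frames, threshold=0.7):
    """Return indices i where a cut is detected between frame i-1 and frame i."""
    cuts = []
    prev = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = histogram(frame)
        similarity = np.minimum(prev, cur).sum()   # histogram intersection in [0, 1]
        if similarity < threshold:
            cuts.append(i)
        prev = cur
    return cuts

A fixed threshold is used here for simplicity; as noted above, adaptive thresholds or statistical decision methods are generally more robust when feature variation within shots is large.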