
CH: 7 FUNDAMENTALS OF VIDEO CODING

Inter frame redundancy:

An inter frame is a frame in a video compression stream that is expressed in terms of one or more neighboring frames. The "inter" part of the term refers to the use of inter-frame prediction. This kind of prediction tries to take advantage of the temporal redundancy between neighboring frames, enabling higher compression rates.
Motion estimation:
The temporal encoding aspect of this system relies on the assumption that rigid body motion is responsible for the differences between two or more successive frames. The objective of the motion estimator is to estimate this rigid body motion between two frames. The motion estimator operates on all 16 x 16 image blocks of the current frame and generates the pixel displacement, or motion vector, for each block. The technique used to generate motion vectors is called block-matching motion estimation. The method uses the current frame f_k and the previous reconstructed frame f_{k-1} as input. Each block in the current frame is assumed to have a displacement that can be found by searching for it in the previous reconstructed frame. The search is usually constrained to a reasonable neighborhood so as to minimize the complexity of the operation. Search matching is usually based on a minimum MSE or MAE criterion. When a match is found, the pixel displacement is used to encode the particular block. If a search does not meet a minimum MSE or MAE threshold criterion, the motion compensator will indicate that the current block is to be spatially encoded using the intraframe mode.
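The full-search block matching described above can be sketched as follows. This is a minimal illustration with hypothetical array names cur and ref and a MAE matching criterion, not the exact procedure of any particular codec.

```python
import numpy as np

def full_search(cur, ref, block=16, radius=7):
    """Exhaustive block-matching motion estimation.

    cur, ref: 2-D numpy arrays (luma of current and reference frame).
    Returns an array of (dy, dx) motion vectors, one per 16x16 block.
    """
    H, W = cur.shape
    vectors = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(0, H - block + 1, block):
        for bx in range(0, W - block + 1, block):
            target = cur[by:by + block, bx:bx + block].astype(int)
            best_cost, best_mv = np.inf, (0, 0)
            # Evaluate every candidate displacement inside the search window.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue
                    cand = ref[y:y + block, x:x + block].astype(int)
                    cost = np.abs(target - cand).mean()  # MAE criterion
                    if cost < best_cost:
                        best_cost, best_mv = cost, (dy, dx)
            vectors[by // block, bx // block] = best_mv
    return vectors
```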
Motion estimation techniques: full search and fast search:
Motion estimation (ME) is used extensively in video codecs based on the MPEG-4 standards to remove interframe redundancy. Motion estimation is based on the block-matching method, which evaluates block mismatch by the sum of squared differences (SSD) measure. Winograd's Fourier transform is applied, and the redundancy of the overlapped-area computation among reference blocks is eliminated, in order to reduce the computational cost of the ME. When the block size is N × N and the number of reference blocks in a search window is the same as the current block, this method reduces the computational cost (additions and multiplications) to 58% of the straightforward approach for N = 8 and to 81% for N = 16 without degrading motion tracking capability. The resulting fast full-search ME method enables more accurate motion estimation than conventional fast ME methods, so it can be applied in practical video systems.
The popularity of video as a means of data representation and transmission is increasing, and hence the requirements on the quality and size of video are growing. High visual quality at a manageable size is provided by video coding. In the 1960s, motion estimation (ME) and compensation were proposed to improve the efficiency of video coding.
The current frame is divided into non-overlapping blocks. For each block of the current frame, the most similar block of the reference frame within a limited search area is found. The criterion of similarity of two blocks is called the block-matching metric. The position of the block for which an extremum of the metric is found determines the coordinates of the motion vector of the current block. The full-search algorithm is the most accurate method of block ME, i.e. the proportion of true motion vectors found is the highest. The current block is compared to all candidate blocks within the restricted search area in order to find the best match. This ME algorithm requires a lot of computing resources; therefore, many alternative fast motion estimation algorithms were developed. In 1981, T. Koga and co-authors proposed the three-step search algorithm (TSS).
The disadvantage of fast search methods is that they may find only a local extremum of the block-difference function. Consequently, motion estimation accuracy can degrade noticeably on some sequences compared to the brute-force full search, and the visual quality of the video degrades as well.
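As a rough sketch of the idea behind Koga's three-step search (assuming the same hypothetical cur/ref arrays and MAE cost as above, and a ±7 search range), each step evaluates nine candidate positions around the current best and then halves the step size:

```python
import numpy as np

def mae(a, b):
    return np.abs(a.astype(int) - b.astype(int)).mean()

def three_step_search(cur, ref, by, bx, block=16, step=4):
    """Three-step search for one block at (by, bx); returns (dy, dx)."""
    H, W = ref.shape
    target = cur[by:by + block, bx:bx + block]
    cy, cx = 0, 0                       # current best displacement
    while step >= 1:
        best_cost, best = np.inf, (cy, cx)
        for dy in (-step, 0, step):     # nine candidates around current best
            for dx in (-step, 0, step):
                y, x = by + cy + dy, bx + cx + dx
                if y < 0 or x < 0 or y + block > H or x + block > W:
                    continue
                cost = mae(target, ref[y:y + block, x:x + block])
                if cost < best_cost:
                    best_cost, best = cost, (cy + dy, cx + dx)
        cy, cx = best
        step //= 2                      # 4 -> 2 -> 1, three steps in total
    return cy, cx
```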
The Criterion To Compare Blocks
Video coding standards do not regulate the choice of the criterion (metric) used to match two blocks. One of the most popular metrics is the sum of squared differences (SSD):

SSD(i, j) = \sum_{y=0}^{N_h-1} \sum_{x=0}^{N_w-1} \left( B(x, y) - S(x+i, y+j) \right)^2

where i, j are the coordinates of the motion vector of the current block, i ∈ (–Vw/2; Vw/2), j ∈ (–Vh/2; Vh/2); Vw × Vh is the size of the area within which the upper-left corner of the current block may lie on the reference frame; x, y are the coordinates within the current block B; Nw × Nh is the size of block B; S is the reference area of size Sw × Sh, where Sw = Nw + Vw and Sh = Nh + Vh; B and S are luminance images in the YUV color format. The minimum value of the SSD criterion inside the search area of size Sw × Sh determines the coordinates of the motion vector of the current block B. The SSD can be calculated with fewer operations by decomposing it into three components:
SSD(i, j) = \sum_{y=0}^{N_h-1} \sum_{x=0}^{N_w-1} B^2(x, y) \;-\; 2 \sum_{y=0}^{N_h-1} \sum_{x=0}^{N_w-1} B(x, y)\, S(x+i, y+j) \;+\; \sum_{y=0}^{N_h-1} \sum_{x=0}^{N_w-1} S^2(x+i, y+j)
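As an illustration of why this decomposition helps (a sketch under the assumption that B and S are small integer numpy arrays): the first term is constant over all candidate positions, the third term can be shared between overlapping reference patches, and only the cross term couples B with S.

```python
import numpy as np

rng = np.random.default_rng(0)
Nh, Nw, Vh, Vw = 8, 8, 8, 8
B = rng.integers(0, 256, (Nh, Nw)).astype(np.int64)            # current block
S = rng.integers(0, 256, (Nh + Vh, Nw + Vw)).astype(np.int64)  # reference area

def ssd_direct(B, S, i, j):
    patch = S[j:j + B.shape[0], i:i + B.shape[1]]
    return np.sum((B - patch) ** 2)

def ssd_decomposed(B, S, i, j):
    patch = S[j:j + B.shape[0], i:i + B.shape[1]]
    # Sum(B^2) is the same for every (i, j); Sum(S^2) over each patch can be
    # reused between overlapping patches; only the cross term needs both B and S.
    return np.sum(B * B) - 2 * np.sum(B * patch) + np.sum(patch * patch)

assert ssd_direct(B, S, 3, 2) == ssd_decomposed(B, S, 3, 2)
```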
We propose to replace this algorithm with other fast transforms: the Winograd algorithm and the Fermat number-theoretic transform (NTT).
Backward motion estimation:
The motion estimation that we have discussed in Sections 20.3 and 20.4 is essentially backward motion estimation, since the current frame is considered as the candidate frame and the reference frame on which the motion vectors are searched is a past frame; that is, the search is backward. Backward motion estimation leads to forward motion prediction.
Backward motion estimation is illustrated in the figure below.
Forward motion estimation
It is just the opposite of backward motion estimation. Here, the search for motion vectors is carried out on a frame that appears later than the candidate frame in temporal order. In other words, the search is "forward". Forward motion estimation leads to backward motion prediction.
Forward motion estimation is illustrated in Fig. 20.3.

It may appear that forward motion estimation is unusual, since one requires future frames to predict the candidate frame. However, this is not unusual, since the candidate frame for which the motion vector is being sought is not necessarily the current, that is the most recent, frame. It is possible to store more than one frame and use one of the past frames as a candidate frame, with another frame, appearing later in temporal order, used as the reference.
Forward motion estimation (or backward motion compensation) is supported under the MPEG-1 and MPEG-2 standards, in addition to the conventional backward motion estimation. The standards also support bi-directional motion compensation, in which the candidate frame is predicted from a past reference frame as well as a future reference frame.
Frame classification:
In video compression, a video frame is compressed using different algorithms. These different algorithms for video frames are called picture types or frame types, and they are I, P and B. The characteristics of the frame types are:
I-frame:
I-frames are the least compressible but do not require other video frames to decode.
P-frame:
P-frames can use data from previous frames to decompress and are more compressible than I-frames.
B-frame:
B-frames can use both previous and following frames for data reference to get the highest amount of data compression.
An I-frame (intra-coded picture) is a complete image, like a JPEG image file.
A P-frame (predicted picture) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P-frame, which saves space. P-frames are also known as delta frames.
A B-frame (bidirectionally predicted picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.
Picture/Frame:
The terms picture and frame are used interchangeably. The term picture is the more general notion, as a picture can be either a frame or a field. A frame is a complete image, and a field is the set of odd-numbered or even-numbered scan lines composing a partial image. For example, an HD 1080 picture has 1080 lines of pixels. An odd field consists of pixel information for lines 1, 3, 5, ..., 1079. An even field has pixel information for lines 2, 4, 6, ..., 1080. When video is sent in interlaced scan format, each frame is sent in two fields: the field of odd-numbered lines followed by the field of even-numbered lines. A frame used as a reference for predicting other frames is called a reference frame.

Frames encoded without information from other frames are called I-frames. Frames that use prediction from a single preceding reference frame are called P-frames. Frames that use prediction from an average of two reference frames, one preceding and one succeeding, are called B-frames.
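As a small illustration of how these frame types are typically arranged (a hypothetical IBBP display-order pattern; actual GOP structures are an encoder choice), B-frames must be transmitted and decoded after both of their reference frames:

```python
# Display order of a hypothetical short GOP.
display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]

# A B-frame references the nearest preceding and following I/P frame, so the
# encoder emits those references first. One common coding (bitstream) order:
coding_order = ["I0", "P3", "B1", "B2", "P6", "B4", "B5"]

# The decoder receives coding_order, decodes each frame, and re-sorts the
# decoded pictures back into display order before presentation.
assert sorted(coding_order, key=lambda f: int(f[1:])) == display_order
```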
Slices:
A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame. I-slices, P-slices and B-slices take the place of I, P and B frames.
Macroblocks:
A macroblock is a processing unit in image and video compression formats based on linear block transforms, typically the DCT. It consists of 16x16 samples, is subdivided into transform blocks, and may be further subdivided into prediction blocks.
Partitioning of picture:
Slices:
•A picture is split into 1 or several slices
•Slices are self-contained
•Slices are a sequence of macroblocks
Macroblocks:
•Basic syntax & processing unit
•Contains 16x16 luma samples and 2 x 8x8 chroma samples
•Macroblocks within a slice depend on each other
•Macroblocks can be further partitioned
Elements of video encoding and decoding:

Video coding basic system


Encoder block diagram of a typical block-based hybrid coder

Discrete Cosine Transform - The DCT decomposes each input block into a series of waveforms, each with a specific spatial frequency, and outputs an 8x8 block of horizontal and vertical frequency coefficients.
Quantization - The quantization block uses psychovisual characteristics to eliminate the unimportant (mainly high-frequency) DCT coefficients.
Inverse Quantization - IQ reconstructs the DCT coefficients by multiplying the quantized coefficients with the quantization table.
Inverse Discrete Cosine Transform - The IDCT reconstructs an approximation of the original input block; errors are expected due to quantization.
Motion Estimation - ME uses a scheme with fewer search locations and fewer pixels to generate motion vectors indicating the displacement of moving image regions.
Motion Compensation - The MC block increases the compression ratio by removing the redundancies between frames.
Variable Length Coding (lossless) - VLC reduces the bit rate by sending shorter codes for common (run of zeros, level) pairs and longer codes for less common pairs.
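A minimal sketch of the transform/quantization path of such a hybrid coder, assuming an 8x8 residual block, a flat quantization step for simplicity rather than a real MPEG quantization matrix, and a hand-built orthonormal DCT-II matrix:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix(8)
block = np.random.default_rng(1).integers(-128, 128, (8, 8)).astype(float)

coeffs = C @ block @ C.T                  # forward 8x8 DCT
q_step = 16.0                             # flat quantizer step (illustrative only)
quantized = np.round(coeffs / q_step)     # quantization (the lossy step)
dequantized = quantized * q_step          # inverse quantization
reconstructed = C.T @ dequantized @ C     # inverse DCT

# The reconstruction error stays on the order of the quantization step.
print(np.abs(block - reconstructed).max())
```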

Decoder block diagram

An example is a wireless video codec (encoder and decoder) that includes pre-processing of the captured data to interface with the encoder and post-processing of the data to interface with the LCD panel. The video codec is compliant with the low bit-rate codec for multimedia telephony defined by the Third Generation Partnership Project (3GPP).
The baseline codec defined by 3GPP is H.263, and the MPEG-4 Simple Visual Profile is defined as an option. The video codec implemented supports the following video formats:
1. SQCIF or 128 x 96 resolution
2. QCIF or 176 x 144 resolution at Simple Profile Level 1
3. CIF or 352 x 288 resolution at Simple Profile Level 2
4. 64 kbit/s for Simple Profile Level 1
5. 128 kbit/s for Simple Profile Level 2
Video CODEC Description
The video encoder implemented requires a YUV 4:2:0 non-interlaced video input and, therefore, pre-processing of the video input may be required depending on the application. For the video decoder, post-processing is needed to convert the decoded YUV 4:2:0 data to RGB for display.
Features
1. Pre-processing:
− YUV 4:2:2 interlaced (from a camera, for example) to YUV 4:2:0 non-interlaced; only decimation, no filtering, of the UV components.
2. Post-processing:
− YUV 4:2:0 to RGB conversion
− Display formats of 16 bits or 12 bits RGB
− 0 to 90 degrees rotation for landscape and portrait displays
3. MPEG-4 Simple Profile Level 0, Level 1 and Level 2 support
4. H.263 and MPEG-4 decoder and encoder compliant
5. MPEG-4 video decoder options are:
− AC/DC prediction
− Reversible Variable Length Coding (RVLC)
− Resynchronization Marker (RM)
− Data Partitioning (DP)
− Error concealment, proprietary techniques
− 4 Motion Vectors per Macroblock (4MV)
− Unrestricted Motion Compensation
− Decode VOS layers
6. MPEG-4 video encoder options are:
− Reversible Variable Length Coding (RVLC)
− Resynchronization Marker (RM)
− Data Partitioning (DP)
− 4 Motion Vectors per Macroblock (4MV)
− Header Extension Codes
− Bit rate target change during encoding
− Coding frame rate change during encoding
− Insertion or not of Visual Object Sequence start code
7. Insertion of I-frame during the encoding of a sequence support
8. Encoder Adaptive Intra Refresh (AIR) support
9. Multi-codec support, multiple codecs running from the same code
Video Architecture
Pixel Representation
Red, Green and Blue (RGB) are the primary colors for a computer display, and the color depth supported by the OMAP5910 is programmable up to 16 bits per pixel, RGB565 (5 bits for red, 6 bits for green and 5 bits for blue). In consumer video such as DVD, cameras, digital TV and others, the common color coding scheme is YCbCr, where Y is the luminance, Cb is the blue chrominance and Cr is the red chrominance. The human eye is much more sensitive to the Y component of the video, and this enables sub-sampling of the chrominance components without the loss being detected by the eye. The sampling scheme is referred to as YCbCr 4:2:0, YCbCr 4:2:2 or YCbCr 4:4:4.
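A sketch of the ideas above, using a BT.601-style full-range RGB-to-YCbCr conversion and 4:2:0 chroma decimation; the exact conversion matrix and rounding used by a given device are assumptions here:

```python
import numpy as np

def rgb_to_ycbcr420(rgb):
    """rgb: HxWx3 uint8 array with even H and W. Returns (Y, Cb, Cr)."""
    r, g, b = [rgb[..., i].astype(float) for i in range(3)]
    # BT.601 full-range conversion (one common convention among several).
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0
    # 4:2:0 = keep every Y sample, average chroma over each 2x2 block.
    cb420 = cb.reshape(cb.shape[0] // 2, 2, cb.shape[1] // 2, 2).mean(axis=(1, 3))
    cr420 = cr.reshape(cr.shape[0] // 2, 2, cr.shape[1] // 2, 2).mean(axis=(1, 3))
    return (y.round().astype(np.uint8),
            cb420.round().astype(np.uint8),
            cr420.round().astype(np.uint8))

def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B values into one 16-bit RGB565 word."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
```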
Video coding standards MPEG and H.26X
The Moving Picture Experts Group (MPEG) was established in 1988 in the framework of the Joint
ISO/IEC Technical Committee (JTC 1) on Information Technology with the mandate to
develop standards for coded representation of moving pictures, associated audio and their
combination when used for storage and retrieval on Digital Storage Media with a bitrate at up
to about 1.5 Mbit/s. The standard was nicknamed MPEG-1 and was issued in 1992. The scope of the
group was later extended to provide appropriate MPEG-2 video and associated audio compression
algorithms for a wide range of audio-visual applications at substantially higher bitrates not
successfully covered or envisaged by the MPEG-1 standard. Specifically, MPEG-2 was given the
charter to provide video quality not lower than NTSC/PAL and up to CCIR601 quality with bitrates
targeted between 2 and 10 Mbit/s. Emerging applications, such as digital cable TV distribution,
networked database services via ATM, digital VTR applications, and satellite and terrestrial digital
broadcasting distribution, were seen to benefit from the increased quality expected to result from the
emerging MPEG-2 standard. The MPEG-2 standard was released in 1994. The Table I below
summarizes the primary applications and quality requirements targeted by the MPEG-1 and MPEG-2
video standards together with examples of typical video input parameters and compression ratios
achieved.
The MPEG-1 and MPEG-2 video compression techniques developed and standardized by the MPEG group have become important and successful video coding standards worldwide, with an increasing number of MPEG-1 and MPEG-2 VLSI chip-sets and products becoming available on the market. One key factor for this success is the generic structure of the MPEG standards, supporting a wide range of applications and application-specific parameters [schaf, siko1]. To support the wide range of application profiles, a diversity of input parameters, including flexible picture size and frame rate, can be specified by the user. Another important factor is that the MPEG group only standardized the decoder structures and the bitstream formats. This allows a large degree of freedom for manufacturers to optimize the coding efficiency (in other words, the video quality at a given bit rate) by developing innovative encoder algorithms even after the standards were finalized.
MPEG-1 Standard (1991) (ISO/IEC 11172)
•Target bit-rate about 1.5 Mbps
•Typical image format CIF, no interlace
•Frame rate 24 ... 30 fps
•Main application: video storage for multimedia (e.g., on CD-ROM)
MPEG-2 Standard (1994) (ISO/IEC 13818)
•Extension for interlace, optimized for TV resolution (NTSC: 704 x 480 pixels)
•Image quality similar to NTSC, PAL, SECAM at 4-8 Mbps
•HDTV at 20 Mbps
MPEG-4 Standard (1999) (ISO/IEC 14496)
•Object-based coding
•Wide range of applications, with choices of interactivity, scalability, error resilience, etc.

MPEG-1: coding of I-pictures

•I-pictures: intraframe coded
•8x8 DCT
•Arbitrary weighting matrix for coefficients
•Differential coding of DC coefficients
•Uniform quantization
•Zig-zag scan, run-level coding
•Entropy coding
•Unfortunately, not quite JPEG
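The zig-zag scan and run-level coding mentioned above can be sketched as follows. The zig-zag ordering is the standard 8x8 pattern; the run-level pairing is a simplified illustration, not the exact MPEG-1 VLC table.

```python
import numpy as np

def zigzag_order(n=8):
    """Return the (row, col) visiting order of an n x n zig-zag scan."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level(block):
    """Turn a quantized 8x8 block into (run-of-zeros, level) pairs."""
    scanned = [block[r, c] for r, c in zigzag_order(len(block))]
    pairs, run = [], 0
    for v in scanned[1:]:            # DC coefficient is coded separately
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    pairs.append("EOB")              # end-of-block marker
    return pairs
```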
MPEG-1: coding of P-pictures

•Motion-compensated prediction from an encoded I-picture or P-picture (DPCM)
•Half-pel accuracy of motion compensation, bilinear interpolation
•One displacement vector per macroblock
•Differential coding of displacement vectors
•Coding of the prediction error with 8x8 DCT, uniform threshold quantization, zig-zag scan as in I-pictures
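Half-pel motion compensation interpolates samples that lie between integer pixel positions. A minimal sketch using bilinear averaging of the surrounding integer pixels, with a hypothetical ref array and a motion vector given in half-pel units (frame-boundary handling is omitted):

```python
import numpy as np

def predict_halfpel(ref, y, x, mv_y2, mv_x2, block=16):
    """Fetch a motion-compensated block; mv_y2/mv_x2 are in half-pel units."""
    iy, fy = divmod(mv_y2, 2)            # integer part and half-pel flag
    ix, fx = divmod(mv_x2, 2)
    top, left = y + iy, x + ix
    # Take a (block+1) x (block+1) patch so neighbours can be averaged.
    patch = ref[top:top + block + 1, left:left + block + 1].astype(float)
    a = patch[:block, :block]
    b = patch[:block, 1:block + 1]       # right neighbour
    c = patch[1:block + 1, :block]       # bottom neighbour
    d = patch[1:block + 1, 1:block + 1]
    if fy and fx:
        return (a + b + c + d) / 4.0     # half-pel in both directions
    if fx:
        return (a + b) / 2.0             # horizontal half-pel
    if fy:
        return (a + c) / 2.0             # vertical half-pel
    return a                             # integer-pel position
```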

MPEG-1: coding of B-pictures

•Motion-compensated prediction from two consecutive P- or I-pictures
•either: only forward prediction (1 vector/macroblock)
•or: only backward prediction (1 vector/macroblock)
•or: average of forward and backward prediction = interpolation (2 vectors/macroblock)
•Half-pel accuracy of motion compensation, bilinear interpolation
•Coding of the prediction error with 8x8 DCT, uniform quantization, zig-zag scan as in I-pictures

MPEG-4

•Supports highly interactive multimedia applications as well as traditional applications
•Advanced functionalities: interactivity, scalability, error resilience, ...
•Coding of natural and synthetic audio and video, as well as graphics
•Enables the multiplexing of audiovisual objects and their composition into a scene
MPEG-4: scene with audiovisual objects
MPEG Family
◼ MPEG-1
Similar to H.263 CIF in quality
◼ MPEG-2
Higher quality: DVD, Digital TV, HDTV
◼ MPEG-4/H.264
More modern codec.
Aimed at lower bitrates.
Works well for HDTV too.
MPEG-1 Compression
◼ MPEG: Moving Picture Experts Group
◼ Finalized in 1991
◼ Optimized for video resolutions:
352x240 pixels at 30 fps (NTSC)
352x288 pixels at 25 fps (PAL/SECAM)
◼ Optimized for bit rates around 1-1.5Mb/s.
◼ Syntax allows up to 4095x4095 at 60fps, but not commonly
used.
◼ Progressive scan only (not interlaced)
MPEG Frame Types
◼ Unlike H.261, each frame must be of one type.
H.261 can mix intra and inter-coded MBs in one frame.
◼ Three types in MPEG:
I-frames (like H.261 intra-coded frames)
P-frames (“predictive”, like H.261 inter-coded frames)
B-frames (“bidirectional predictive”)
MPEG I-frames
◼ Similar to JPEG, except:
 Luminance and chrominance share quantization tables.
 Quantization is adaptive (the table can change) for each macroblock.
◼ Unlike H.261, every n frames a full intra-coded frame is included.
 Permits skipping: start decoding at the first I-frame following the point you skip to.
 Permits fast scan: just play the I-frames.
 Permits playing backwards (decode the previous I-frame, decode the frames that depend on it, play the decoded frames in reverse order).
◼ An I-frame and the successive frames up to the next I-frame (n frames) are known as a Group of Pictures (GOP).
MPEG P-Frames
◼ Similar to an entire frame of H.261 inter-coded blocks.
Half-pixel accuracy in motion vectors (pixels are
averaged if needed).
◼ May code from previous I frame or previous P frame.
B-frames
◼ Bidirectional Predictive Frames.
◼ Each macroblock contains two sets of motion vectors.
◼ Coded from one previous frame, one future frame, or a combination
of both.
1. Do motion vector search separately in past reference frame and
future reference frame.
2. Compare:
◼ Difference from past frame.
◼ Difference from future frame.
◼ Difference from average of past and future frame.
3. Encode the version with the least difference.
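A minimal sketch of the mode decision described in steps 1-3 above, assuming hypothetical numpy frames cur, past and future and a simple MAE-based search helper (real encoders add rate considerations to this choice):

```python
import numpy as np

def best_match(target, ref, by, bx, block=16, radius=7):
    """Search ref around (by, bx); return (cost, predicted block)."""
    H, W = ref.shape
    best = (np.inf, None)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= H - block and 0 <= x <= W - block:
                cand = ref[y:y + block, x:x + block].astype(float)
                cost = np.abs(target - cand).mean()
                best = min(best, (cost, cand), key=lambda t: t[0])
    return best

def b_frame_mode(cur, past, future, by, bx, block=16):
    """Pick forward, backward or interpolated prediction for one macroblock."""
    target = cur[by:by + block, bx:bx + block].astype(float)
    fwd_cost, fwd = best_match(target, past, by, bx, block)    # from past frame
    bwd_cost, bwd = best_match(target, future, by, bx, block)  # from future frame
    avg = (fwd + bwd) / 2.0                                    # interpolation
    avg_cost = np.abs(target - avg).mean()
    # Encode whichever version leaves the smallest residual.
    return min([("forward", fwd_cost), ("backward", bwd_cost),
                ("interpolated", avg_cost)], key=lambda t: t[1])
```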
B-frame disadvantages
◼ Computational complexity.
 More motion search; need to decide whether or not to average.
◼ Increase in memory bandwidth.
 Extra picture buffer needed.
 Need to store frames and encode or play back out of order.
◼ Delay
 Adds several frames of delay at the encoder while waiting for the needed later frame.
 Adds several frames of delay at the decoder, which holds a decoded I/P frame while decoding and playing the prior B-frames that depend on it.
B-frame advantage
◼ B-frames increase compression.
◼ Typically use twice as many B frames as I+P frames.
MPEG-2
◼ ISO/IEC standard in 1995
◼ Aimed at higher quality video.
◼ Supports interlaced formats.
◼ Many features, but has profiles which constrain common subsets of those features:
 Main profile (MP): 2-15 Mb/s over broadcast channels (e.g., DVB-T) or storage media (e.g., DVD)
 PAL quality: 4-6 Mb/s; NTSC quality: 3-5 Mb/s.
MPEG-3
◼ Doesn’t exist.
Was aimed at HDTV.
Ended up being folded into MPEG-2.
MPEG-4
◼ ISO/IEC designation 'ISO/IEC 14496': 1999
◼ MPEG-4 Version 2: 2000
◼ Aimed at low bitrate (10Kb/s)
◼ Can scale very high (1Gb/s)
◼ Based around the concept of the composition of basic
video objects into a scene.

H.26X
H.261 Video Compression Standard

•First major video compression standard
•Targeted at two-way video conferencing and at ISDN networks that supported 40 kbps to 2 Mbps.
•Supported resolutions include CIF and QCIF.
•Chrominance subsampling of 4:2:0.
•Low complexity and low delay to support real-time communications.
•Only I- and P-frames; no B-frames.
•Full-pixel accuracy motion estimation.
•8x8 block-based DCT coding of the residual.
•Fixed linear quantization across all AC coefficients of the DCT.
•Run-length coding of quantized DCT coefficients followed by Huffman coding for DCT and motion information.
•Loop filtering (a simple digital filter applied on the block edges) applied to reference frames to reduce blocking artifacts.
ISO/IEC MPEG-2 / ITU-T H.262
•Profiles defined for scalable video applications, with scalable coding tools to allow multiple-layer video coding, including temporal, spatial and SNR scalability, and data partitioning.
•MPEG-2 Main Profile supports single-layer (non-scalable) coding and is the one that is widely deployed.
•MPEG-2 non-scalable (single-layer) profiles:
‒Simple profile: no B-frames, for low-delay applications
‒Main profile: support for B-frames; can also decode MPEG-1 video
•MPEG-2 scalable profiles:
‒SNR profile: adds enhancement layers for DCT coefficient refinement
‒Spatial profile: adds support for enhancement layers carrying the coded image at different spatial resolutions (sizes)
‒High profile: adds support for coding a 4:2:2 video signal and includes the scalability tools of the SNR and spatial profiles
ITU-T H.263: Main Features
Enhancement of H.261
Baseline algorithm
•Half-pixel accuracy motion estimation and compensation.
•MVs differentially coded, with median MV prediction.
•8x8 discrete cosine transform and uniform quantization.
•Variable length coding of DCT coefficients and MVs.
Four optional modes
•Unrestricted Motion Vector (UMV) mode: increased motion vector range with frame boundary extrapolation.
•Advanced Prediction (AP) mode:
‒4 MVs per macroblock.
‒Overlapped Block Motion Compensation (OBMC).
•PB-frame mode: bi-directional prediction.
•Arithmetic coding mode.
•About 3 to 4 dB PSNR improvement over H.261 at bit-rates less than or equal to 64 kbit/s.
•About 30% saving in bit-rate compared to MPEG-1.
Design flexibility (things not specified by the standard)
•The H.263 standard inherently has the capability to adapt to varying input video content.
•Frame level: Intra, Inter or skipped.
•Macroblock (MB) level:
‒Intra, Inter or un-coded.
‒One MV or 4 MVs.
‒Quantizer parameter (QP) value. A constant QP gives almost constant quality with a variable bit-rate; a varying QP gives variable quality while trying to achieve an almost constant bit-rate.
H.261
Compression standard defined by the ITU-T for the provision of video telephony and videoconferencing services over ISDN at multiples of 64 kbps.
CIF (videoconferencing) or quarter CIF (QCIF) (video telephony) formats are used.
Each frame is divided into macroblocks of 16 x 16 pixels.
Only I- and P-frames are used, with three P-frames between each pair of I-frames.
The start of each new encoded video frame is indicated by the picture start code.
H.263
Defined by the ITU-T for use in video applications over wireless and PSTN
‒e.g. video telephony, video conferencing, security surveillance, interactive game playing
‒real-time applications over a modem
• therefore, 28.8 kbps - 56 kbps
Based on H.261, but H.261 gives poor picture quality below 64 kbps
‒therefore H.263 is more advanced
QCIF and sub-QCIF are used
Horizontal resolution is reduced
Uses I-, P- and B-frames
Also, neighbouring pairs of P- and B-frames can be encoded as a single entity
‒PB-frame
• reduced encoding overheads
• increased frame rate
Other mechanisms used:
‒unrestricted motion vectors
‒error resilience
‒error tracking
‒independent segment decoding
‒reference picture selection

CH: 8 VIDEO SEGMENTATION

Temporal Segmentation:
Segmentation is highly dependent on the model and criteria for grouping pixels into regions. In motion segmentation, pixels are grouped together based on their similarity in motion. For any given application, the segmentation algorithm needs to find a balance between model complexity and analysis stability. An insufficient model will inevitably result in over-segmentation. Complicated models will introduce more complexity and require more computation and constraints for stability. In image coding, the objective of segmentation is to exploit the spatial and temporal coherences in the video data by adequately identifying the coherent motion regions with simple motion models.
Block-based video coders avoid the segmentation problem altogether by artificially imposing a regular array of blocks and applying motion coherence within these blocks. This model requires very little overhead in coding, but it does not accurately describe an image and does not fully exploit the coherences in the video data. Region-based approaches, which exploit the coherence of object motion by grouping similar motion regions into a single description, have shown improved performance over block-based coders.
In layered representation coding [14,15], video data is decomposed into a set of overlapping layers. Each layer consists of: an intensity map describing the intensity profile of a coherent motion region over many frames; an alpha map describing its relationship with other layers; and a parametric motion map describing the motion of the region. The layered representation has the potential to achieve greater compression because each layer exploits both the spatial and temporal coherences of the video data. In addition, the representation is similar to those used in computer graphics, and so it provides a convenient way to manipulate video data. Our goal in spatiotemporal segmentation is to identify the spatial and temporal coherences in video data and derive the layered representation for the image sequence.
Temporal coherence
Motion estimation provides the necessary information for locating corresponding regions in different
frames. The new positions for each region can be predicted given the previously estimated motion for
that region. Motion models are estimated within each of these predicted regions and an updated set of
motion hypotheses derived for the image. Alternatively, the motion models estimated from the
previous segmentation can be used by the region classifier to directly determine the corresponding
coherent motion regions. Thus, segmentation based on motion conveniently provides a way to track
coherent motion regions.
In addition, when the analysis is initialized with the segmentation results from the previous frame, computation is reduced and the robustness of the estimation is increased.

Temporal segmentation adds structure to the video by partitioning the video into chapters.
This is a first step for video summarization methods, which should also enable fast browsing
and indexing so that a user can quickly discover important activities or objects.
Shot boundary detection, hard cut and soft cuts:
The concept of temporal image sequence (video) segmentation is not a new one, as it dates back to the
first days of motion pictures, well before the introduction of computers. Motion picture specialists
perceptually segment their works into a hierarchy of partitions. A video (or film) is completely and
disjointly segmented into a sequence of scenes, which are subsequently segmented into a sequence of
shots. Scenes (also called story units) are a concept that is much older than motion pictures, ultimately
originating in the theater. Traditionally, a scene is a continuous sequence that is temporally and
spatially cohesive in the real world, but not necessarily cohesive in the projection of the real world on
film. On the other hand, shots originate with the invention of motion cameras and are defined as the
longest continuous sequence that originates from a single camera take, which is what the camera
images in an uninterrupted run. In general, the automatic segmentation of a video into scenes ranges
from very difficult to intractable. On the other hand, video segmentation into shots is both exactly
defined and also characterized by distinctive features of the video stream itself. This is because video
content within a shot tends to be continuous, due to the continuity of both the physical scene and the
parameters (motion, zoom, focus) of the camera that images it.
Therefore, in principle, the detection of a shot change between two adjacent frames simply requires computing an appropriate continuity or similarity metric. However, this simple concept has three major complications. The first, and most obvious, is defining a continuity metric for the video in such a way that it is insensitive to gradual changes in camera parameters, lighting and physical scene content, easy to compute, and discriminating enough to be useful. The simplest way to do this is to extract one or more scalar or vector features from each frame and to define distance functions on the feature domain. Alternatively, the features themselves can be used either for clustering the frames into shots or for detecting shot transition patterns. The second complication is deciding which values of the continuity metric correspond to a shot change and which do not. This is not trivial, since the feature variation within certain shots can exceed the respective variation across shots. Decision methods for shot boundary detection include fixed thresholds, adaptive thresholds and statistical detection methods. The third complication, and the most difficult to handle, is the fact that not all shot changes are abrupt. Using motion picture terminology, changes between shots can belong to the following categories:

1. Cut. This is the classic abrupt change case, where one frame belongs to the disappearing shot and
the next one to the appearing shot.
2. Dissolve. In this case, the last few frames of the disappearing shot temporally overlap with the
first few frames of the appearing shot. During the overlap, the intensity of the disappearing shot
decreases from normal to zero (fade out), while that of the appearing shot increases from zero to
normal (fade in).
3. Fade. Here, first the disappearing shot fades out into a blank frame, and then the blank frame fades
in into the appearing shot.
4. Wipe. This is actually a set of shot change techniques, where the appearing and disappearing shots
coexist in different spatial regions of the intermediate video frames, and the region occupied by the
former grows until it entirely replaces the latter.
5. Other transition types. There is a multitude of inventive special effects techniques used in motion
pictures. These are in general very rare and difficult to detect.

Shot-boundary detection is the first step towards scene extraction in videos, which is useful for video content analysis and indexing. A shot in a video is a sequence of frames taken continuously by one camera. A common approach to detecting shot boundaries consists of computing a similarity between pairs of consecutive frames and marking the occurrence of a boundary where the similarity is lower than some threshold. The similarity is measured either globally, for example with histograms, or locally within rectangular blocks. Previously, luminance/color, edges, texture and SIFT features have been used to represent individual frames.
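A minimal sketch of this histogram-based hard-cut detector, assuming grayscale frames as numpy arrays, a fixed threshold, and an L1 histogram distance; real systems typically use adaptive thresholds and additional features:

```python
import numpy as np

def frame_histogram(frame, bins=64):
    """Normalized grayscale histogram of one frame (2-D uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.4):
    """Return indices i where a hard cut is declared between frame i and i+1."""
    cuts = []
    hists = [frame_histogram(f) for f in frames]
    for i in range(len(hists) - 1):
        # L1 distance between consecutive histograms; large distance = low similarity.
        dist = np.abs(hists[i] - hists[i + 1]).sum()
        if dist > threshold:
            cuts.append(i)
    return cuts
```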
