
• Fundamentals of Video Coding: Inter-frame redundancy, motion estimation techniques – full search, fast search strategies, forward and backward motion prediction, frame classification – I, P and B;
• Video sequence hierarchy – Group of pictures, frames, slices, macro-blocks and blocks; Elements of a video encoder and decoder; Video coding standards – MPEG and H.26X.
• Video Segmentation: Temporal segmentation – shot boundary detection, hard-cuts and soft-cuts; spatial segmentation – motion-based; Video object detection and tracking.
PART-I
FUNDAMENTALS OF VIDEO CODING
What is the difference between inter-frame and intra-frame coding?
• Intra-frame means that all the compression is done within that single frame, generating what is sometimes referred to as an I-frame.
• Inter-frame refers to compression that takes place across two or more frames, where the encoding scheme only keeps the information that changes between frames.
Inter frame redundancy:
• An inter frame is a frame in a video compression stream which is
expressed in terms of one or more neighboring frames.
• The "inter" part of the term refers to the use of Inter frame prediction.
• This kind of prediction tries to take advantage from temporal
redundancy between neighboring frames enabling higher compression
rates.
• The term intra-frame coding refers to the fact that the various lossless
and lossy compression techniques are performed relative to information
that is contained only within the current frame, and not relative to any
other frame in the video sequence.
Motion Estimation Techniques
Full Search, Fast Search Strategies
Motion estimation:
• The temporal encoding aspect of this system relies on the assumption that rigid body motion is
responsible for the differences between two or more successive frames. The objective of the
motion estimator is to estimate the rigid body motion between two frames.
• The motion estimator operates on all 16 x 16 image blocks of the current frame and generates the pixel displacement, or motion vector, for each block. The technique used to generate motion vectors is called block-matching motion estimation.
• The method uses the current frame I_k and the previously reconstructed frame f_(k-1) as input. Each block in the previous frame is assumed to have a displacement that can be found by searching for it in the current frame.
• The search is usually constrained to be within a reasonable neighborhood so as to minimize
the complexity of the operation.
• Search matching is usually based on a minimum MSE or MAE criterion. When a match is found, the pixel displacement is used to encode the particular block. If a search does not meet a minimum MSE or MAE threshold criterion, the motion compensator will indicate that the current block is to be spatially encoded using the intra-frame mode.
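For reference, the MAE and MSE matching criteria mentioned above can be written as follows (a standard formulation, with N the block size and (dx, dy) the candidate displacement; the notation is my own, not from the slides):

```latex
\mathrm{MAE}(d_x, d_y) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1}
  \left| I_k(x+i,\, y+j) - f_{k-1}(x+i+d_x,\, y+j+d_y) \right|
\qquad
\mathrm{MSE}(d_x, d_y) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1}
  \left( I_k(x+i,\, y+j) - f_{k-1}(x+i+d_x,\, y+j+d_y) \right)^2
```

The motion vector for the block at (x, y) is the displacement (dx, dy) that minimises the chosen criterion over the search window.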
Motion Estimation technique full search and fast search techniques:

• Motion estimation (ME) is used extensively in video codecs based on MPEG-4 standards to
remove interframe redundancy.
• Motion estimation is based on the block-matching method, which evaluates block mismatch by the sum of squared differences (SSD) measure. Winograd's Fourier transform can be applied, and the redundant computation of the overlapped area among reference blocks eliminated, in order to reduce the computational amount of ME.
• When the block size is N × N and the search window contains the same number of reference (candidate) blocks, this method reduces the computational amount (additions and multiplications) to 58 % of the straightforward approach for N = 8 and to 81 % for N = 16, without degrading motion-tracking capability.
• The proposed fast full-search ME method enables more accurate motion estimation than conventional fast ME methods, and can therefore be applied in video systems.
• The popularity of video as a means of data representation and transmission is increasing; hence the requirements on video quality and size are growing.
• High visual quality of compressed video is provided by efficient coding.
• In the 1960s, motion estimation (ME) and compensation were proposed to improve the efficiency of video coding.
• The current frame is divided into non-overlapping blocks. For each block of the current frame, the most similar block of the reference frame within a limited search area is found. The criterion used to measure the similarity of two blocks is called the matching metric. The position of the block for which an extremum of the metric is found determines the coordinates of the motion vector of the current block.
• The full search algorithm is the most accurate block ME method, i.e. the proportion of true motion vectors found is the highest. The current block is compared to all candidate blocks within the restricted search area in order to find the best match. This ME algorithm requires a lot of computing resources.
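A minimal full-search block-matching sketch in Python (the function name, parameters, and the MAE matching criterion are illustrative choices, not taken from any particular codec):

```python
import numpy as np

def full_search(current, reference, block_xy, block_size=16, search_range=7):
    """Exhaustive block matching for one block of the current frame.

    current, reference: 2-D numpy arrays (grayscale frames).
    block_xy: (row, col) of the top-left corner of the current block.
    Returns the motion vector (dr, dc) minimising the mean absolute error (MAE).
    """
    r, c = block_xy
    block = current[r:r + block_size, c:c + block_size].astype(np.int32)
    best_mv, best_mae = (0, 0), np.inf
    for dr in range(-search_range, search_range + 1):
        for dc in range(-search_range, search_range + 1):
            rr, cc = r + dr, c + dc
            # Skip candidate positions that fall outside the reference frame.
            if (rr < 0 or cc < 0 or rr + block_size > reference.shape[0]
                    or cc + block_size > reference.shape[1]):
                continue
            cand = reference[rr:rr + block_size, cc:cc + block_size].astype(np.int32)
            mae = np.mean(np.abs(block - cand))
            if mae < best_mae:
                best_mae, best_mv = mae, (dr, dc)
    return best_mv, best_mae
```

Every one of the (2 x 7 + 1)^2 = 225 candidate displacements is evaluated, which is why the full search is accurate but computationally expensive.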
• Therefore, many alternative fast motion estimation algorithms were developed. In 1981, T. Koga and co-authors proposed the three-step search algorithm (TSS).
• The disadvantage of fast search methods is that they may find only a local extremum of the block-difference function.
• Consequently, in some sequences the accuracy of motion estimation can drop by as much as half compared to the brute-force (full) search, and the visual quality of the video degrades as well.
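A compact sketch of the three-step search idea mentioned above (my own illustrative implementation of the general scheme, not Koga's original code): nine candidate positions around the current centre are compared, the centre moves to the best one, and the step size is halved until it reaches one.

```python
import numpy as np

def mad(block, cand):
    """Mean absolute difference between two equally sized blocks."""
    return np.mean(np.abs(block.astype(np.int32) - cand.astype(np.int32)))

def three_step_search(current, reference, block_xy, block_size=16, step=4):
    """Three-step search: 9 candidates per step, step halved each time (4, 2, 1)."""
    r0, c0 = block_xy
    block = current[r0:r0 + block_size, c0:c0 + block_size]
    cr, cc = r0, c0                          # current search centre in the reference frame
    while step >= 1:
        best = (cr, cc, np.inf)
        for dr in (-step, 0, step):
            for dc in (-step, 0, step):
                rr, cc2 = cr + dr, cc + dc
                if (rr < 0 or cc2 < 0 or rr + block_size > reference.shape[0]
                        or cc2 + block_size > reference.shape[1]):
                    continue
                cost = mad(block, reference[rr:rr + block_size, cc2:cc2 + block_size])
                if cost < best[2]:
                    best = (rr, cc2, cost)
        cr, cc = best[0], best[1]            # move the centre to the best candidate
        step //= 2
    return (cr - r0, cc - c0)                # motion vector relative to the block position
```

Only about 25 candidates are evaluated instead of 225, which is where both the speed-up and the risk of locking onto a local extremum come from.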
Spatial redundancy & Temporal redundancy
• Spatial redundancy is redundancy within a single picture or object,
for example repeated pixel values in a large area of blue sky.
• Temporal redundancy exists between successive pictures or
objects.
• In MPEG, where temporal compression is used, the current picture/object is not sent in its entirety; instead the difference between the current picture/object and the previous one is sent.
• The decoder already has the previous picture/object, and so it can
add the difference, or residual image, to make the current
picture/object. A residual image is created by subtracting every
pixel in one picture/object from the corresponding pixel in another.
This is trivially easy when pictures are restricted to progressive
scan.
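A toy numpy illustration of the residual idea described above (the frame values are made up for the example):

```python
import numpy as np

previous = np.array([[100, 100], [100, 100]], dtype=np.int16)   # picture the decoder already has
current  = np.array([[100, 102], [ 99, 100]], dtype=np.int16)   # new picture at the encoder

residual = current - previous          # only this (mostly zero) difference is transmitted
reconstructed = previous + residual    # the decoder adds the residual back
assert np.array_equal(reconstructed, current)
```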
Motion Estimation and Compensation
The Criterion To Compare Blocks
Forward and Backward Motion Prediction
Backward motion estimation:
• In backward motion estimation, the current frame is treated as the candidate frame and the reference frame on which the motion vectors are searched is a past frame; that is, the search is backward.
• Backward motion estimation leads to forward motion prediction (illustrated in the accompanying figure).
Forward motion estimation:
• It is just the opposite of backward motion estimation. Here, the search for motion vectors is carried out on a frame that appears later than the candidate frame in temporal order.
• In other words, the search is "forward". Forward motion estimation leads to backward motion prediction (illustrated in the accompanying figure).
• It may appear that forward motion estimation is unusual, since one requires future frames to predict the candidate frame. However, this is not unusual, since the candidate frame for which the motion vector is being sought is not necessarily the current, that is the most recent, frame.
• It is possible to store more than one frame and to use one of the past frames as a candidate frame, with another frame, appearing later in the temporal order, as its reference.
• Forward motion estimation (or backward motion compensation) is supported under the MPEG-1 and MPEG-2 standards, in addition to the conventional backward motion estimation.
• The standards also support bidirectional motion compensation, in which the candidate frame is predicted from a past reference frame as well as a future reference frame with respect to the candidate frame.
Basic approaches to motion estimation
• There exist two basic approaches to motion estimation:
• a) Pixel-based motion estimation
• b) Block-based motion estimation.
• The pixel-based motion estimation approach seeks to determine motion vectors for every pixel in the image. This is also referred to as the optical flow method, which works on the fundamental assumption of brightness constancy, that is, the intensity of a pixel remains constant when it is displaced.
• However, no unique match for a pixel in the reference frame can be found in the direction normal to the intensity gradient. It is for this reason that an additional constraint is introduced in terms of the smoothness of the velocity (or displacement) vectors in the neighborhood. The smoothness constraint makes the algorithm iterative and requires excessively large computation time, making it unsuitable for practical and real-time implementation.
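For reference, the brightness-constancy assumption mentioned above is usually written as follows (standard optical-flow notation, not taken from the slides):

```latex
I(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) = I(x, y, t)
\quad\Longrightarrow\quad
I_x\, u + I_y\, v + I_t = 0
```

Here (u, v) is the pixel velocity and I_x, I_y, I_t are the spatial and temporal intensity derivatives; this single equation cannot determine both u and v (only the component along the gradient), which is why the smoothness constraint is added.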
• An alternative and faster approach is block-based motion estimation. In this method, the candidate frame is divided into non-overlapping blocks (of size 16 x 16, 8 x 8, or even 4 x 4 pixels in the recent standards) and, for each such candidate block, the best motion vector is determined in the reference frame.
• Here, a single motion vector is computed for the entire block, whereby we make an inherent assumption that the entire block undergoes translational motion. This assumption is reasonably valid except at object boundaries, and smaller block sizes lead to better motion estimation and compression.
• Block-based motion estimation is accepted in all the video coding standards proposed to date. It is easy to implement in hardware, and real-time motion estimation and prediction is possible.
Frame classification:
• In video compression, a video frame is compressed using different algorithms. These different algorithms for video frames are called picture types or frame types, and they are I, P and B. The characteristics of the frame types are:
I-frame:
• I-frames are the least compressible but do not require other video frames to decode.
P-frame:
• A P-frame can use data from previous frames to decompress and is more compressible than an I-frame.
B-frame:
• A B-frame can use both previous and forward frames for data reference to get the highest amount of data compression.
• An I-frame (intra-coded picture) is a complete image, like a JPG image file.
• A P-frame (predicted picture) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P-frame, thus saving space. P-frames are also known as delta frames.
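As an illustration of how B-frames depend on a later reference, a typical (but not mandatory) group-of-pictures arrangement and its transmission order might look as follows; the frame labels are hypothetical:

```python
# Display order of a 7-frame group of pictures (GOP).
display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]

# A typical coded/transmission order: each I/P reference is sent before
# the B-frames that are predicted from it, so the decoder has both the
# past and the future reference available when it decodes a B-frame.
coded_order = ["I0", "P3", "B1", "B2", "P6", "B4", "B5"]
```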
Recovery from Bitstream Errors
PART-II
Video Sequence Hierarchy
Group of pictures/frames,
slices,
macro-blocks and blocks;
Elements of a video encoder and decoder;
Video coding standards – MPEG and H.26X.
Picture/Frame:
• The terms picture and frame are often used interchangeably, but picture is the more general notion, since a picture can be either a frame or a field. A frame is a complete image and a field is the set of odd-numbered or even-numbered scan lines composing a partial image. For example, an HD 1080 picture has 1080 lines of pixels.
• An odd field consists of pixel information for lines 1, 3, 5, ..., 1079. An even field has pixel information for lines 2, 4, 6, ..., 1080. When video is sent in interlaced-scan format, each frame is sent in two fields: the field of odd-numbered lines followed by the field of even-numbered lines.
• A frame used as a reference for predicting other frames is called a reference frame.
• Frames encoded without information from other frames are called I-frames. Frames that use prediction from a single preceding reference frame are called P-frames. Frames that use prediction from an average of two reference frames, one preceding and one succeeding, are called B-frames.
Slices:
• A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame. I-slices, P-slices, and B-slices take the place of I, P and B frames.
Macroblocks:
• A macroblock is a processing unit in image and video compression formats based on linear block transforms, typically the DCT. It consists of 16x16 samples, is subdivided into transform blocks, and may be further subdivided into prediction blocks.
Partitioning of picture:
Slices:
• A picture is split into 1 or several slices
• Slices are self-contained
• Slices are a sequence of macroblocks
Macroblocks:
• Basic syntax & processing unit
• Contains 16x16 luma samples and 2 x 8x8 chroma samples
• Macroblocks within a slice depend on each other
• Macroblocks can be further partitioned
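A small sketch of how a frame is partitioned into macroblocks with 4:2:0 chroma (the CIF resolution 352 x 288 is used purely as an example):

```python
def macroblock_layout(width, height, mb_size=16):
    """Number of macroblocks in a frame and the samples carried by each one (4:2:0)."""
    mbs_x, mbs_y = width // mb_size, height // mb_size
    luma_per_mb = mb_size * mb_size              # one 16x16 block of Y samples
    chroma_per_mb = 2 * (mb_size // 2) ** 2      # two 8x8 blocks: Cb and Cr
    return mbs_x * mbs_y, luma_per_mb, chroma_per_mb

# CIF frame: 352 x 288 -> 22 x 18 = 396 macroblocks, each with 256 luma + 128 chroma samples.
print(macroblock_layout(352, 288))
```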

Elements of video encoding and decoding:
• The encoder processes one macroblock at a time.
• Y is the brightness (luma), Cb is blue minus luma (B−Y) and Cr is red minus luma (R−Y).
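For reference, the commonly used ITU-R BT.601 relations behind these components are (added here for completeness; they are not stated on the original slide):

```latex
Y = 0.299\,R + 0.587\,G + 0.114\,B, \qquad
C_b = 0.564\,(B - Y), \qquad
C_r = 0.713\,(R - Y)
```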
• Discrete Cosine Transform- DCT transformation decomposes each input block into a series of
waveforms with a specific spatial frequency. Outputs an 8x8 block of horizontal and vertical
frequency coefficients.
• Quantization- The quantization block uses psychovisual characteristics to eliminate the unimportant DCT coefficients, mainly the high-frequency coefficients.
• Inverse Quantization- IQ reconstructs the DCT coefficients by multiplying the quantized values by the quantization table.
• Inverse Discrete Cosine Transform- IDCT computes an approximation of the original input block; errors are expected due to quantization.
• Motion Estimation- ME uses a scheme with fewer search locations and fewer pixels to generate motion vectors indicating the directions of the moving images.
• Motion Compensation- The MC block increases the compression ratio by removing the redundancies between frames.
• Variable Length Coding (lossless)- VLC reduces the bit rate by sending shorter codes for common pairs (run of zeros, non-zero value) and longer codes for less common pairs.
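A minimal sketch of the DCT / quantization / inverse-quantization / IDCT round trip described above (the uniform quantizer and its step size are my own simplifications; real codecs use a full quantization table):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (n x n), built directly from its definition."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix(8)
block = np.random.randint(0, 256, (8, 8)).astype(np.float64)   # one 8x8 luma block

coeffs = C @ block @ C.T                 # forward 2-D DCT
q_step = 16.0                            # illustrative uniform quantizer step
quantized = np.round(coeffs / q_step)    # quantization (the only lossy step)
dequantized = quantized * q_step         # inverse quantization
reconstructed = C.T @ dequantized @ C    # inverse 2-D DCT

print(np.max(np.abs(block - reconstructed)))   # small error, caused only by quantization
```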
Video coding standards – MPEG and H.26X
CIF (Common Intermediate Format or Common Interchange Format)
Summary
PART-III

Video Segmentation
Temporal segmentation
• shot boundary detection,
• hard-cuts and soft-cuts;
Spatial segmentation
• motion-based;
• Video object detection and tracking.
Temporal Segmentation:
• Segmentation is highly dependent on the model and criteria for grouping
pixels into regions.
• In motion segmentation, pixels are grouped together based on their
similarity in motion. For any given application, the segmentation algorithm
needs to find a balance between model complexity and analysis stability.
• An insufficient model will inevitably result in over segmentation.
Complicated models will introduce more complexity and require more
computation and constraints for stability.
• In image coding, the objective of segmentation is to exploit the spatial and
temporal coherences in the video data by adequately identifying the
coherent motion regions with simple motion models.
• Block-based video coders avoid the segmentation problem
altogether by artificially imposing a regular array of blocks
and applying motion coherence within these blocks.
• This model requires very small overhead in coding, but it
does not accurately describe an image and does not fully
exploit the coherences in the video data.
• Region-based approaches, which exploit the coherence of object motion by grouping similar-motion regions into a single description, have shown improved performance over block-based coders.
• In the layered representation coding, video data is decomposed into a
set of overlapping layers.
• Each layer consists of: an intensity map describing the intensity profile
of a coherent motion region over many frames; an alpha map
describing its relationship with other layers; and a parametric motion
map describing the motion of the region.
• The layered representation has the potential to achieve greater compression because each layer exploits both the spatial and temporal coherences of the video data.
• In addition, the representation is similar to those used in computer
graphics and so it provides a convenient way to manipulate video data.
Temporal coherence
• Motion estimation provides the necessary information for locating corresponding regions in different frames. The new position of each region can be predicted given the previously estimated motion for that region.
• Motion models are estimated within each of these predicted regions and an updated
set of motion hypotheses derived for the image. Alternatively, the motion models
estimated from the previous segmentation can be used by the region classifier to
directly determine the corresponding coherent motion regions.
• Thus, segmentation based on motion conveniently provides a way to track coherent
motion regions. In addition, when the analysis is initialized with the segmentation
results from previous frame, computation is reduced and robustness of estimation is
increased.
• Temporal segmentation adds structure to the video by partitioning it into chapters. This is a first step for video summarization methods, which should also enable fast browsing and indexing so that a user can quickly discover important activities or objects.
Shot boundary detection, hard cut and soft cuts:

• The concept of temporal image sequence (video) segmentation is not a new one, as it dates back to
the first days of motion pictures, well before the introduction of computers.
• Motion picture specialists perceptually segment their works into a hierarchy of partitions. A video (or
film) is completely and disjointly segmented into a sequence of scenes, which are subsequently
segmented into a sequence of shots.
• Scenes (also called story units) are a concept that is much older than motion pictures, ultimately
originating in the theater. Traditionally, a scene is a continuous sequence that is temporally and spatially
cohesive in the real world, but not necessarily cohesive in the projection of the real world on film.
• On the other hand, shots originate with the invention of motion cameras and are defined as the longest
continuous sequence that originates from a single camera take, which is what the camera images in an
uninterrupted run. In general, the automatic segmentation of a video into scenes ranges from very
difficult to intractable. On the other hand, video segmentation into shots is both exactly defined and also
characterized by distinctive features of the video stream itself.
• This is because video content within a shot tends to be continuous, due to the continuity of both the
physical scene and the parameters (motion, zoom, focus) of the camera that images it.
• Using motion picture terminology, changes between shots can belong to the following
categories:-
• 1. Cut. This is the classic abrupt change case, where one frame belongs to the
disappearing shot and the next one to the appearing shot.
• 2. Dissolve. In this case, the last few frames of the disappearing shot temporally overlap with the first few frames of the appearing shot. During the overlap, the intensity of the disappearing shot decreases from normal to zero (fade out), while that of the appearing shot increases from zero to normal (fade in); a simple model is sketched after this list.
• 3. Fade. Here, first the disappearing shot fades out into a blank frame, and then the blank
frame fades in into the appearing shot.
• 4. Wipe. This is actually a set of shot change techniques, where the appearing and
disappearing shots coexist in different spatial regions of the intermediate video frames,
and the region occupied by the former grows until it entirely replaces the latter.
• 5. Other transition types. There is a multitude of inventive special effects techniques used
in motion pictures. These are in general very rare and difficult to detect.
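A common way to model the dissolve (and, as a special case, the fade) transition is as a time-varying mixture of the two shots; this formulation is standard in the shot-detection literature and is added here for illustration:

```latex
g(x, t) = \alpha(t)\, S_1(x, t) + \bigl(1 - \alpha(t)\bigr)\, S_2(x, t),
\qquad \alpha(t): 1 \to 0 \text{ over the transition interval}
```

where S_1 is the disappearing shot and S_2 the appearing shot; a fade is obtained when one of the two is a constant blank frame.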
• Therefore, in principle, the detection of a shot change between two adjacent frames simply requires computing an appropriate continuity or similarity metric. However, this simple concept has three major complications.
• The first, and most obvious one, is defining a continuity metric for the video in such a way that it is insensitive to gradual changes in camera parameters, lighting, and physical scene content, easy to compute, and discriminative enough to be useful.
• The simplest way to do that is to extract one or more scalar or vector features from
each frame and to define distance functions on the feature domain. Alternatively the
features themselves can be used either for clustering the frames into shots, or for
detecting shot transition patterns.
• The second complication is deciding which values of the continuity metric
correspond to a shot change and which do not. This is not trivial, since the feature
variation within certain shots can exceed the respective variation across shots.
• Decision methods for shot boundary detection include fixed thresholds, adaptive
thresholds and statistical detection methods. The third complication, and the most
difficult to handle, is the fact that not all shot changes are abrupt.
• Shot-boundary detection is the first step towards scene
extraction in videos, which is useful for video content
analysis and indexing.
• A shot in a video is a sequence of frames taken continuously by one camera. A common approach to detecting shot boundaries consists of computing the similarity between pairs of consecutive frames and marking a boundary wherever the similarity falls below some threshold (a sketch follows below).
• The similarity can be measured globally, for example with histograms, or locally within rectangular blocks. Luminance/colour, edges, texture and SIFT features have been used to represent individual frames.
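A minimal sketch of such a histogram-based hard-cut detector (the bin count, the L1 distance and the threshold value are illustrative assumptions, not from the slides):

```python
import numpy as np

def gray_histogram(frame, bins=64):
    """Normalised grey-level histogram of one frame (2-D uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_hard_cuts(frames, threshold=0.4):
    """Return the indices i at which a hard cut is detected between frame i-1 and frame i."""
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist = gray_histogram(frame)
        if prev_hist is not None:
            # Large L1 distance between consecutive histograms => dissimilar frames => cut.
            if np.abs(hist - prev_hist).sum() > threshold:
                cuts.append(i)
        prev_hist = hist
    return cuts
```

Soft cuts (dissolves, fades, wipes) spread the change over many frames, so detecting them requires examining the behaviour of the metric over a window of frames rather than a single pair.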
END of Chap5….
