
Video segmentation
• Segmentation is the process of breaking a video into its constituent basic
elements, the shots, and into their higher-level aggregates, like episodes or
scenes.

• A common definition of shot is: “a sequence of frames that was (or appears
to be) continuously captured from the same camera”. A shot-break is the
transition from one shot to the next. Shot segmentation is therefore the
process of detecting transitions between two consecutive shots.
• The traditional approach to segmentation is to preview the whole video and
then annotate the shots and their boundaries with textual labels.
A fully manual segmentation of a movie may require approximately 10
hours of work for one hour of data.

• A less expensive approach uses edit decision lists created by video
producers during post-production. However, final changes to the video stream
can introduce misalignments with the edit decision lists, and a large part of
existing videos do not come with any edit decision list.

• Automatic segmentation is a viable approach to produce reliable shot
segmentation. Segmentation into episodes is highly dependent on the type of
video and on the context information available.
Shot segmentation and edit effects
• There are two types of shot transitions: sharp shot transitions (cuts) and gradual
shot transitions (fades, dissolves, wipes and mattes)

[Figure: examples of shot transitions: hard cut, dissolve (combined fade-out/fade-in), wipe, matte]
• Edit effects are used differently in different types of video

SPORTS VIDEO
– shots with large camera zoom-in
– shots with large, fast-moving objects (close-ups)
– shots with invasive edit effects (partial mattes)

NEWS VIDEO
– shots with little motion (only about 1/4 of the frame)
– shots with almost no motion

COMMERCIALS
– different edit effects and shots of different duration, depending on targets and goals
– Telecom: 14 shots (9 cuts, 4 horizontal wipes, 1 flip wipe); 1 very fast shot (5 frames); 1 shot with large motion; 12 shots with little or almost no motion
– Golia: 29 shots (29 cuts); 27 very fast shots (5 frames or less); 3 shots with fast motion; 1 shot with almost no motion
– Kia: 12 shots (12 cuts); 2 shots with large, fast-moving objects; 2 static shots
– Findus: 10 shots (10 cuts); 2 shots with large camera zoom-in; 1 shot with camera rotation; 7 shots with little or almost no motion
• Methods for edit effect detection and shot segmentation work either in the
uncompressed or in the compressed domain:
– In the uncompressed domain, solutions are based on the evaluation of a similarity
measure between successive images: when two images are sufficiently dissimilar, there may
be a cut. Gradual transitions are found by using cumulative difference measures.
– In the compressed domain, methods do not perform decoding/re-encoding, but exploit
the fact that the encoded video stream already contains a rich set of precomputed
features, such as motion vectors (MVs) and block averages (DC coefficients), that can
be used for temporal video segmentation.

• Shot segmentation is, however, complicated by:
– object motion: a person moving into a camera shot, ...
– camera motion: panning, zooming, …
– lighting changes: camera flashes, lightning, ...
– some types of shot boundary: dissolves, fades, ...
– digital effects: swirls, morphing, …

• To reduce false shot-change detections:
– algorithmic solutions
– threshold values (e.g., higher values)
– empirical restrictions (e.g., a shot must be longer than 100 frames)
…
Cut detection

• A cut is defined as a sharp transition between one shot and the one
following. It is obtained by simply joining two different shots without the
insertion of any other photographic effect.

[Figure: example of a hard cut between two shots]

• Automatic cut detection is based on information extracted from the shots
that contribute to the cut (brightness and color distribution changes,
motion, edges, ...).
• Cuts generally correspond to an abrupt change in the brightness pattern for two
consecutive images.

• Therefore, cuts between shots with small motion and constant illumination can
be easily detected by looking for sharp brightness changes. The principle
behind this approach is that, since two consecutive frames in a shot do not
change significantly in their background and object content, their overall
brightness distribution differs little.

• However, detection is difficult in the presence of continuous object motion,
camera movements, or changes of illumination within the shot. Research has
concentrated on developing algorithms that amplify the visual properties of the
shots in order to detect discontinuities in those properties.
Cut detection - uncompressed domain
Pairwise pixel comparison
full frame, pixelwise, intensity based

• Pairwise comparison is simply based on the differences between the gray
levels $I_{xy}(f_t)$, $I_{xy}(f_{t+1})$ of corresponding pixels (pointwise gray-level
difference) in two consecutive frames $f_t$ and $f_{t+1}$: $D_{cut} = |I_{xy}(f_t) - I_{xy}(f_{t+1})|$.
The normalized sum over a frame of $X \times Y$ pixels is:

$$D(t, t+1) = \frac{1}{XY}\sum_{x=1}^{X}\sum_{y=1}^{Y} |I_{xy}(f_t) - I_{xy}(f_{t+1})|$$

• Pairwise comparison can be extended to color frames by calculating the
pointwise color difference $D^p_{cut}$ in each color channel $p$ and summing
such differences: $D_{cut} = \sum_p D^p_{cut}$

• A sequence break is detected if the number of pixels that have changed
exceeds a certain threshold.

[Figure: pairwise pixel comparison example]
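As a minimal sketch (not from the slides), pairwise pixel comparison can be written in a few lines of NumPy; the pixel threshold of 20 gray levels and the 60% changed-pixel ratio are illustrative assumptions:

```python
import numpy as np

def pixel_change_ratio(frame_a, frame_b, pixel_thresh=20):
    """Fraction of pixels whose gray-level difference exceeds pixel_thresh.

    frame_a, frame_b: 2-D uint8 arrays (gray-level frames of equal size).
    """
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return np.count_nonzero(diff > pixel_thresh) / diff.size

def is_cut(frame_a, frame_b, ratio_thresh=0.6):
    """Declare a sequence break when enough pixels have changed."""
    return pixel_change_ratio(frame_a, frame_b) > ratio_thresh
```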


Consecutive frame differences
full frame, intensity based

• The average intensity difference is applied to two consecutive color frames
according to the following procedure:

– Compute the normalized sum $S_t$ of pixel intensity values for each frame $f_t$
of size $M \times N$:

$$S_t = \frac{1}{MN}\sum_{x=0}^{N-1}\sum_{y=0}^{M-1} I_{xy}(f_t)$$

– Evaluate the inter-frame difference between frames $f_{t-1}$, $f_t$ and $f_{t+1}$, in the
following manner:

$$d = \frac{S_t - S_{t+1}}{S_{t-1} - S_t}$$
Color histogram comparison
full frame, color histogram based

• The histogram comparison method is simply based on the differences between
the values of corresponding brightness histogram bins in two consecutive
frames:

$$d(f, f') = \sum_{j=1}^{N} |H(f, j) - H(f', j)|$$

• A sequence break is detected whenever a predefined threshold τ is
exceeded (peaks reveal cuts). The threshold can be obtained by computing
all the frame-to-frame differences and their mean µ and standard deviation σ.
The threshold is calculated as:

τ = µ + ασ

where α is typically a small number.
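A sketch of this adaptive global threshold applied to a precomputed series of frame-to-frame differences (α = 3 is an illustrative choice):

```python
import numpy as np

def detect_cuts(frame_diffs, alpha=3.0):
    """Flag cuts where the difference exceeds tau = mu + alpha * sigma.

    frame_diffs: 1-D array of d(f_t, f_{t+1}) for consecutive frame pairs.
    """
    tau = frame_diffs.mean() + alpha * frame_diffs.std()
    return np.flatnonzero(frame_diffs > tau)  # indices of detected breaks
```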


• For color images, this equation can be applied to each individual color channel. A
64-bin histogram (2 bits for each color channel) has been suggested in order to
obtain fairly accurate results.
• Peaks of the difference function are sharper for color images than for gray-level histograms.

[Figure: color histogram comparison example]


• Implementations of the color histogram methods differ in a number of
factors, including:
– The color space used to represent the pixel values.
– Threshold calculation. Threshold can be global or local, and can be
determined using several methods.
– Differencing criterion: it is possible to use several methods and metrics to
compute the difference of two histograms. Some of the most used criteria
are reported in the following.
Histogram intersection

• Histogram intersection is applied to the values of corresponding brightness
histogram bins in two consecutive frames:

$$d(f, f') = \sum_{j=0}^{N} \min(H(f, j), H(f', j))$$

• Since the intersection of two identical frames is equal to the number of pixels in
the frame, the dissimilarity metric is defined from the minima of the function:

$$D = N_{pixels} - d$$
Normalized χ2 test

• The normalized χ² test amplifies the distance between the color histogram bins of two
consecutive frames:

$$d(f, f') = \sum_{j=0}^{N} \frac{(H(f, j) - H(f', j))^2}{H(f', j)}$$

• Measures are not taken at full video rate, but at sampled frames (typically from 3
to 10 frames per second).

• A modification of the original χ² test that has been proposed is:

$$d(f, f') = \frac{1}{N^2}\sum_{j=0}^{N} \frac{(H(f, j) - H(f', j))^2}{\max(H(f, j), H(f', j))}$$
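For reference, the three histogram distances above can be sketched as follows (function names are mine; histograms are assumed to be 1-D NumPy arrays of equal length):

```python
import numpy as np

def bin_to_bin_distance(h1, h2):
    """Sum of absolute bin-to-bin differences."""
    return np.abs(h1 - h2).sum()

def intersection_dissimilarity(h1, h2):
    """Number of pixels minus the histogram intersection (D = N - d)."""
    return h1.sum() - np.minimum(h1, h2).sum()

def chi2_distance(h1, h2, eps=1e-9):
    """Modified normalized chi-square test; eps avoids division by zero."""
    denom = np.maximum(np.maximum(h1, h2), eps)
    return ((h1 - h2) ** 2 / denom).sum() / len(h1) ** 2
```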
Edge differences
whole frame, edge based

• This method considers edge images and gray-level information. It is based on
the observation that, during shot breaks, new edges appear far from the old
edges and old edges disappear in locations far from the new edges.

• Cuts are detected by counting the fraction of entering edges (ρ_in) and exiting
edges (ρ_out) in two consecutive frames, using a fixed threshold over a
temporal window.
• Processing steps (a sketch follows this list):
– Perform image smoothing (Gaussian filtering)
– Compute the image gradient and threshold it
– Extract edges (Canny filtering and dilation)
– Detect dissimilarity from the peaks of ρ = max(ρ_in, ρ_out)
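A minimal sketch of the edge change ratio with OpenCV (the Canny thresholds, blur kernel and dilation radius are illustrative assumptions, not values from the slides):

```python
import cv2
import numpy as np

def edge_change_ratio(prev_gray, cur_gray, dilate_radius=6):
    """rho = max(rho_in, rho_out) between two consecutive gray-level frames."""
    kernel = np.ones((2 * dilate_radius + 1,) * 2, np.uint8)
    e_prev = cv2.Canny(cv2.GaussianBlur(prev_gray, (5, 5), 0), 100, 200) > 0
    e_cur = cv2.Canny(cv2.GaussianBlur(cur_gray, (5, 5), 0), 100, 200) > 0
    # Dilate each edge map: an edge is "new" if no old edge lies within r pixels.
    d_prev = cv2.dilate(e_prev.astype(np.uint8), kernel) > 0
    d_cur = cv2.dilate(e_cur.astype(np.uint8), kernel) > 0
    n_prev, n_cur = max(e_prev.sum(), 1), max(e_cur.sum(), 1)
    rho_in = np.logical_and(e_cur, ~d_prev).sum() / n_cur    # entering edges
    rho_out = np.logical_and(e_prev, ~d_cur).sum() / n_prev  # exiting edges
    return max(rho_in, rho_out)
```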
Using subframes

• The use of subframes minimizes the influence of local changes in illumination and
motion: each frame is divided into subframes (typically 16, in a 4×4 grid).

[Figure: corresponding subframes f_{t,i} and f_{t+1,i} of two consecutive frames]
Likelihood ratio
subframe, intensity based

• The likelihood ratio is computed by considering corresponding subframes i
(blocks) of two consecutive frames and the second-order statistics of their
intensity values.

• If $m_i(f_t)$ and $\sigma_i(f_t)$ are respectively the mean value and the variance of
intensity in the i-th block of frame $f_t$ in the sequence, then the likelihood ratio
for a block is defined as:

$$d_i(f, f') = \frac{\left[\dfrac{\sigma_i(f) + \sigma_i(f')}{2} + \left(\dfrac{m_i(f') - m_i(f)}{2}\right)^2\right]^2}{\sigma_i(f)\,\sigma_i(f')}$$

• A sequence break is detected if most of the blocks into which the image
has been partitioned exhibit likelihood ratios greater than a predefined
threshold.
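A sketch of the block likelihood ratio in the standard form above (the voting rule over the blocks is left to the caller):

```python
import numpy as np

def block_likelihood_ratio(block_a, block_b):
    """Likelihood ratio between two corresponding blocks (2-D arrays)."""
    m_a, m_b = block_a.mean(), block_b.mean()
    v_a, v_b = block_a.var(), block_b.var()
    num = ((v_a + v_b) / 2 + ((m_b - m_a) / 2) ** 2) ** 2
    return num / max(v_a * v_b, 1e-9)  # guard against uniform blocks
```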
Bin to bin histogram difference
subframe, histogram based

• The bin-to-bin histogram difference can be computed for each image subframe k;
using N = 9 subframes has been suggested. Cuts are detected by averaging the bin-to-bin
differences computed at each subframe, with appropriate thresholding.
χ2 test
subframe, histogram based

• Corresponding subframes i of consecutive frames $f_t$, $f_{t+1}$ are compared by
considering their color histograms. The χ² equation can thus be rewritten per
subframe as:

$$d_i(f, f') = \sum_{j=0}^{N} \frac{(H_i(f, j) - H_i(f', j))^2}{H_i(f', j)}$$

• The 8 largest difference values are discarded and only the 8 remaining
ones are retained.
Color histogram moment comparison
subframe, histogram based

• This method computes color histogram differences between corresponding
subframes of consecutive frames, plus the statistical moments of the
histogram, up to the third order.

• Each frame in the sequence is partitioned into subframes i. Since
horizontal panning and motion are statistically more frequent, the
density of subframes is set higher in the horizontal direction than in
the vertical one.

• The inter-block difference is then defined as $D_i = \sum_p d_i^p$, summed over
the color channels p. The global difference D is obtained from this measure by
discarding the n worst values. A shot change is detected within a temporal window
centred on t with an amplitude of 5 frames.
• Let $H_i(f_t)$ be the histogram of subframe i for one color channel of RGB in frame $f_t$.
The difference between the corresponding subframes of two consecutive frames f
and f' is defined as follows:

$$d_i(f, f') = \sum_{j=1}^{N} |H_i(f, j) - H_i(f', j)| + a^T |m_i(f) - m_i(f')|$$

where $m_i(f) = [m_1, m_2, m_3]$ is the moment vector of histogram $H_i(f)$ for the color
channel and $a = [a_1, a_2, a_3]$ is the vector of scale parameters.
The scale factor $a_1$ is adaptively tuned depending on the absolute value of $m_1(f')$.

The k-th order raw moment of a histogram H(x) is the average of the k-th power of x,
and the k-th order central moment is the average of the k-th power of the deviation
from the average:

$$\mu_k = \sum_x x^k H(x) \qquad m_k = \sum_x (x - \mu_1)^k H(x)$$

with $\mu_1$ = arithmetic average; $m_2 = \mu_2 - \mu_1^2$ (variance); $m_3 = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3$.
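A sketch computing the moments m1, m2, m3 from a histogram using the relations above (the histogram is normalized first so that the moments are averages):

```python
import numpy as np

def histogram_moments(hist):
    """Return (m1, m2, m3): mean, variance, third central moment."""
    h = hist / hist.sum()
    x = np.arange(len(h), dtype=float)
    mu1, mu2, mu3 = ((x ** k * h).sum() for k in (1, 2, 3))  # raw moments
    m1 = mu1
    m2 = mu2 - mu1 ** 2                     # variance
    m3 = mu3 - 3 * mu2 * mu1 + 2 * mu1 ** 3
    return m1, m2, m3
```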


Remarks and comments on cut detection

• Histogram-based methods vs pixelwise
– Histogram-based methods minimize the sensitivity to camera movements (such
as panning and zooming) and are not significantly affected by histogram dimensionality.
They offer better performance than intensity-based pixelwise methods.
– Obtainable precision values are close to 90%.
– Most histogram-based solutions are, however, sensitive to fast camera
movements, large moving objects, and fast-moving objects. Abrupt changes of
brightness also have a negative impact on algorithm performance.

• Histogram intersection
– Histogram intersection is the simplest approach among the histogram-based methods
and requires little computational effort. It may lead to wrong estimations, since exchanging
pixel positions leaves the histogram unchanged while the image pattern may
vary largely.
– In non-critical cases, histogram intersection is to be preferred to the χ² test. If
the number of color codes is high and the L*u*v* or MTM color space is used, it
outperforms the χ² test method.

χ2 test method
− The χ2 test, like the pointwise absolute difference method, gives false cut detections
in scenes where fast motion is present. This is mainly due to the fact that a two-frame
window is used.
• Color histogram moments method
– The method based on color histogram moments uses a window of five frames to
observe changes in brightness-histogram differences, with an adaptive threshold.
A comparative analysis has shown superior performance.
– Misses and false detections of this method occur in the presence of very dark
shots or very fast motion (a large object that rapidly obscures the camera
view within 3 to 5 frames).

• Edge-change method
– The performance of the edge-change method is ruled by three parameters:
• the edge detector smoothing factor;
• the edge detector threshold;
• the radius r of the neighbourhood in which ρ is evaluated.
Low values of r make the algorithm very sensitive to shifts in edges due to
noise and non-rigid motion. Large values of r lower the values of the ρ parameter,
which makes cut detection more difficult and unstable.
– The edge-change method is strongly impaired by low contrast between two
consecutive frames.
• Locally adaptive vs global fixed thresholding
– The choice of threshold is a critical point for almost all of these techniques. Setting
appropriate thresholds may require a pre-analysis of the video to be segmented.
– Global thresholding, which computes statistics over the whole video, fails in the
presence of a large variety of behaviors and is usually inadequate.
– Local thresholding improves performance (a sketch follows this list): e.g., a window
is centered around each frame and the mean value is calculated. The threshold at any
frame is then calculated as a multiple of the local window average and a constant
factor k, dependent on the frame difference.
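A sketch of local thresholding over a sliding window (the window size and the factor k are illustrative assumptions):

```python
import numpy as np

def local_cut_detection(frame_diffs, window=25, k=3.0):
    """Flag frame t as a cut when its difference exceeds k times the
    mean difference inside a window centered on t."""
    half, cuts = window // 2, []
    for t, d in enumerate(frame_diffs):
        lo, hi = max(0, t - half), min(len(frame_diffs), t + half + 1)
        if d > k * np.mean(frame_diffs[lo:hi]):
            cuts.append(t)
    return cuts
```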

• Full frame vs subframe
– Full-frame based methods: very resistant to motion, but tend to be poor at detecting
changes between similar shots.
– Subframe-based methods: minimize the influence of local changes in illumination and
motion; adequately discriminant; the choice of block size influences behavior.
Cut detection - compressed domain
For JPEG encoded video

[Figure: the JPEG processing chain. The RGB image is converted to YCrCb and partitioned into 8×8 blocks; each block is transformed with an 8×8 DCT and quantized; the DC coefficient is coded differentially with respect to the previous block; the result is run-length (RLE) and Huffman encoded]
• Shot boundaries can be detected using the DCT coefficients of JPEG-compressed
video:
– For each video frame, a subset of the 8×8 pixel blocks is considered.
– For each block, only a subset of the 64 DCT coefficients (the most significant
coefficients) is taken. These DCT coefficients are considered representatives of
the frame content.
– Cuts are detected by evaluating the normalized inner product between the coefficient
vectors $c_f$, $c_{f+k}$ of two frames shifted by k on the temporal axis:

$$d(f, f+k) = 1 - \frac{c_f \cdot c_{f+k}}{\|c_f\|\,\|c_{f+k}\|}$$
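A sketch of this dissimilarity measure, assuming the DCT coefficient vectors have already been extracted from the compressed stream:

```python
import numpy as np

def dct_dissimilarity(c_f, c_fk):
    """1 minus the normalized inner product of two DCT coefficient vectors."""
    denom = np.linalg.norm(c_f) * np.linalg.norm(c_fk)
    return 1.0 - np.dot(c_f, c_fk) / max(denom, 1e-12)
```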
For MPEG encoded video

• MPEG suggests that encoding the differences between adjacent still
pictures is a fruitful approach to compression. It assumes that:
– a moving picture is simply a succession of still pictures;
– the differences between adjacent still pictures are generally small.

• Main MPEG features:
– Transform-domain-based compression (intra-frame coding)
o DCT, quantization and run-length encoding
– Block-based motion compensation
o Similar blocks of pixels common to two or more successive frames are
replaced by a pointer (motion vector) that references one of the blocks.
o 16x16 pixels Macroblocks (MBs)
o Predictive Encoding is done with reference to an anchor frame
– Interpolative techniques (inter-frame coding)
o Bidirectional interpolation (forward-predicted and backward-predicted)
MPEG GoP

• A video sequence is divided into Groups of Pictures (GOPs).
The smaller the GoP, the better the performance with respect to motion, although
compression is lower (more I frames are present):
– Four types of frames:
• I (intra coded)
• P (predictive forward coded)
• B (bi-directional coded)
• D frames

• I and P frames are anchor frames:
– I frames have no reference to other frames
– P frames have a forward reference to I or P frames
– D frames only use the DC component (low resolution, rarely used)
P frames are at distance M from the preceding anchor frame, and consecutive
I frames are at distance N; N is typically a multiple of M.

Example: M=3, N=9 (display order I B B P B B P B B I …)
MPEG macroblocks and frames

• Video frames are grouped into macroblocks, each covering a region of 16×16
pixels (e.g., a 64×64-pixel frame contains 16 macroblocks). Macroblocks are
necessary for motion compensation.
• Three types of macroblocks are possible:
– I: encoded independently of other macroblocks
– P: encode not the region itself, but the motion vector and the error block with
respect to the previous frame
– B: same as above, except that the motion vector and error block may be encoded
from the previous or the next frame

Skipped macroblocks encode the case of zero motion (the macroblock of the previous
frame is copied).

• Frames, in their turn, have types I, P, B.
Different frame types contain different macroblocks:
– P frames: intra-coded MBs or forward-predicted MBs
– B frames: intra-coded, forward- and/or backward-predicted MBs, or skipped MBs

Example: the match of the shaded macroblock of the current frame is found in position
(24, 4) of the previous frame. The motion vector for the current macroblock is then (8, -4).
[Figure: block motion compensation for a B frame. The best-matching macroblocks in the previous and next anchor frames define the forward and backward motion vectors]

• Each macroblock is encoded separately for luminance and chrominance components.


[Figure: the block diagram of the MPEG encoder processing chain for I, P and B frames]
Using I-frame Histograms

• Exploits the fact that MPEG frames and macroblocks may be of type I, B or P.

• Extracts the I frames from the MPEG video. For each I frame, it evaluates a
histogram built from the first (DC) coefficient of each 8×8 DCT block.
Histograms of consecutive I frames are compared according to a statistical
test.

• Experiments suggest that the χ² test provides the most satisfactory solution.
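A sketch of the I-frame comparison, assuming the DC coefficients (one per 8×8 block) have already been extracted from the stream; the bin count and DC value range are illustrative assumptions:

```python
import numpy as np

def dc_histogram(dc_coeffs, bins=64, dc_range=(0, 2048)):
    """Histogram of the DC coefficients of one I-frame."""
    hist, _ = np.histogram(dc_coeffs, bins=bins, range=dc_range)
    return hist.astype(float)

def chi2_iframe_distance(dc_a, dc_b):
    """Chi-square distance between the DC histograms of two I-frames."""
    h1, h2 = dc_histogram(dc_a), dc_histogram(dc_b)
    return ((h1 - h2) ** 2 / np.maximum(h2, 1e-9)).sum()
```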
Using MPEG motion vectors and pairwise comparison
• Exploits the fact that, in MPEG, each B frame is predicted and
interpolated from its preceding and succeeding I and P frames
by using motion compensation algorithms; only the residual error is
encoded. If there is a large discontinuity between two frames, this
causes large residual errors in the frame blocks. MPEG directly
transforms the original pixel values into DCT coefficients.

• The presence of a small number of motion vectors in B frames is used
as a clue for detecting video cuts. The pairwise comparison technique is
then applied to the DCT coefficients of I frames.

• Processing is reduced by approximately 1/6 with respect to using
JPEG-compressed video.
Remarks on cut detection in the compressed domain

• Video segmentation based on MPEG is sensitive to the MPEG encoder that is
used. Encoders may:
– use different methods to perform motion compensation,
– calculate motion vectors differently,
– use different quantization tables for DCT coefficient compression,
– use a preferred direction for predictive encoding.

• Results of the comparison of MPEG algorithms by Kasturi ['96] and Boreczky and
Kasturi ['96] indicate that they have a much higher rate of false detections when
dealing with cuts, compared to histogram-based algorithms in the uncompressed
domain. Moreover, their computational cost is the highest of all the algorithms.
Fades, dissolve, wipe and matte detection

• Fades and dissolves make the boundary between two shots spread
across a number of frames. They therefore have both starting and
ending frames:
– Fading is an optical process that determines the progressive darkening of
a shot until the last frame becomes completely black (fade-out) or, conversely,
the gradual transition from black to full light (fade-in).
– A dissolve is a superimposition of a fade-out and a fade-in: the first shot fades
out to black while the following one fades in to full light.

• If made with optical machines, these effects come in sequences of standard
duration (16, 24, 32, 48 or 96 frames) with a linear variation of pixel brightness.
With electronic equipment these effects can be very fast (similar to cuts).

• A semantic meaning is associated with fades and dissolves:
– In movies, fades reflect a change of context (a sharp change of place or time; the
end of an episode).
– Dissolves are used in movies like a "….." in a text: they convey the idea of shifting
the action in time and place and are commonly used for flashbacks.
In documentaries, they smooth changes from one description to another
and make the presentation flow.
[Figure: a fade-out followed by a fade-in with semantic meaning (falling asleep and, after a while, waking up), from the movie "Chinatown"; examples of dissolves (moderate-length and fast dissolve)]
Histogram differences and twin thresholding

• Gradual transitions are usually detected using a twin-thresholding
mechanism together with the same histogram difference metric used for cut detection.

• Two thresholds are used, to detect cuts (the higher, τb) and special
effects (the lower, τs):
– A cut is detected whenever the τb threshold is exceeded.
– If the τs threshold is exceeded and τb is not, the frame at which this
happens is identified as a potential starting frame of a gradual transition.

• Threshold τb is set automatically, according to video statistics.
Threshold τs does not vary much among different video sources;
a suggestion is to assume for this threshold a value considerably
greater than the mean value of the frame-to-frame difference (from 8 to
10). Setting tolerance values around the thresholds may add robustness
to this technique.
• Twin comparison method (a sketch follows this list):
– If τb < diff: a cut is detected.
– If τs < diff < τb: accumulate the differences in δ.
– If diff < τs: do nothing.
– If the accumulated value δ is greater than τb, a gradual change is detected.
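A sketch of that accumulation loop (the two thresholds must be supplied; frame_diffs is the per-frame histogram difference series):

```python
def twin_comparison(frame_diffs, tau_b, tau_s):
    """Return cut frames and (start, end) spans of gradual transitions."""
    cuts, gradual = [], []
    start, acc = None, 0.0
    for t, d in enumerate(frame_diffs):
        if d > tau_b:
            cuts.append(t)                  # sharp transition (cut)
            start, acc = None, 0.0
        elif d > tau_s:
            if start is None:
                start = t                   # potential start of a gradual change
            acc += d
        else:
            if start is not None and acc > tau_b:
                gradual.append((start, t))  # accumulated change exceeds tau_b
            start, acc = None, 0.0
    return cuts, gradual
```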
Production model
• A mathematical approximation is used for fades and dissolves, which are
modeled as chromatic scaling operations. If $G(x,y,t)$ is a gray-scale sequence
and $l_s$ is the length of the effect, a fade-out and a fade-in are respectively
modeled as:

$$E(x,y,t) = G(x,y,t)\left(1 - \frac{t}{l_s}\right) + \vec{0} \qquad t \in [t_0, t_0 + l_s]$$

$$E(x,y,t) = \vec{0} + G(x,y,t)\,\frac{t}{l_s}$$

where $\vec{0}$ represents black.

• The first-order difference image, obtained by differentiating the model
equation, is a constant image, proportional to the fade rate.
• The presence of a constant image is used to detect the chromatic change
associated with fading (a sketch follows).
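A sketch of the constant-difference test implied by the model (the tolerances on the spread and on the minimum fade rate are illustrative assumptions):

```python
import numpy as np

def looks_like_fade_step(frame_a, frame_b, max_spread=4.0, min_rate=0.5):
    """True if the first-order difference image is approximately constant
    and non-zero, as the production model predicts during a fade."""
    diff = frame_b.astype(np.float32) - frame_a.astype(np.float32)
    return diff.std() < max_spread and abs(diff.mean()) > min_rate
```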
Production model with full color information

• Exploiting full color information allows one to distinguish between fades,
dissolves and other gradual transition effects. Since the most important
parameter of a fade is the linear change of pixel brightness, the
basic idea is to adopt color spaces that separate brightness from color
and do not present instability (e.g., L*u*v*). In this case, during a fade,
while L* changes, the values of u* and v* remain approximately constant.

• The algorithm for fade detection is based on verifying a pseudo-linear
variation of the L* values and the constancy of the u* and v* values.
Wipes

• Wipes are a category of effects in which an image, the last frame of a shot,
is progressively pushed out of the screen by the appearing one, which is the
first frame of the following shot. They can be distinguished as horizontal,
vertical and flip wipes.

• Wipes are generally fast transition effects (10-15 frames) and therefore
create, during the effect, a large inter-frame difference. This phenomenon
typically generates a train of peaks in the cut detection measure, spanning
the duration of the effect, which can be used as a clue to reveal the presence
of wipes (a sketch follows).

[Figure: train of peaks in the cut detection measure during a wipe]
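A sketch of peak-train detection (the run-length bounds are illustrative assumptions matching the 10-15 frame duration mentioned above):

```python
def find_peak_trains(frame_diffs, tau_s, min_len=8, max_len=20):
    """Return (start, end) spans where the difference stays above tau_s
    for a wipe-like duration (a train of consecutive peaks)."""
    spans, start = [], None
    for t, d in enumerate(frame_diffs):
        if d > tau_s and start is None:
            start = t
        elif d <= tau_s and start is not None:
            if min_len <= t - start <= max_len:
                spans.append((start, t))
            start = None
    return spans
```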
Mattes

• Mattes are a progressive darkening of the picture by a dark mask of
varying shape. Typically this transition is as fast as a wipe.

• Accordingly, mattes can be detected in the same way as wipes, by
checking for the presence of a train of peaks in the cut detection measure.
To distinguish mattes, the central frame of the transition can be converted to
gray levels and its histogram H(x) analyzed.

[Figure: gray-level histogram H(x) of the central frame of a matte transition]
