A31 Video I: Segmentation
Video segmentation
• Segmentation is the process of breaking a video into its constituent basic
elements, the shots, and into their higher-level aggregates, such as episodes
or scenes.
• A common definition of shot is: “a sequence of frames that was (or appears
to be) continuously captured from the same camera”. A shot-break is the
transition from one shot to the next. Shot segmentation is therefore the
process of detecting transitions between two consecutive shots.
• The traditional approach to segmentation is to preview the whole video and
then annotate shots and their boundaries with textual labels. A fully manual
segmentation of a movie may require approximately 10 hours of work for one
hour of video.
Types of edit effects:
– Hard cut
– Dissolve (combined fade-out / fade-in)
– Wipe
– Matte
• Edit effects are used differently in different types of video:
– Sports video: shots with large camera zoom-ins; shots with large, fast-moving
objects (close-ups); shots with invasive edit effects (partial mattes).
– News video: shots with little motion (only part of the frame, approximately 1/4);
shots with almost no motion.
– Commercials: different edit effects and shots of different duration,
depending on targets and goals.
Examples from commercials:
– Telecom: 14 shots (9 cuts, 4 horizontal wipes, 1 flip wipe); 1 very fast shot
(5 frames); 1 shot with large motion; 12 shots with little or almost no motion.
– Golia: 29 shots (29 cuts); 27 very fast shots (5 frames or less); 3 shots with
fast motion; 1 shot with almost no motion.
– Kia: 12 shots (12 cuts); 2 shots with large, fast-moving objects; 2 static shots.
– Findus: 10 shots (10 cuts); 2 shots with large camera zoom-in; 1 shot with
camera rotation; 7 shots with little or almost no motion.
• Methods for edit effect detection and shot segmentation work either in the
uncompressed or in the compressed domain
– In the uncompressed domain, solutions are based on the evaluation of a similarity
measure between successive images. When two images are sufficiently dissimilar, there
may be a cut. Gradual transitions are found by using cumulative difference measures.
– In the compressed domain methods do not perform decoding/re-encoding, but exploit
the fact that the encoded video stream already contains a rich set of precomputed
features, such as motion vectors (MVs) and block averages (DC coefficients), that can
be used for temporal video segmentation.
• A cut is defined as a sharp transition between one shot and the one
following. It is obtained by simply joining two different shots without the
insertion of any other photographic effect.
Cut
• Therefore, cuts between shots with small motion and constant illumination can
be easily detected by looking for sharp brightness changes. The principle
behind this approach is that, since two consecutive frames in a shot do not
change significantly in their background and object content, their overall
brightness distribution differs little.
Inter-frame pixel difference between frame $f_t$ and frame $f_{t+1}$, of size $X \times Y$:
$$D(t, t+1) = \frac{\sum_{x=1}^{X} \sum_{y=1}^{Y} \left| I_{xy}(f_t) - I_{xy}(f_{t+1}) \right|}{XY}$$
– Compute the normalized sum $S_t$ of pixel intensity values for each frame $f_t$ of
size $M \times N$:
$$S_t = \frac{\sum_{x=0}^{M-1} \sum_{y=0}^{N-1} I_{xy}(f_t)}{MN}$$
– Evaluate the inter-frame difference $D_{cut}$ between frames $f_{t-1}$, $f_t$ and $f_{t+1}$ in the
following manner:
$$D_{cut} = \frac{S_t - S_{t+1}}{S_{t-1} - S_t}$$
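As a concrete illustration, a minimal NumPy sketch of the normalized pixel-difference measure described above; the function names and the fixed threshold value are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def frame_difference(f_t, f_t1):
    """Normalized sum of absolute pixel differences D(t, t+1) (grey-scale frames)."""
    X, Y = f_t.shape
    return np.abs(f_t.astype(np.int32) - f_t1.astype(np.int32)).sum() / (X * Y)

def detect_cuts(frames, threshold=30.0):
    """Return indices t where D(t, t+1) exceeds a fixed (hypothetical) threshold."""
    return [t for t in range(len(frames) - 1)
            if frame_difference(frames[t], frames[t + 1]) > threshold]
```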
Color histogram comparison
(full frame, color histogram based)
• The color histograms of two consecutive frames are compared through their
intersection $A \cap B$: the dissimilarity is $D = N - d$, where $d$ is the size of the
intersection and $N$ the total histogram count. A cut is flagged when $D$ exceeds an
adaptive threshold $\tau = \mu + \alpha\sigma$, derived from the mean $\mu$ and standard
deviation $\sigma$ of the observed inter-frame differences.
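A minimal sketch of this intersection measure with the adaptive threshold $\tau = \mu + \alpha\sigma$, assuming grey-level histograms of equal-size frames; the bin count and the value of α are illustrative assumptions.

```python
import numpy as np

def intersection_difference(f1, f2, bins=64):
    """D = N - d, where d = |A ∩ B| is the histogram intersection of two frames."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    d = np.minimum(h1, h2).sum()          # pixels shared by the two histograms
    return float(f1.size - d)             # N is the frame's pixel count

def adaptive_threshold(diffs, alpha=3.0):
    """tau = mu + alpha * sigma over the observed inter-frame differences."""
    diffs = np.asarray(diffs, dtype=float)
    return float(diffs.mean() + alpha * diffs.std())
```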
Normalized χ² test
• The normalized χ² test amplifies the distance between color histogram bins of two
consecutive frames:
$$d(f, f') = \sum_{j=0}^{N} \frac{\left(H(f, j) - H(f', j)\right)^2}{H(f', j)}$$
• Measures are not taken at full video rate, but instead at sampled frames (typically from 3
to 10 frames per second)
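A corresponding sketch of the normalized χ² distance on grey-level histograms, evaluated on sampled frames rather than at full rate as the text suggests; the bin count and sampling step are illustrative assumptions.

```python
import numpy as np

def chi_square_distance(f, f_next, bins=64):
    """Normalized chi-square distance between the histograms of two frames."""
    h1, _ = np.histogram(f, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f_next, bins=bins, range=(0, 256))
    denom = np.where(h2 > 0, h2, 1)       # guard empty bins against division by zero
    return float((((h1 - h2) ** 2) / denom).sum())

# Compare sampled frames only, e.g. every 5th frame of a 25 fps video (~5 fps):
# distances = [chi_square_distance(frames[i], frames[i + 5])
#              for i in range(0, len(frames) - 5, 5)]
```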
Edge change ratio
• This method considers edge images and grey-level information. It is based on
the consideration that during scene breaks new edges appear far from the old
edges, and old edges disappear in locations far from the new edges.
• Cuts are detected by counting the number of entering edges (ρ in) and exiting
edges (ρ out) in two consecutive frames, using a fixed threshold over a
temporal window.
• Processing steps:
– Perform image smoothing (Gaussian filtering)
– Compute image gradient and threshold
– Extract edges (Canny filtering and dilation)
– Detect dissimilarity from the peaks of ρ = max(ρin, ρout)
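A sketch of these steps with OpenCV, assuming uint8 grey-scale frames; the Gaussian and Canny parameters and the dilation radius r are illustrative choices, not values prescribed by the method.

```python
import cv2
import numpy as np

def edge_change_ratio(f_prev, f_next, r=5):
    """rho = max(rho_in, rho_out) between two consecutive grey-scale frames."""
    kernel = np.ones((r, r), np.uint8)    # neighbourhood of (approximate) radius r
    e_prev = cv2.Canny(cv2.GaussianBlur(f_prev, (5, 5), 1.5), 50, 150) > 0
    e_next = cv2.Canny(cv2.GaussianBlur(f_next, (5, 5), 1.5), 50, 150) > 0
    d_prev = cv2.dilate(e_prev.astype(np.uint8), kernel) > 0
    d_next = cv2.dilate(e_next.astype(np.uint8), kernel) > 0
    n_prev, n_next = max(e_prev.sum(), 1), max(e_next.sum(), 1)
    rho_in = (e_next & ~d_prev).sum() / n_next    # entering edges, far from old ones
    rho_out = (e_prev & ~d_next).sum() / n_prev   # exiting edges, far from new ones
    return float(max(rho_in, rho_out))
```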
Using subframes
[figure: two consecutive frames partitioned into corresponding subframes $f_{t,i}$ and $f_{t+1,i}$]
Likelihood ratio
subframe, intensity based
• If $m_i(f_t)$ and $\sigma_i(f_t)$ are respectively the mean value and the variance of the
intensity in the i-th block of frame $f_t$ in the sequence, then the likelihood ratio for
a block is defined as:
$$d_i(f, f') = \frac{\left[\dfrac{\sigma_i(f) + \sigma_i(f')}{2} + \left(\dfrac{m_i(f') - m_i(f)}{2}\right)^2\right]^2}{\sigma_i(f)\,\sigma_i(f')}$$
• A sequence break is detected if most of the blocks into which the image
has been partitioned exhibit likelihood ratios greater than a predefined
threshold.
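A block-wise sketch of this test, assuming grey-scale frames; the grid size, ratio threshold and voting fraction are illustrative assumptions.

```python
import numpy as np

def block_likelihood_ratio(b1, b2):
    """Likelihood ratio d_i(f, f') for one pair of corresponding blocks."""
    m1, m2 = b1.mean(), b2.mean()
    s1, s2 = b1.var(), b2.var()
    num = ((s1 + s2) / 2 + ((m2 - m1) / 2) ** 2) ** 2
    return num / max(s1 * s2, 1e-9)       # guard against flat (zero-variance) blocks

def cut_by_blocks(f1, f2, grid=(4, 4), ratio_thr=3.0, vote_thr=0.5):
    """Declare a break if most blocks exceed the (hypothetical) ratio threshold."""
    H, W = f1.shape
    bh, bw = H // grid[0], W // grid[1]
    votes = [block_likelihood_ratio(f1[i*bh:(i+1)*bh, j*bw:(j+1)*bw],
                                    f2[i*bh:(i+1)*bh, j*bw:(j+1)*bw]) > ratio_thr
             for i in range(grid[0]) for j in range(grid[1])]
    return np.mean(votes) > vote_thr
```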
Bin-to-bin histogram difference
(subframe, histogram based)
• The bin-to-bin histogram difference can be computed for each image subframe k;
N = 9 subframes have been suggested. Cuts are detected by averaging the bin-to-bin
differences computed at each subframe and applying an appropriate threshold to the
resulting difference D.
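A sketch using the suggested N = 9 subframes (a 3×3 grid) on grey-scale frames; the bin count is an illustrative assumption.

```python
import numpy as np

def subframe_bin_difference(f1, f2, n=3, bins=64):
    """Average bin-to-bin histogram difference over an n x n grid of subframes."""
    H, W = f1.shape
    sh, sw = H // n, W // n
    diffs = []
    for i in range(n):
        for j in range(n):
            a = f1[i*sh:(i+1)*sh, j*sw:(j+1)*sw]
            b = f2[i*sh:(i+1)*sh, j*sw:(j+1)*sw]
            ha, _ = np.histogram(a, bins=bins, range=(0, 256))
            hb, _ = np.histogram(b, bins=bins, range=(0, 256))
            diffs.append(np.abs(ha - hb).sum())
    return float(np.mean(diffs))          # compare D against a threshold to flag cuts
```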
χ² test
(subframe, histogram based)
• The χ² test can likewise be applied to each subframe and the results combined.
Color histogram moments
(subframe, histogram based)
• The distance between two frames is computed from the moment vectors of the
subframe histograms, where $m_i(f) = [m_1, m_2, m_3]$ is the moment vector of histogram
$H_i(f_t)$ for the color channel and $a = [a_1, a_2, a_3]$ is the vector of scale parameters.
The scale factor $a_1$ is adaptively tuned depending on the absolute value of $m_1(f')$.
• The k-th order moment is defined as the average of the k-th power of the deviation
from the average:
$$\mu_1 = \sum_x x\,H(x), \qquad m_k = \sum_x \left(x - \mu_1\right)^k H(x)$$
Histogram intersection
– Histogram intersection is the simplest approach among the histogram-based methods
and requires low computational effort. It may lead to wrong estimations, since
exchanging pixel positions leaves the histogram unchanged while the image pattern
may vary largely.
– In non-critical cases, histogram intersection is to be preferred to the χ² test. If
the number of color codes is high and the L*u*v* or MTM color space is used, it
outperforms the χ² test method.
χ2 test method
− The χ2 test, like the pointwise absolute difference method, gives false cut detections
in scenes where fast motion is present. This is mainly due to the fact that a two-frame
window is used.
• Color histogram moments method
– The method based on color histogram moments uses a window of five frames to
observe changes in brightness-histogram differences with an adaptive threshold.
A comparative analysis has shown superior performance.
– Misses and false detections of this method occur in the presence of very dark
shots or very fast motion (a large object that rapidly obscures the camera
view within 3 to 5 frames).
• Edge-change method
– The edge-change method performance is ruled by three parameters:
• the edge detector smoothing factor;
• the edge detector threshold;
• the radius r of the neighbourhood in which ρ is evaluated.
Low values of r make the algorithm very sensitive to shifts in edges due to
noise and non-rigid motion. Large values of r cause the ρ parameter to take
lower values, which makes cut detection more difficult and unstable.
– The edge-change method is strongly impaired by the presence of low contrast
between two consecutive frames.
• Locally adaptive vs global fixed thresholding
– The choice of threshold is a critical point for almost all of the techniques. Setting
appropriate thresholds may require a pre-analysis of the video to be segmented.
– Global thresholding, which computes statistics over the whole video, fails in the
presence of a large variety of behaviors and is usually inadequate.
– Local thresholding improves performance: e.g. a window is centered around
each frame and the mean difference value is calculated. The threshold at any frame
is then calculated as a multiple of the local window average by a constant factor k,
dependent on the frame-difference measure.
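A sketch of such local thresholding over a precomputed series of frame differences; the window size and the factor k are illustrative assumptions.

```python
import numpy as np

def adaptive_cut_flags(diffs, window=10, k=3.0):
    """Flag frame differences exceeding a locally adaptive threshold.

    The threshold at frame t is a multiple k of the mean difference inside
    a window centred on t (k and the window size are hypothetical values).
    """
    diffs = np.asarray(diffs, dtype=float)
    flags = []
    for t in range(len(diffs)):
        lo, hi = max(0, t - window), min(len(diffs), t + window + 1)
        local_mean = diffs[lo:hi].mean()
        flags.append(diffs[t] > k * local_mean)
    return flags
```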
JPEG processing chain
[figure: RGB → YCrCb conversion; the image is partitioned into 8×8 blocks; each block
is transformed into 8×8 DCT coefficients; the DC coefficient is coded differentially
with respect to the DC coefficient of the previous block]
• Shot boundaries can be detected using the DCT coefficients of JPEG compressed
video:
– For each video frame, a subset of the 8×8 pixel blocks is considered.
– For each block, only a subset of the 64 DCT coefficients (the most significant
coefficients) is taken. These DCT coefficients are considered as representatives of
the frame content.
– Cuts are detected by evaluating the normalized inner product between the coefficient
vectors $c_f$, $c_{f+k}$ of two frames shifted by k on the temporal axis:
$$d(f, f+k) = 1 - \frac{c_f \cdot c_{f+k}}{\|c_f\|\,\|c_{f+k}\|}$$
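A sketch of this distance, assuming the coefficient subsets have already been extracted from the compressed stream (bitstream parsing is codec-specific and not shown here).

```python
import numpy as np

def dct_vector_distance(c_f, c_fk):
    """d(f, f+k) = 1 - normalized inner product of two DCT coefficient vectors."""
    num = float(np.dot(c_f, c_fk))
    den = float(np.linalg.norm(c_f) * np.linalg.norm(c_fk)) or 1.0
    return 1.0 - num / den                # near 0 within a shot, near 1 across a cut
```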
For MPEG encoded video
• Skipped macroblocks encode the case of zero motion (the macroblock of the previous
frame is copied).
[figure: block motion compensation — the best matching macroblock is searched for in
the reference frames of P-frames and B-frames]
Using I-frame Histograms
• Exploits the fact that MPEG frames and macroblocks may be of type I, B or P.
• I-frames are extracted from the MPEG video. For each I-frame, a histogram is
evaluated by considering the first coefficient (the DC coefficient) of each 8×8
DCT block. Histograms of consecutive I-frames are compared according to a
statistical test.
• Experiments suggest that the χ² test provides the most satisfactory solution.
Using MPEG motion vectors and pairwise comparison
• Exploits the fact that in MPEG each B-frame is predicted and
interpolated from its preceding and succeeding I-frames and P-frames
by using motion compensation algorithms; only the residual error is
encoded. If there is a large discontinuity between two frames, this
causes large residual errors in the frame blocks. In that case, MPEG
directly transforms the original pixel values into DCT coefficients
(intra-coding).
• The MPEG standard does not constrain the encoder implementation that is
used. It may:
– use different methods to perform motion compensation,
– calculate motion vectors,
– use different quantization tables for DCT coefficient compression,
– use a preferred direction for predictive encoding.
• Results of the comparison of MPEG algorithms by Kasturi ['96] and Boreczky and
Kasturi ['96] indicate that they have a much higher rate of false detections when
dealing with cuts, compared to histogram-based algorithms in the non-compressed
domain. Moreover, their computational cost is the highest of all the algorithms.
Fades, dissolve, wipe and matte detection
• Fades and dissolves spread the boundary between two shots across a number
of frames; they therefore have both starting and ending frames:
– Fading is an optical process, which determines the progressive darkening of
a shot until the last frame becomes completely black (fade-out) or, the
opposite, allowing the gradual transition from black to light (fade-in).
– Dissolve is a superimposition of a fade-out and a fade-in: the first shot fades
out to black while the following fades in to full light.
• Two thresholds are used to detect cuts (the higher τb ) and special
effects (the lower τs).
– A cut is detected whenever the τb threshold is exceeded.
– If the τs threshold is exceeded, and τb is not, then the frame at which this
happens is identified as a potential starting frame for gradual transition.
− If diff > τb: a cut is detected.
− If τs < diff < τb: the differences are accumulated in δ.
− If diff < τs: nothing happens.
− If the accumulated value δ becomes greater than τb, a gradual transition is detected.
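A sketch of this twin-comparison logic over a precomputed difference series; resetting the accumulator when the difference falls below τs follows the description above.

```python
def twin_comparison(diffs, tau_s, tau_b):
    """Classify cuts and gradual transitions from a frame-difference series."""
    cuts, graduals = [], []
    acc, start = 0.0, None
    for t, d in enumerate(diffs):
        if d > tau_b:                     # sharp change: declare a cut
            cuts.append(t)
            acc, start = 0.0, None
        elif d > tau_s:                   # candidate gradual transition
            if start is None:
                start = t
            acc += d
            if acc > tau_b:               # accumulated change as large as a cut
                graduals.append((start, t))
                acc, start = 0.0, None
        else:                             # below both thresholds: reset
            acc, start = 0.0, None
    return cuts, graduals
```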
Production model
• A mathematical approximation is used for fades and dissolves, which are modeled
as chromatic scaling operations. If G(x,y,t) is a grey-scale sequence and $l_s$ the
length of this sequence, a fade-out and a fade-in are respectively modeled as:
$$E(x,y,t) = G(x,y,t)\left(1 - \frac{t}{l_s}\right) + \vec{0}, \qquad t \in [t_0,\ t_0 + l_s]$$
$$E(x,y,t) = \vec{0} + G(x,y,t)\,\frac{t}{l_s}$$
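A sketch that synthesizes these production models with NumPy, assuming float grey-scale sequences G of shape (T, H, W) and measuring t from the start of the effect; this kind of synthetic data is useful for evaluating detectors.

```python
import numpy as np

def fade_out(G, ls):
    """E(x,y,t) = G(x,y,t) * (1 - t/ls): chromatic scaling down to black."""
    t = np.arange(ls).reshape(-1, 1, 1)   # t measured from the start of the effect
    return G[:ls] * (1.0 - t / ls)

def fade_in(G, ls):
    """E(x,y,t) = G(x,y,t) * (t/ls): chromatic scaling up from black."""
    t = np.arange(ls).reshape(-1, 1, 1)
    return G[:ls] * (t / ls)

def dissolve(G1, G2, ls):
    """Dissolve: superimposed fade-out of the first shot and fade-in of the second."""
    return fade_out(G1, ls) + fade_in(G2, ls)
```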
• Wipes are a category of effects in which an image, the last frame of a shot,
is progressively pushed out of the screen by the appearing one, which is the
first frame of the following shot. They can be distinguished as horizontal,
vertical and flip wipes.
• Wipes are generally fast transition effects (10-15 frames) and therefore
create a large inter-frame difference during the effect. This phenomenon
typically generates a train of peaks in the cut detection measure, spanning
the duration of the effect, which can be used as a clue to reveal the presence
of wipes.
Mattes
• Accordingly, mattes can be detected in the same way as wipes, by checking
for the presence of a train of peaks in the cut detection measure. To
distinguish mattes, the central frame of the transition can be converted to
grey level and its histogram H(x) analyzed.