Such a system could perform the task of telling you briefly what happened “in the meantime”. Many more innovative applications could be built around the basic video abstracting technique. Let us now turn to the algorithms and tools we use to produce a digital video abstract automatically.
• Consecutive shots with similar color content are grouped into the same scene, since a change of camera angle usually has no influence on the main background colors.
• In different scenes the audio usually differs significantly. Therefore, a video cut not
accompanied by an audio cut does not establish a scene boundary.
• A third heuristic groups consecutive shots into a scene if the shots can be identified as representing a dialog (see dialog detection below). A sketch combining all three heuristics follows this list.
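For illustration, the three heuristics could be combined roughly as in the following Python sketch. The Shot structure, the histogram-intersection similarity measure, and all thresholds are assumptions of the sketch, not the exact rules used in our system:

from dataclasses import dataclass
from typing import List

@dataclass
class Shot:
    start: float              # start time in seconds
    end: float                # end time in seconds
    color_hist: List[float]   # normalized main-color histogram
    in_dialog: bool           # set later by dialog detection

def color_similar(h1, h2, threshold=0.8):
    # Histogram intersection as a simple color-similarity measure.
    return sum(min(a, b) for a, b in zip(h1, h2)) > threshold

def near_audio_cut(t, audio_cuts, eps=0.5):
    # Is there a significant audio change within eps seconds of time t?
    return any(abs(t - c) < eps for c in audio_cuts)

def group_into_scenes(shots, audio_cuts):
    """Group consecutive shots into scenes using the three heuristics."""
    if not shots:
        return []
    scenes = [[shots[0]]]
    for prev, cur in zip(shots, shots[1:]):
        same_scene = (
            color_similar(prev.color_hist, cur.color_hist)   # similar background colors
            or not near_audio_cut(cur.start, audio_cuts)     # sound continues across the cut
            or (prev.in_dialog and cur.in_dialog)            # both shots belong to a dialog
        )
        if same_scene:
            scenes[-1].append(cur)
        else:
            scenes.append([cur])
    return scenes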
Audio cuts. Audio cuts are defined as points in time that delimit periods of similar sound. They are employed to explore the similarity of the audio tracks of different shots. If there is no significant change in the audio track close to a video shot boundary, i.e., if the sound continues across the boundary, we consider both shots to belong to the same scene.
Audio cuts are determined by calculating the frequency and intensity spectrum for each time window of the audio track, predicting its values for the next time window by exponential smoothing, and declaring an audio cut where the current frequency and intensity spectrum deviate considerably from the prediction.
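A minimal sketch of this procedure, assuming a mono sample array as input; the window length, the smoothing factor, and the deviation threshold are illustrative values rather than the ones used in our system:

import numpy as np

def detect_audio_cuts(samples, sample_rate, win_sec=0.5, alpha=0.3, threshold=2.0):
    """Find audio cuts by comparing each window's spectrum to a prediction."""
    win = int(win_sec * sample_rate)
    prediction = None
    cuts = []
    for start in range(0, len(samples) - win + 1, win):
        frame = np.asarray(samples[start:start + win], dtype=float)
        # Frequency/intensity spectrum of the current time window.
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(win)))
        if prediction is not None:
            # Relative deviation of the observed spectrum from the prediction.
            deviation = np.linalg.norm(spectrum - prediction) / (np.linalg.norm(prediction) + 1e-9)
            if deviation > threshold:
                cuts.append(start / sample_rate)   # audio cut at this time (seconds)
            # Exponential smoothing: predict the spectrum of the next window.
            prediction = alpha * spectrum + (1 - alpha) * prediction
        else:
            prediction = spectrum
    return cuts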
Once the video has been segmented into its basic components, it is essential to identify semantically rich events, e.g. close-ups of the main actors, gunfire, explosions, and text appearing in the video. These events help us select those sequences of frames for our clips that are important for the abstract.
We eliminate accidental misclassifications of the face detector by discarding all face-based classes with fewer than three occurrences of a face, and by allowing up to two drop-outs in the face-tracking process. In a second step, face-based classes with similar faces are merged by face recognition algorithms [4] in order to obtain face-based classes that are as large as possible.
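For illustration, face-based classes could be built from per-frame face detections roughly as follows. The detection format and the same_face comparison function (e.g. backed by a face recognizer) are assumptions of the sketch; the limits of three occurrences and two drop-outs follow the description above:

def build_face_classes(detections, same_face, max_dropouts=2, min_occurrences=3):
    """Group per-frame face detections within a shot into face-based classes.

    detections: list of (frame_no, face_descriptor) pairs in temporal order.
    same_face:  comparison function, e.g. backed by a face recognizer.
    """
    classes = []
    for frame_no, face in detections:
        for cls in classes:
            last_frame, last_face = cls[-1]
            # Continue a track if the face matches and at most
            # max_dropouts frames were missed in between.
            if same_face(face, last_face) and frame_no - last_frame <= max_dropouts + 1:
                cls.append((frame_no, face))
                break
        else:
            classes.append([(frame_no, face)])
    # Discard likely misclassifications of the face detector.
    return [cls for cls in classes if len(cls) >= min_occurrences]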
Main actors. The same face recognition algorithms are used to identify and merge face-based classes of the same actor across shots throughout the video, resulting in so-called face-based sets. There is one face-based set for each main actor; it describes where, when, and at what size that actor appears in the video.
Dialog detection. It is now easy to detect typical shot/reverse-shot dialogs and multi-person dialogs. We search for sequences of face-based classes that are close together in time, with shot-overlapping face-based classes of the same actor and cross-over relations between different actors. For example, a male and a female actor could appear in an m-f-m-f sequence. An example of a dialog automatically detected in this way is shown in Figure 3.
Figure 3: An automatically detected dialog, including a face-based class spanning several shots.
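A candidate sequence of face-based classes could be tested for such an alternation pattern with a sketch like the following; the minimum number of cross-overs is an illustrative assumption:

def is_dialog(actor_sequence, min_alternations=3):
    """Heuristic test for a shot/reverse-shot or multi-person dialog.

    actor_sequence: temporally ordered actor identities of the face-based
    classes in a candidate group of shots, e.g. ["m", "f", "m", "f"].
    """
    if len(set(actor_sequence)) < 2:
        return False
    # Count cross-over relations: adjacent classes showing different actors.
    alternations = sum(1 for a, b in zip(actor_sequence, actor_sequence[1:]) if a != b)
    return alternations >= min_alternations

print(is_dialog(["m", "f", "m", "f"]))   # True: typical m-f-m-f pattern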
A special feature of our trailer generation technique is that the end of the movie is not revealed;
we simply do not include clips from the last 20% of the movie. This guarantees that we don’t take
away the suspense.
Clip Selection
The user of our abstracting system can specify a target length not to be exceeded by the video
abstract. When selecting clips the system has to come up with a compromise between the target
length and the above heuristics. This is done in an iterative way. Initially, all scenes of the first
80% of the movie are in the scene candidate set. All decisions have to be based on physical
parameters of the video because only those can be derived automatically. Thus the challenge is to
determine relevant scenes, and a good clip as a subset of frames of each relevant scene, based on
computable parameters.
We use two different mechanisms to select relevant scenes and clips. The first mechanism extracts
special events and texts from the video, such as gunfire, explosions, cries, close-up shots, dialogs of main actors, and title text. We claim that these events and texts summarize the video well and
are suited to attract the viewer’s attention (see properties (1)-(4) above). The identification of the
relevant sequences of frames is based on the algorithms described above, and is fully automatic.
The percentage of special events to be contained in the abstract can be specified as a parameter by
the user. In our experiments it was set to 50%. If the total length of the special-event clips selected by the first mechanism is longer than desired, scenes and clips are chosen uniformly at random from the different types of events. The title text, however, is always contained in the abstract.
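A rough sketch of this first mechanism; the data layout and the handling of the length budget are assumptions, while the 50% special-event share and the rule that the title text is always included follow the description above:

import random

def select_event_clips(event_clips, title_clip, target_len, event_share=0.5):
    """First mechanism: pick special-event clips up to a length budget.

    event_clips: dict mapping an event type (e.g. "explosion", "dialog")
                 to a list of (clip_id, duration) pairs.
    title_clip:  (clip_id, duration) pair; always included.
    """
    budget = event_share * target_len
    selected = [title_clip]                    # the title text is always contained
    used = title_clip[1]
    remaining = {etype: list(clips) for etype, clips in event_clips.items()}
    # Choose clips uniformly at random across the different event types
    # until the special-event budget is exhausted.
    while used < budget and any(remaining.values()):
        etype = random.choice([e for e, clips in remaining.items() if clips])
        clip = remaining[etype].pop(random.randrange(len(remaining[etype])))
        if used + clip[1] <= budget:
            selected.append(clip)
            used += clip[1]
    return selected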
The second mechanism adds filler clips from different parts of the movie to complete the trailer.
To do so, the remaining scenes are divided into several non-overlapping sections of about the
same length. We have used eight sections in our experiments. The number of clips and their total
length within each section are determined. Clips are then selected repeatedly from those sections
with the lowest share in the abstract so far, until the target length of the trailer is reached. This
mechanism ensures good coverage of all parts of the movie even if special events occur only in
some sections.
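A sketch of this second, section-balancing mechanism; the (clip, duration) representation is an assumption, the balancing rule follows the description above:

def add_filler_clips(trailer, sections, target_len):
    """Second mechanism: fill the abstract evenly from temporal sections.

    trailer:  list of (clip_id, duration) pairs selected so far.
    sections: one list of (clip_id, duration) pairs per section of the movie
              (eight sections in our experiments).
    """
    total = sum(duration for _, duration in trailer)
    share = [0.0] * len(sections)      # length contributed by each section so far
    while total < target_len and any(sections):
        # Take the next clip from the section with the lowest share
        # that still has clips left.
        idx = min((i for i, s in enumerate(sections) if s), key=lambda i: share[i])
        clip_id, duration = sections[idx].pop(0)
        trailer.append((clip_id, duration))
        share[idx] += duration
        total += duration
    return trailer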
In general clips must be much shorter than scenes. So how is a clip extracted from a scene? We
have tried out two heuristics. With the first one, we pick those shots with the highest amount of
action, and with the same basic color composition as the average of the movie. More details can
be found in [8]. Action is defined through motion, either object motion or camera motion, and the
amount of motion in a sequence of frames can easily be computed based on motion vectors or on
the edge change ratio. The action criterion is motivated by the fact that action clips are often more interesting and carry more content in a short time than calm clips. The idea behind the color criterion is that colors are an important component of the perception of a video’s mood, and the color composition should thus be preserved in the trailer.
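As an illustration, the first heuristic could rank the shots of a scene with a simple score; the linear weighting of the action and color criteria is an assumption of the sketch:

import numpy as np

def pick_clip_shots(scene_shots, movie_color_avg, w_action=1.0, w_color=1.0, n_shots=1):
    """First clip-extraction heuristic: prefer high action and average colors.

    Each shot is a dict with an "action" score (e.g. mean edge change ratio
    or motion-vector magnitude) and a "color_hist" histogram.
    """
    avg = np.asarray(movie_color_avg, dtype=float)

    def score(shot):
        color_dist = np.linalg.norm(np.asarray(shot["color_hist"], dtype=float) - avg)
        # High action, color composition close to the movie's average.
        return w_action * shot["action"] - w_color * color_dist

    return sorted(scene_shots, key=score, reverse=True)[:n_shots]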
The second heuristic takes a completely different approach. It uses the results of our MoCA genre
recognition project. The basic idea of that project is to compute a large number of audio-visual
parameters from an input video and use them to classify the video into a genre such as newscast,
soccer, tennis, talk show, music clip, cartoon, feature film, or commercial. The classification is
based on characteristic parameter profiles, derived beforehand and stored in a database. The
results of this project can now be used to select clips for the trailer in a more sophisticated way:
Those clips closest in parameter values to the characteristic profile of the entire movie are
selected. The advantage of this approach is that it automatically tailors the selection process to a specific genre, provided that a characteristic parameter profile is available for it.
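A sketch of this profile-based selection; using the Euclidean distance between parameter vectors as the similarity measure is an assumption:

import numpy as np

def select_by_genre_profile(clips, movie_profile, n_clips):
    """Second heuristic: pick the clips closest to the movie's parameter profile.

    clips:         list of (clip_id, feature_vector) pairs of audio-visual parameters.
    movie_profile: characteristic parameter profile of the entire movie.
    """
    profile = np.asarray(movie_profile, dtype=float)
    # Rank clips by distance to the characteristic profile and keep the closest.
    ranked = sorted(clips, key=lambda c: np.linalg.norm(np.asarray(c[1], dtype=float) - profile))
    return [clip_id for clip_id, _ in ranked[:n_clips]]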
Clip Assembly
In the assembly stage, the selected video clips and their respective audio tracks are composed into
the final form of the abstract. We have experimented with two degrees of freedom in the composition process:
• ordering, and
• edits (types of transition) between the clips.
Ordering. Pryluck et al. showed that the sequencing of clips strongly influences the viewer’s perception of their meaning [9]. Therefore, the ordering of the clips must be done very carefully. We first group the video clips into four classes. The first class, the event class, contains the special events, currently gunfire and explosions. The second class consists of dialogs, while the filler clips constitute the third class. The extracted text (in the form of bitmaps and ASCII text) falls into the fourth class. Within each class the original temporal order is preserved.
Figure 4: Temporal distribution of the detected video and audio events of the movie “Groundhog Day”, as well as of those chosen during the clip selection process to be part of the trailer. Note that since “Groundhog Day” is not an action movie, there are only two explosions and no gunfire. Each box represents two seconds (2828 in total). Time passes from left to right and top to bottom.
Dialogs and event clips are assembled in turn into so-called edited groups. The maximum length
of an edited group is a quarter of the length of the total share of special events. The gaps between
the edited groups are filled with the remaining clips, resulting in a preliminary abstract.
The text occurrences in class four usually show the film title and the names of the main actors.
The title bitmap is always added to the trailer, cut to a length of one second. Optionally, the actors’
names can be added to the trailer.
Edits. We apply three different types of video edits in the abstract: hard cuts, dissolves, and
wipes. Their usage is based on general rules derived from knowledge elicited from professional
cutters [6]. This is a research field in its own right. As a preliminary solution we found it reasonable to concatenate special event clips with every other type of clip by means of hard cuts and insert soft cuts (dissolves and wipes) between calmer clips only, such as dialogs. Table 1 shows the possible usage of edits in the different cases. A much more sophisticated approach for automatic video editing of humorous themes can be found in [6].
               Event Clips   Dialog Clips                     Other Clips
Event Clips    hard cut      hard cut                         hard cut
Dialog Clips   hard cut      dissolve, wipe, fade             hard cut, dissolve, wipe, fade
Other Clips    hard cut      hard cut, dissolve, wipe, fade   hard cut, dissolve, wipe, fade

Table 1: Edits in an abstract
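For illustration, the rules of Table 1 can be written down as a lookup table; choosing one of the allowed edits at random is an assumption of the sketch, whereas the actual choice may follow further rules from [6]:

import random

# Allowed edits between consecutive clips, following Table 1.
EDIT_RULES = {
    ("event", "event"):   ["hard cut"],
    ("event", "dialog"):  ["hard cut"],
    ("event", "other"):   ["hard cut"],
    ("dialog", "event"):  ["hard cut"],
    ("dialog", "dialog"): ["dissolve", "wipe", "fade"],
    ("dialog", "other"):  ["hard cut", "dissolve", "wipe", "fade"],
    ("other", "event"):   ["hard cut"],
    ("other", "dialog"):  ["hard cut", "dissolve", "wipe", "fade"],
    ("other", "other"):   ["hard cut", "dissolve", "wipe", "fade"],
}

def choose_edit(from_type, to_type):
    """Pick one of the edits allowed for this pair of clip types."""
    return random.choice(EDIT_RULES[(from_type, to_type)])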
Interestingly, audio editing is much more difficult. A first attempt to simply concatenate the sound tracks belonging to the selected clips produced terrible audio. In dialog scenes it is especially
important that audio cuts have priority over video cuts. The construction of the audio track of the
abstract is currently performed as follows:
• The audio of special event clips is used as it is in the original.
• The audio of dialogs respects audio cuts in the original. The audio of each dialog is shortened just enough to fill the gaps between the audio of the special events. Dissolves are the primary means of concatenation.
• The entire audio track of the abstract is underlaid with the title music. During dialogs and special events the title music is reduced in volume, as in the mixing sketch below.
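A sketch of the music underlay; the sample-array representation, the busy_mask marking dialog and special-event passages, and the ducking gain of 0.3 are assumptions:

import numpy as np

def underlay_title_music(abstract_audio, title_music, busy_mask, duck_gain=0.3):
    """Mix the title music beneath the abstract's audio track.

    abstract_audio: 1-D array of samples of the assembled abstract.
    title_music:    1-D array of samples; repeated or cut to the same length.
    busy_mask:      boolean array marking samples of dialogs and special events.
    """
    abstract_audio = np.asarray(abstract_audio, dtype=float)
    music = np.resize(np.asarray(title_music, dtype=float), abstract_audio.shape)
    # Reduce the music volume wherever dialog or special-event audio plays.
    gain = np.where(busy_mask, duck_gain, 1.0)
    return abstract_audio + gain * music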
We are planning to experiment with speaker recognition and with speech recognition to be able to
use higher-level semantics from the audio stream. The combination of speech recognition and
video analysis is especially promising.
4. Experimental Results
In order to evaluate the MoCA video abstracting system, we ran a series of experiments with
video sequences recorded from German television. We quickly found out that there is no absolute
measure for the quality of an abstract; even experienced movie directors told us that making good
trailers for a feature film is an art, not a science. It is interesting to observe that the shots extracted
by a human for an abstract depend to a large extent on the purpose of the abstract: For example, a
trailer for a movie often emphasizes thrill and action without giving away the end, a preview for a
documentary on television attempts to capture the essential contents as completely as possible,
and a review of last week’s soap opera highlights the most important events of the story. We conclude that automatic abstracting should be controlled by a parameter describing the purpose of the abstract.
An alternative presentation is a static scene graph of thumbnail images on a 2D canvas. The scene graph represents the flow of the story in the form of key frames and allows the user to descend interactively into the story by selecting a story unit of the graph [11].
Acknowledgments
Much of our work on movie content analysis was done jointly with Stephan Fischer, whose contributions to the MoCA project we gratefully acknowledge. We would also like to thank Ramesh Jain of UC San Diego for his assistance in the preparation of this paper.
References
[1] D. Bordwell, K. Thompson: Film Art: An Introduction. 4th ed., McGraw-Hill, 1993.
[2] M. Christel, T. Kanade, M. Mauldin, R. Reddy, M. Sirbu, S. Stevens, H. Wactlar: Informedia Digital Video Library. Communications of the ACM, Vol. 38, No. 4 (1995), pp. 57-58.
[3] A. Dailianas, R. B. Allen, P. England: Comparison of Automatic Video Segmentation Algorithms. Proc. SPIE 2615, Photonics East 1995: Integration Issues in Large Commercial Media Delivery Systems, A. G. Tescher, V. M. Bove (Eds.), pp. 2-16.
[4] S. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back: Face Recognition: A Convolutional Neural Network Approach. IEEE Trans. Neural Networks, Special Issue on Neural Networks and Pattern Recognition, 1997, to appear.
[5] R. Lienhart: Automatic Text Recognition for Video Indexing. Proc. ACM Multimedia 1996, Boston, MA, pp. 11-20.
[6] F. Nack, A. Parkes: The Application of Video Semantics and Theme Representation in Automated Video Editing. Multimedia Tools and Applications, Vol. 4, No. 1 (1997), pp. 57-83.
[7] S. Pfeiffer, S. Fischer, W. Effelsberg: Automatic Audio Content Analysis. Proc. ACM Multimedia 1996, Boston, MA, pp. 21-30.
[8] S. Pfeiffer, R. Lienhart, S. Fischer, W. Effelsberg: Abstracting Digital Movies Automatically. J. Visual Communication and Image Representation, Vol. 7, No. 4 (1996), pp. 345-353.
[9] C. Pryluck, C. Teddlie, R. Sands: Meaning in Film/Video: Order, Time and Ambiguity. J. Broadcasting, Vol. 26 (1982), pp. 685-695.
[10] H. A. Rowley, S. Baluja, T. Kanade: Human Face Detection in Visual Scenes. Technical Report CMU-CS-95-158R, School of Computer Science, Carnegie Mellon University, November 1995.
[11] M. Yeung, B.-L. Yeo, B. Liu: Extracting Story Units from Long Programs for Video Browsing and Navigation. Proc. IEEE Multimedia Computing & Systems 1996, Hiroshima, Japan, pp. 296-305.
[12] R. Zabih, J. Miller, K. Mai: A Feature-Based Algorithm for Detecting and Classifying Scene Breaks. Proc. ACM Multimedia 1995, San Francisco, CA, pp. 189-200.