Story Segmentation and Detection of Commercials in Broadcast News Video
Figure 5: Image Features for Segmentation. The manually identified segments (1) are at the top, and aligned underneath are MPEG optical flow (2), scene breaks (3), black frames (4), all detected faces (5), similar color image features (6), and similar faces (7).
Acoustic Environment Change. Changes in background noise, recording channel (telephones and high-quality microphones, for example, produce signals with distinct spectral qualities), or speaker can cause long-term changes in the spectral composition of the signal. By classifying acoustically similar segments into a few basic types, the locations of these changes can be identified. This segmentation based on acoustic similarity can also be performed by the CMUseg package.

Signal-to-Noise Ratio (SNR). The signal-to-noise ratio attempts to capture some of the effects of the acoustic environment by measuring the relative power in the two major spectral peaks in a signal. While there are a number of ways to compute the SNR of an acoustic signal, none of them perfect, we have used the approach to SNR computation described in [Hauptmann97] with a window size of 0.25 seconds.
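The paper defers the exact computation to [Hauptmann97]; purely as an illustration, a windowed estimate along these lines might look like the following sketch, where the Hann taper, the use of the two largest spectral magnitudes as signal and noise estimates, and the dB scaling are our own assumptions:

```python
import numpy as np

def windowed_snr(samples, sample_rate, window_sec=0.25):
    """Crude SNR track: in each 0.25 s window, compare the two largest
    peaks of the magnitude spectrum, treating the larger as signal and
    the smaller as noise (hypothetical variant; the paper only says it
    follows [Hauptmann97]). `samples` is a 1-D NumPy array of PCM values."""
    win = int(sample_rate * window_sec)
    taper = np.hanning(win)
    snr_db = []
    for start in range(0, len(samples) - win + 1, win):
        spectrum = np.abs(np.fft.rfft(samples[start:start + win] * taper))
        noise, signal = np.sort(spectrum)[-2:]   # two largest magnitudes
        snr_db.append(20 * np.log10(signal / noise) if noise > 0 else 0.0)
    return snr_db
```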
To date we have only made informal attempts to include this audio signal data in our segmentation heuristics. We will report the results of this effort in a future paper.

Figure 6 shows these audio features as they relate to a perfect (human) segmentation.

Figure 6: Audio Features for Segmentation. At the top are the manually found segments (1), followed by silences based on spectral analysis (2), speech recognition segments (3), silences in speech (4), signal-to-noise ratio (5), and maximum amplitude (6).

We fully exploit the speech recognition transcript and the silences identified in the transcript for segmentation. Some of the other features computed by CMUseg are also used for adapting the speech recognition (SR) to its acoustic environment, and for segmenting the raw signal into sections that the recognizer can accommodate. This segmentation is not topic or news story based, but instead simply identifies short "similar" regions as units for SR. These segments are indicative of short "phrase or breath groups", but are not yet used in the story segmentation processing.

METHOD
This section describes the use of image, acoustic, text and timing features to segment news shows into stories and to prepare those stories for search.

DETECTING COMMERCIALS USING IMAGE FEATURES
Although there is no single image feature that allows one to tell commercials from the news content of a broadcast, we have found that a combination of simple features does a passable job. The two features used in the simplest version of the commercial detector are the presence of black frames and the rate of scene changes.

Black frames are frequently broadcast for a fraction of a second before, after, and between commercials, and can be detected by looking for a sequence of MPEG frames with low brightness. Of course, black frames can occur for other reasons, including a directorial decision to fade to black or during video shot outdoors at night. For this reason, black frames are not reliably associated with commercials.

Because advertisers try to make commercials seem more interesting by rapidly cutting between different shots, sections of programming with commercials tend to have more scene breaks than are found in news stories. Scene breaks are computed by the Informedia system as a matter of course, to enable generation of the key frames that represent a salient section of a story when results are presented from a search, and for the film strip view [Figure 1] that visually summarizes a story. These scene changes are detected by hypothesizing a break when the color histogram changes rapidly over a few frames, and rejecting that hypothesis if the optical flow in the image does not show a random pattern. Pans, tilts and zooms, which are not shot breaks, cause color histogram changes but have smooth optical flow fields associated with them.
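As a rough illustration of these two detectors (the thresholds and the flow_smoothness input are invented for the sketch; the paper does not specify them), black frames can be flagged by mean frame brightness, and scene breaks by a sharp color-histogram change that is not accompanied by a smooth optical-flow field:

```python
import numpy as np

def is_black_frame(frame, brightness_thresh=20):
    """Flag a frame as 'black' if its mean luma is very low.
    `frame` is an HxW array of 0-255 luminance values."""
    return frame.mean() < brightness_thresh

def histogram_change(prev_frame, frame, bins=64):
    """L1 distance between normalized gray-level histograms of
    consecutive frames; large values suggest a shot break."""
    h1, _ = np.histogram(prev_frame, bins=bins, range=(0, 255), density=True)
    h2, _ = np.histogram(frame, bins=bins, range=(0, 255), density=True)
    return np.abs(h1 - h2).sum()

def is_scene_break(prev_frame, frame, flow_smoothness,
                   hist_thresh=0.5, smooth_thresh=0.8):
    """Hypothesize a break on a rapid histogram change, then reject it
    if the optical flow is smooth (pan/tilt/zoom rather than a cut).
    `flow_smoothness` in [0, 1] is assumed to come from an MPEG
    motion-vector analysis, with 1.0 a perfectly coherent field."""
    return (histogram_change(prev_frame, frame) > hist_thresh
            and flow_smoothness < smooth_thresh)
```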
These two sources of information are combined in the following, rather ad hoc, heuristic (sketched in code after the list):

1. Probable black frame events and scene change events are identified.
2. Sections of the show that are surrounded by black frames separated by less than 1.7 times the mean distance between black frames in the show, and that have sections meeting that criterion on either side, are marked as probably being commercials on the basis of black frames.

3. Sections of the show that are surrounded by shot changes separated by less than the mean distance between shot changes for the whole show, and that are surrounded by two sections meeting the same criterion on either side, are marked as probably occurring in commercials on the basis of shot change rate.

4. Initially, a commercial is hypothesized over a period if either criterion 2 or 3 is met.

5. Short hypothesized stories, defined as non-commercial sections less than 35 seconds long, are merged with their following segment. Then short hypothesized ads, less than 28 seconds long, are merged into the following segment.

6. Because scene breaks are somewhat less reliably detected at the boundaries of advertising sections, black frame occurrences are used to "clean up" boundaries. Hypothesized starts of advertisements are moved to the time of any black frame occurring up to 4.5 seconds before. Hypothesized ends of advertisements are moved to the time of any black frame appearing up to 4.5 seconds after the hypothesized time.

7. Because scene changes can also occur rapidly in non-commercial content, after the merging steps described above, any "advertising" section containing no black frames at all is relabeled as news.

8. Finally, as a sort of inverse of step 6, because some sections of news reports, and in particular weather reports, have rapid scene changes, transitions into and out of advertisements are made only on a black frame, if at all possible. If a commercial does not begin on a black frame, its start is adjusted to any black frame within 90 seconds after its start and preceding its end. Commercial ends are moved back to preceding black frames in a similar manner.

Figure 7: The commercial detection code in Informedia combines image features to hypothesize commercial locations. The bottom two graphs show the raw signals for black frames and scene changes respectively. The graph at the top shows the hand-identified positions of the commercials. The graphs running from bottom to top show successive stages of processing. The numbers in parentheses correspond to the steps outlined in the text.

Figure 7 shows how this process is used to detect the commercials in an example CNN news broadcast.
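Steps 2 through 5 might be sketched as follows. The helper names and list representations are ours, and two details are simplified for brevity: the neighboring-section requirement of steps 2 and 3 is omitted, and the black-frame boundary snapping of steps 6 through 8 is left out. The paper gives the heuristic only in prose, so this is an illustration, not the Informedia code:

```python
def mark_dense_regions(event_times, scale):
    """Steps 2-3: mark stretches where consecutive events (black frames
    or shot changes) are closer together than `scale` times the mean
    spacing observed over the whole show."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    if not gaps:
        return []
    limit = scale * sum(gaps) / len(gaps)
    regions, run_start = [], None
    for i, gap in enumerate(gaps):
        if gap < limit:
            if run_start is None:
                run_start = event_times[i]
        elif run_start is not None:
            regions.append((run_start, event_times[i]))
            run_start = None
    if run_start is not None:
        regions.append((run_start, event_times[-1]))
    return regions

def merge_short_segments(ad_regions, min_story=35.0, min_ad=28.0):
    """Step 5 (simplified): a news gap shorter than min_story seconds
    between hypothesized ads is absorbed into the ad block, and any
    remaining ad shorter than min_ad seconds is handed back to the news."""
    merged = []
    for start, end in sorted(ad_regions):
        if merged and start - merged[-1][1] < min_story:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_ad]

def hypothesize_commercials(black_times, shot_times):
    """Steps 2-4: a period is hypothesized as a commercial if either the
    black-frame criterion (1.7x mean spacing) or the shot-change
    criterion (1.0x mean spacing) marks it."""
    regions = (mark_dense_regions(black_times, 1.7)
               + mark_dense_regions(shot_times, 1.0))
    return merge_short_segments(regions)
```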
2. The captioned transcript is then examined for obvious pauses that would indicate uncaptioned commercials. If there is a time gap longer than a threshold value of 15 seconds in the closed caption transmission, this gap is labeled as a possible commercial and a definite story segmentation boundary. Similarly, if there are multiple blank lines (three or more in the current implementation), a story boundary is presumed at that location. In Figure 2 above, story segmentation boundaries would be hypothesized at 0, 29, and 44 seconds (see the sketch following this list).

3. Next, the image-based commercial detection code described in the previous section is used to hypothesize additional commercial boundaries.

4. The words in the transcript are now aligned with the words from the speech recognition output, and timings are transferred from the speech recognizer words to the transcript words. After this alignment, each word in the transcript is assigned as accurate a start and end time as possible, based on the timing found in the speech.

5. Speech recognition output is inserted into all regions where it is available and where no captioned text words were received for more than 7 seconds.
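A minimal sketch of the caption-gap test in step 2 and the timing transfer in step 4, assuming each word is a (text, start_time) pair and using a generic longest-common-subsequence matcher in place of whatever aligner Informedia actually used:

```python
import difflib

def caption_boundaries(caption_words, gap_thresh=15.0):
    """Step 2: a silence of more than gap_thresh seconds in the closed
    caption transmission is taken as a possible commercial and a
    definite story boundary. Each word is a (text, start_time) pair."""
    boundaries = []
    for (_, t0), (_, t1) in zip(caption_words, caption_words[1:]):
        if t1 - t0 > gap_thresh:
            boundaries.append(t1)
    return boundaries

def transfer_timings(caption_words, sr_words):
    """Step 4: align caption text to speech-recognizer output and copy
    the recognizer's word times onto the matching caption words."""
    cap_text = [w for w, _ in caption_words]
    sr_text = [w for w, _ in sr_words]
    timed = {}  # caption word index -> recognizer start time
    matcher = difflib.SequenceMatcher(a=cap_text, b=sr_text, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            timed[block.a + offset] = sr_words[block.b + offset][1]
    # Words with no match keep their original caption timing.
    return [(w, timed.get(i, t)) for i, (w, t) in enumerate(caption_words)]
```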
Figure 8: A comparison of manual segmentation with the automatic segmentation described in this paper
shows very high correspondence between the segment markers.
One metric for segmentation proposed at the recent Topic Detection and Tracking Workshop [Yamron97, Beeferman97] is based on the probability of finding a segment boundary between two randomly chosen words. The error probability is a weighted combination of two parts, the probability of a false alarm and the probability of a missed boundary. The error probabilities are defined as:

P_{Miss} = \frac{\sum_{i=1}^{N-k} \delta_{hyp}(i, i+k) \cdot (1 - \delta_{ref}(i, i+k))}{\sum_{i=1}^{N-k} (1 - \delta_{ref}(i, i+k))}

P_{FalseAlarm} = \frac{\sum_{i=1}^{N-k} (1 - \delta_{hyp}(i, i+k)) \cdot \delta_{ref}(i, i+k)}{\sum_{i=1}^{N-k} \delta_{ref}(i, i+k)}

where the summations are over all the words in the broadcast and where:

\delta(i, j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are from the same story} \\ 0 & \text{otherwise} \end{cases}

The choice of k is a critical consideration in order to produce a meaningful and sensitive evaluation. Here it is set to half the average length of a story.
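If each word carries the ID of the story containing it, in both the reference and the hypothesized segmentation, the two error rates fall directly out of these definitions. The following is our own transcription of the formulas, not the workshop's scoring tool:

```python
def segmentation_error_rates(ref_ids, hyp_ids, k):
    """Compute (P_Miss, P_FalseAlarm) for word-level story labels.
    ref_ids[i] / hyp_ids[i] give the story ID of word i in the reference
    and hypothesized segmentations; k is the word offset (here, half the
    average story length in words)."""
    assert len(ref_ids) == len(hyp_ids)
    n = len(ref_ids)
    miss_num = miss_den = fa_num = fa_den = 0
    for i in range(n - k):
        d_ref = 1 if ref_ids[i] == ref_ids[i + k] else 0   # same ref story?
        d_hyp = 1 if hyp_ids[i] == hyp_ids[i + k] else 0   # same hyp story?
        miss_num += d_hyp * (1 - d_ref)   # reference boundary missed
        miss_den += 1 - d_ref
        fa_num += (1 - d_hyp) * d_ref     # hypothesized boundary not in ref
        fa_den += d_ref
    p_miss = miss_num / miss_den if miss_den else 0.0
    p_false_alarm = fa_num / fa_den if fa_den else 0.0
    return p_miss, p_false_alarm
```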
We asked three volunteers to manually split 13 TV broadcast news shows at the appropriate story boundaries according to the following instructions:

"For each story segment, write down the frame number of the segmentation break, as well as the type with which you would classify this segment. The types are:

P "Beginning of PREVIEW". The beginning of a news show, in which the news anchors introduce themselves and give the headlines of one or more news stories. If each of 3 anchors had introduced 2 stories in sequence, there would be 6 beginning-of-preview markers.

T "Beginning of searchable content: TOPIC". This is the most typical segment boundary that we expect to be useful to an Informedia user. Every actual news story should be marked with such a boundary. Place the marker at the beginning of the news story, together with the frame number at that point. Do include directly preceding words, like "Thanks, Jack. And now for something completely different. In Bosnia today …"

S "Beginning of searchable content: SUBTOPIC". These subtopics mark the boundaries between different segments within the same news story. As a rule, whenever the anchorperson changes but talks about the same basic issue or topic as in the last segment, this is a subtopic. These subtopics are usually preceded by a phrase like "And now more from Bernard Shaw at the White House". Keep the "And now more from Bernard Shaw …" in the last segment.

C "Beginning of COMMERCIAL". This type of segment covers all major commercials with typical durations of 30 or 45 seconds up to one minute. The category also covers smaller promotional clips such as "World View is brought to you in part by Acme Punctuation Products" or "This is CNN!"

For evaluation purposes, we compared the set of 749 manually segmented stories from 13 CNN news broadcasts with the stories segmented from the same broadcasts by the Informedia segmentation process described above. The only modification to the manual segmentation done according to the instructions above was that multiple consecutive commercials were grouped together as one commercial block. On these 13 news broadcasts, the automatic segmentation system averaged 15.16% incorrect segmentation according to this metric. By comparison, human-human agreement between the 3 human segmenters averaged 10.68% error. The respective false alarm and miss rates are also shown in Table 1.

                                        P(err)    P(FalseAlarm)   P(Miss)
  AutoSegment                           15.16%    7.96%           26.9%
  Inter-Human Comparison                10.68%    8.26%           15.35%
  No Segmentation                       36.91%    0.0%            99.78%
  Segment every second                  62.99%    99.99%          0.0%
  Segment every 1180 frames             60.86%    91.08%          9.32%
  (Average Story Length)

Table 1: Performance of the automatic segmentation procedure on evening news shows. Average human segmentation performance is given for comparison. The results for no segmentation, segmentation every second, and segmentation into fixed-width blocks corresponding to the average reference story length are given for reference.
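As a quick sanity check on the segmentation_error_rates sketch above (synthetic labels, not the paper's data), the degenerate strategies behave as the corresponding rows of Table 1 suggest:

```python
# Synthetic 'broadcast': 10 reference stories of 100 words each.
ref = [i // 100 for i in range(1000)]
k = 50                            # half the average story length in words

one_story = [0] * 1000            # no segmentation at all
every_word = list(range(1000))    # a boundary after every word

print(segmentation_error_rates(ref, one_story, k))   # (1.0, 0.0): every boundary missed
print(segmentation_error_rates(ref, every_word, k))  # (0.0, 1.0): false alarms everywhere
```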
DISCUSSION
Unfortunately, these results cannot be directly compared with either the results in [Brown95] or [Merlino97]. Brown et al. used a criterion of recall and precision for information retrieval. This was only possible with respect to a set of information retrieval queries, and given the existence of a human relevance judgement for every query against every document. In our study, the segmentation effectiveness was evaluated on its own, and we have yet to evaluate its effect on the effectiveness of information retrieval.
[Merlino97] reported a very limited experiment in story detection with the Broadcast News Navigator. In effect, they only reported whether the main news stories were segmented or not, ignoring all minor news stories, commercials, previews, etc. Thus their metrics reflect the ability of the Broadcast News Navigator system to detect core news stories (corresponding only to segments labeled T and S by our human judges). The metrics for the BNN also ignored the timing of the story. Thus they did not take into account whether the detected story began and ended at the right frame compared to the reference story.

We have achieved very promising results with automatic segmentation that relies on video, audio and closed-captioned transcript sources. The remaining challenges include:

• The full integration of all the available audio and image features in addition to the text features. While we have discussed how various features could be useful, we have not yet been able to fully integrate all of them.

• Gaining the ability to automatically train segmentation algorithms such as the one described here and to learn similar or improved segmentation strategies from a limited set of examples. As different types of broadcast are segmented, we would like the system to automatically determine relevant features and exploit them.

• Completely avoiding the use of the closed-captioned transcripts for segmentation. While the closed-captioned transcripts provide a good source of segmentation information, there is much data that is not captioned. We would like to adapt our approach to work without the captioned text, relying entirely on the speech recognizer transcription, the audio signal and the video images.

In the near term we plan to use the EM algorithm [Dempster77] to combine many features into one segmentation strategy, and to learn segmentation from data for which only a fraction has been hand-labeled. Work is also currently underway in the Informedia project to evaluate the effectiveness of the current segmentation approach when closed-captioning information is not available.

CONCLUSION
The current results provide a baseline performance figure, demonstrating what can be done with automatic methods when the full spectrum of information is available from speech, audio, image and closed-caption transcripts. The initial subjective reaction is that the system performs quite well in practice using the current approach. The future challenge lies in dealing with uncaptioned, speech-transcribed data, since the speech-recognition-generated transcript contains a significant word error rate.

The adequacy of segmentation depends on what you need to do with the segments. We are now in a position to evaluate the effectiveness of our segmentation process with respect to information retrieval, story tracking, or information extraction into semantic frames. Some approaches from the information retrieval literature [Kaszkiel97] claim that overlapping windows within an existing document can improve the accuracy of information retrieval. It remains for future work to determine whether a modification of this technique can circumvent the problem of static segmentation in the broadcast news video domain.

Segmentation is an important, integral part of the Informedia digital video library. The success of the Informedia project hinges on two critical assumptions: that we can extract a sufficiently accurate speech recognition transcript from the broadcast audio, and that we can segment the broadcast into video paragraphs (stories) that are useful for information retrieval. In previous papers [Hauptmann97, Witbrock97, Witbrock98], we have shown that speech recognition is sufficient for information retrieval of pre-segmented video news stories. In this paper we have addressed the issue of segmentation and demonstrated that a fully automatic system can successfully extract story boundaries using the available audio, video and closed-captioning cues.

ACKNOWLEDGMENTS
This paper is based on work supported by the National Science Foundation, DARPA and NASA under NSF Cooperative Agreement No. IRI-9411299. We thank Justsystem Corporation for supporting the preparation of the paper. Many thanks to Doug Beeferman, John Lafferty and Dimitris Margaritis for their manual segmentation and the use of their scoring program.
REFERENCES
[Beeferman97] Beeferman, D., Berger, A., and Lafferty, J., Text segmentation using exponential models. In Proc. Empirical Methods in Natural Language Processing 2 (AAAI) '97, Providence, RI, 1997.

[Brown95] Brown, M. G., Foote, J. T., Jones, G. J. F., Spärck-Jones, K., and Young, S. J., Automatic Content-Based Retrieval of Broadcast News, ACM Multimedia-95, pp. 35-42, San Francisco, CA, 1995.

[CMUseg97] CMUseg, Carnegie Mellon University Audio Segmentation Package, ftp://jaguar.ncsl.nist.gov/pub/CMUseg_0.4a.tar.Z, 1997.

[Dempster77] Dempster, A., Laird, N., and Rubin, D., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39, 1, pp. 1-38, 1977.

[Grice75] Grice, H. P., Logic and Conversation. In P. Cole (ed.), Syntax and Semantics, Vol. 3, pp. 41-58, Academic Press, New York, 1975.

[Hampapur94] Hampapur, A., Jain, R., and Weymouth, T., Digital Video Segmentation, ACM Multimedia 94, pp. 357-364, ACM Int'l Conf. on Multimedia, 15-20 Oct. 1994, San Francisco, CA.

[Hauptmann95] Hauptmann, A.G. and Smith, M.A., Text, Speech and Vision for Video Segmentation: the Informedia Project. AAAI Fall Symposium on Computational Models for Integrating Language and Vision, Boston, MA, Nov. 10-12, 1995.

[Hauptmann95b] Hauptmann, A.G., Witbrock, M.J., Rudnicky, A.I., and Reed, S., Speech for Multimedia Information Retrieval, UIST-95 Proceedings of the User Interface Software Technology Conference, Pittsburgh, PA, November 1995.

[Hauptmann97] Hauptmann, A.G. and Witbrock, M.J., Informedia News on Demand: Multimedia Information Acquisition and Retrieval, in Maybury, M.T. (ed.), Intelligent Multimedia Information Retrieval, AAAI Press/MIT Press, Menlo Park, CA, 1997.

[Hauptmann97b] Hauptmann, A.G., Witbrock, M.J., and Christel, M.G., Artificial Intelligence Techniques in the Interface to a Digital Video Library, Proceedings of the CHI-97 Computer-Human Interface Conference, New Orleans, LA, March 1997.

[Hearst93] Hearst, M.A. and Plaunt, C., Subtopic structuring for full-length document access, in Proc. ACM SIGIR-93 Int'l Conf. on Research and Development in Information Retrieval, pp. 59-68, Pittsburgh, PA, 1993.

[Hwang94] Hwang, M., Rosenfeld, R., Thayer, E., Mosur, R., Chase, L., Weide, R., Huang, X., and Alleva, F., "Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in SPHINX-II," ICASSP-94, vol. I, pp. 549-552, 1994.

[Kaszkiel97] Kaszkiel, M. and Zobel, J., Passage Retrieval Revisited, pp. 178-185, SIGIR-97 Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, PA, July 27-31, 1997.

[Kobla97] Kobla, V., Doermann, D., and Faloutsos, C., Video Trails: Representing and Visualizing Structure in Video Sequences, ACM Multimedia 97, Seattle, WA, Nov. 1997.

[Mani96] Mani, I., House, D., Maybury, M., and Green, M., "Towards Content-Based Browsing of Broadcast News Video", in Maybury, M.T. (ed.), Intelligent Multimedia Information Retrieval, 1997.

[Maybury96] Maybury, M., Merlino, A., and Rayson, J., "Segmentation, Content Extraction and Visualization of Broadcast News Video using Multistream Analysis", in Proceedings of the ACM International Conference on Multimedia, Boston, MA, 1996.

[Merlino97] Merlino, A., Morey, D., and Maybury, M., Broadcast News Navigation using Story Segmentation, ACM Multimedia 1997, November 1997.

[MPEG-ISO] International Standard ISO/IEC-CD-11172, Information Technology - Coding of Moving Pictures & Associated Audio for Digital Storage, International Standards Organization.

[Nye84] Nye, H., The Use of a One Stage Dynamic Programming Algorithm for Connected Word Recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-32, No. 2, pp. 262-271, April 1984.

[Pentland94] Pentland, A., Moghaddam, B., and Starner, T., View-Based and Modular Eigenspaces for Face Recognition, IEEE Conference on Computer Vision & Pattern Recognition, Seattle, WA, July 1994.

[Placeway96] Placeway, P. and Lafferty, J., "Cheating with Imperfect Transcripts", ICSLP-96 Proceedings of the 1996 International Conference on Spoken Language Processing, Philadelphia, PA, October 1996.

[Rowley95] Rowley, H., Baluja, S., and Kanade, T., Human Face Detection in Visual Scenes, Carnegie Mellon University, School of Computer Science Technical Report CMU-CS-95-158, Pittsburgh, PA, 1995.

[Taniguchi95] Taniguchi, Y., Akutsu, A., Tonomura, Y., and Hamada, H., An Intuitive and Efficient Access Interface to Real-time Incoming Video based on Automatic Indexing, ACM Multimedia-95, pp. 25-33, San Francisco, CA, 1995.

[Wactlar96] Wactlar, H.D., Kanade, T., Smith, M.A., and Stevens, S.M., Intelligent Access to Digital Video: Informedia Project, IEEE Computer, 29 (5), pp. 46-52, May 1996. See also http://www.informedia.cs.cmu.edu/.

[Witbrock97] Witbrock, M.J. and Hauptmann, A.G., Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents, DL97, The Second ACM International Conference on Digital Libraries, Philadelphia, PA, July 23-26, 1997.

[Witbrock98] Witbrock, M.J. and Hauptmann, A.G., Speech Recognition in a Digital Video Library, Journal of the American Society for Information Science (JASIS), 1998, in press.

[Yamron97] Yamron, J., "Topic Detection and Tracking: Segmentation Task", Topic Detection and Tracking (TDT) Workshop, 27-28 October 1997, College Park, MD. Also in BNTUW-98, Proceedings of the Broadcast News Transcription and Understanding Workshop, Leesburg, VA, February 1998.

[Yeung96] Yeung, M. and Yeo, B.-L., "Time-constrained Clustering for Segmentation of Video into Story Units", in International Conference on Pattern Recognition, August 1996.

[Yeung96b] Yeung, M., Yeo, B.-L., and Liu, B., "Extracting Story Units from Long Programs for Video Browsing and Navigation", in International Conference on Multimedia Computing and Systems, June 1996.

[Zhang95] Zhang, H.J., Low, C.Y., Smoliar, S.W., and Wu, J.H., Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution, ACM Multimedia-95, pp. 15-24, San Francisco, CA, 1995.