
ADL-98 Advances in Digital Libraries Conference, Santa Barbara, CA, April 22-24, 1998.

Story Segmentation and Detection of Commercials in Broadcast News Video
Alexander G. Hauptmann
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890, USA
Tel: 1-412-348-8848
E-mail: [email protected]

Michael J. Witbrock
Justsystem Pittsburgh Research Center
4616 Henry St.
Pittsburgh, PA 15213, USA
Tel: 1-412-683-9486
E-mail: [email protected]

ABSTRACT
The Informedia Digital Library Project [Wactlar96] allows full content indexing and retrieval of text, audio and video material. Segmentation is an integral process in the Informedia digital video library. The success of the Informedia project hinges on two critical assumptions: that we can extract sufficiently accurate speech recognition transcripts from the broadcast audio and that we can segment the broadcast into video paragraphs, or stories, that are useful for information retrieval. In previous papers [Hauptmann97, Witbrock97, Witbrock98], we have shown that speech recognition is sufficient for information retrieval of pre-segmented video news stories. In this paper we address the issue of segmentation and demonstrate that a fully automatic system can extract story boundaries using available audio, video and closed-captioning cues.

The story segmentation step for the Informedia Digital Video Library splits full-length news broadcasts into individual news stories. During this phase the system also labels commercials as separate "stories". We explain how the Informedia system takes advantage of the closed captioning frequently broadcast with the news, how it extracts timing information by aligning the closed-captions with the result of the speech recognition, and how the system integrates closed-caption cues with the results of image and audio processing.

KEYWORDS: Segmentation, video processing, broadcast news story analysis, closed captioning, digital library, video library creation, speech recognition.

INTRODUCTION
By integrating technologies from the fields of natural language understanding, image processing, speech recognition and video compression, the Informedia digital video library system [Wactlar96, Witbrock98] allows comprehensive access to multimedia data. News-on-Demand [Hauptmann97] is a particular collection in the Informedia Digital Library that has served as a test-bed for automatic library creation techniques. We have applied speech recognition to the creation of a fully content-indexed library and to interactive querying.

The Informedia digital video library system has two distinct subsystems: the Library Creation System and the Library Exploration Client. The library creation system runs every night, automatically capturing, processing and adding current news shows to the library. During the library creation phase, the following major steps are performed:

1. Initially a news show is digitized into MPEG-I format. The audio and video tracks from the MPEG are split out and processed separately, with their relative timing information preserved, so that derived data in the two streams can be resynchronized to the correct frame numbers.

2. Speech contained in the audio track is transcribed into text by the Sphinx-II Speech Recognition System [Hwang94]. The resulting text transcript also contains timing information for each recognized word, recording to within 10 milliseconds when it began and when it ended.

3. Images from the video are searched for shot boundaries and representative frames within a shot. Other video processing searches for and identifies faces and text areas within the image, and the black frames frequently associated with commercials.

4. If closed-captioning is available, the captions are aligned to the words recognized by the speech recognition step. This enables the timing information provided by the speech recognition system to be imposed on the closed captions, which are usually a more accurate reflection of the spoken words.

5. The news show is segmented into individual news stories or paragraphs, allowing for information retrieval and playback of coherent blocks.

6. Meta-data abstractions of the stories including titles, skims, film-strips, key frames for shots, topic identifications and summaries are created [Hauptmann97b].

7. The news show and all its meta-data are combined with previous data and everything is indexed into a library catalog, which is then made available to the users via the Informedia client programs for search and exploration [Hauptmann97, Witbrock98, Witbrock97, Hauptmann95b].

This paper will describe, in detail, step 5 of the library creation process described above. This is the procedure that splits the whole news show into story segments.

THE PROBLEM OF SEGMENTATION
The broadcast news is digitized as a continuous stream. While the system can separate shows by using a timer to start daily recording at the beginning of the evening news and to stop recording after the news, the portion in between is initially one continuous block of video. This block may be up to one hour long. When a user asks the system for information relevant to a query, it is insufficient for it to respond by simply pointing the user at an entire hour of video. One would expect the system to return a reasonably short section, preferably a section only as long as necessary to provide the requested information.

Grice's maxims of conversation [Grice75] state that a contribution to a conversation should be as informative as required, as correct as possible and relevant to the aims of the ongoing conversation. A contribution should be clear, unambiguous and concise. The same requirements hold for the results returned in response to a user query. The returned segment should be as long as is necessary to be informative, yet as short as possible in order to avoid irrelevant information content.

In the current implementation of the News-on-Demand application, a "news story" has been chosen as the appropriate unit of segmentation. Entire segmented news stories are the video paragraphs or "document" units indexed and returned by the information retrieval engine in response to a query. When browsing the library, these news stories are the units of video presented in the results list and played back when selected by the user.

This differs from other work that treats video segmentation as a problem of finding scene cuts [Zhang95, Hampapur94] in video. Generally there are multiple scene cuts that comprise a news story, and these cuts do not correspond in a direct way to topic boundaries.

Work by Brown et al. [Brown95] clearly demonstrates the necessity of good segmentation. Using a British/European version of closed-captioning called "teletext", various news text segmentation strategies for closed-captioned data were compared. Their straightforward approach looked at fixed width text windows of 12, 24, 36, and 48 lines in length (with the line lengths defined by the output of the teletext decoder), overlapping by half the window size. The fixed partitioning into 36 lines per story worked best, and the 48-line segmentation was almost as good. To measure information retrieval effectiveness, the authors modified the standard IR metrics of precision and recall to measure the degree of overlap between the resulting documents and the objectively relevant documents. In experiments using a very small corpus of 319 news stories, with 59 news headlines functioning as queries into the text, the average precision dropped to 0.538, compared with 0.821 for information retrieval from perfectly segmented text documents. This preliminary research indicated that one may expect a 34.5% drop in retrieval effectiveness if one uses a fixed window of text for segmentation. Another metric showed that precision dipped to 0.407 with fixed window segmentation, from 0.821 for perfect text segmentation, a 50.5% decrease.

[Merlino97] presented empirical evidence that the speed with which a user could retrieve relevant stories that were well segmented was orders of magnitude greater than the speed of linear search or even a flat keyword-based search.

Achieving high segmentation accuracy remains, therefore, an important problem in the automatic creation of digital video and audio libraries, where the content stream is not manually segmented into appropriate stories. Without good segmentation, all other components of a digital video library will be significantly less useful, because the user will not be conveniently able to find desired material in the index.

RELEVANT RESEARCH
[Beeferman97] introduced a new statistical approach to automatically partitioning text into coherent segments. The proposed model enlists both long-range and short-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the model consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated text data. To date, this approach has not been extended to cover non-textual information such as video or audio cues. Beeferman also proposed a new probabilistically motivated error metric, intended to replace precision and recall for appraising segmentation algorithms. We use a modified version of this metric later in this paper.

[Hearst93] introduced the use of text tiles for segmentation of paragraphs by topic. Text tiles are coherent regions that are separated through automatically detected topic shifts. These topic shifts are discovered by comparing adjacent blocks of text, several paragraphs long, for similarity, and applying a threshold. The text tiling approach was initially used to discover dialog topic structure in text, and later modified for use in information retrieval. Unlike the segmentation approach presented here, it was based entirely on the words in the text. Our own preliminary experiments showed that the text tiling approach was not easily adaptable to the problem of combining multiple sources of segmentation information.

Yeung and her colleagues [Yeung96, Yeung96b] used an entirely image based approach to segmentation. Their video storyboard lays out a two dimensional flow of the scenes, indicating where the broadcast returns to the same scene. Video storyboards rely heavily on the detection of similar images in scenes. This is similar to the Video Trails idea presented by [Kobla97]. Video trails use the MPEG encoded features and map them into a three dimensional space.
Clusters of scenes in the same region of this space indicate that there is a return to the same scene after a digression to other scenes. This technique can easily identify the anchor narrating a story, a segment of reporting from the field, and the anchorperson returning in a later scene. The approach is, however, unable to distinguish the fact that the anchorperson is merely reading two consecutive stories, without intervening video. While the story board and video trails ideas are useful, work with large collections of broadcast news shows that text information provides important additional information which should not be ignored. None of the available papers on Video Trails or video storyboards have reported segmentation effectiveness in a quantitative way.

Perhaps the most similar research to that presented here is MITRE's Broadcast News Navigator. The Broadcast News Navigator (BNN) system [Merlino97, Maybury96, Mani96] has concentrated on automatic segmentation of stories from news broadcasts using phrase templates. The BNN system uses a finite state network with 22 states to identify the segmentation structure. In contrast to the approach presented here, the BNN system is heavily tailored towards a specific news show format, namely CNN Prime News. As a result, the system exploits the temporal structure of the show's format, using knowledge about time, such as the fact that a particular CNN logo is displayed at the start of the broadcast. The system also makes extensive use of phrase templates to detect segments. For example, using the knowledge that, in this show, news anchors introduce themselves at the beginning and the end of the broadcast, the system tries to detect phrase templates such as "Hello and welcome", "I'm <person-name>", which signal the introduction to the show. While a great deal of success has been achieved so far using heuristics based on stereotypical features of particular shows (e.g. "still to come on the NewsHour tonight…"), the longer term objective of BNN is to use multi-stream analysis of such features as speaker change detection, scene changes, appearance of music and so forth to achieve reliable and robust story segmentation. The BNN system also aims to provide a deeper level of understanding of story content than is provided through simple full text search, by extracting and identifying, for example, all the named entities (persons, places and locations) in a story.

Our work differs especially from the latter in that we have chosen not to use linguistic cues, such as key phrases, for the analysis of the segmentation. This allows our system to be relatively language independent. We also don't exploit the format of particular shows by looking for known timing of stories, special logos or jingles. However, like the BNN system we do exploit the format provided through the closed-captioned transcript. Unlike the BNN, we use a separately generated speech recognition transcript to provide timing information for each spoken word and to restore missing sections of the transcript that had not initially been captioned. In addition, we make extensive use of timing information to provide a more accurate segmentation. As a result, our segmentation results are relatively accurate at the frame level, which is not possible to achieve without accurate alignment of the transcript words to the video frames.

SOURCES OF INFORMATION FOR SEGMENTATION
The Informedia Digital Video Library System's News-on-Demand Application seeks to exploit information from image processing, closed-captioning, speech recognition and general audio processing. The following describes the features we derive from each of these sources. For each of the sources, there are features the Informedia system actually uses, and features the system could use but does not yet use. These latter features are being investigated, but are not yet sufficiently well understood to be reliably integrated into the production version of the Informedia News-on-Demand library.

We are not exploiting specific timing information or logos, or idiosyncratic structures for particular shows at particular times. Such cues could include detection of the CNN logo, the WorldView logo, the theme music of a show, the face of a particular anchorperson, and similar features specific to particular news show formats. Despite omitting these potentially useful markers, we have successfully applied our segmentation process to shows such as CNN World View, CNN Prime News, CNN Impact, CNN Science and Technology, Nightline, the McNeil-Lehrer News Hour, Earth Matters, and the ABC, CBS and NBC Evening News.

Figure 1: Scene breaks in the story whose transcript appears in Figure 2.

Image Information
The video images contain a great deal of information that can be exploited for segmentation.

Scene breaks. We define a scene break as the editing cut between individual continuous camera shots. Others have referred to these as shot breaks or cuts. Fade and dissolve effects at the boundaries between scenes are also included as
breaks. However, we want to avoid inserting scene breaks when there is merely object motion or some other dramatic visual effect such as an explosion within a single scene. Some of the more image oriented vision research has actually referred to the detection of scene breaks as video segmentation. This view differs dramatically from our interpretation of segmentation as the detection of story boundaries. In general, news stories contain multiple scene breaks, and can range in duration anywhere from 15 seconds to 10 minutes. Scene breaks, on the other hand, appear, on average, at intervals of less than 15 seconds. To detect scene breaks in the Informedia Digital Video Library System, color histogram analysis and Lucas-Kanade optical flow analysis are applied to the MPEG-encoded video [Hauptmann95]. This also enables the software to identify editing effects such as cuts and pans that mark shot changes. An example of the result of this process is shown in Figure 1. A variation of this approach [Taniguchi95] uses a high rate of scene breaks to detect commercials.

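To make the color-histogram idea concrete, the following is a minimal sketch of scene-break hypothesis generation along the lines described above. The histogram bin count, the change threshold, and the representation of the optical-flow check as a precomputed per-transition "smoothness" score are all assumptions for illustration; the Informedia system derives its flow information from Lucas-Kanade analysis of the MPEG video.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """3-D RGB histogram of a decoded frame (H x W x 3 uint8 array), L1-normalized."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist / hist.sum()

def detect_scene_breaks(frames, flow_smoothness=None, hist_threshold=0.4):
    """Hypothesize a scene break wherever the color histogram changes sharply.

    frames          -- list of decoded video frames (numpy arrays)
    flow_smoothness -- optional list with one score in [0, 1] per frame
                       transition; values near 1 mean smooth camera motion
                       (pan/tilt/zoom), which vetoes the break hypothesis
    hist_threshold  -- assumed cutoff on the histogram difference
    """
    breaks, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist = color_histogram(frame)
        if prev_hist is not None:
            change = 0.5 * np.abs(hist - prev_hist).sum()   # 0 = identical, 1 = disjoint
            smooth = flow_smoothness[i - 1] if flow_smoothness else 0.0
            if change > hist_threshold and smooth < 0.5:
                breaks.append(i)        # break between frame i-1 and frame i
        prev_hist = hist
    return breaks
```

The same histogram-difference signal also drives the commercial detector described later, where the rate of detected breaks, rather than their individual positions, is what matters.
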
Black Frames. For technical reasons, commercials are usually preceded and followed by one or more frames that are completely black. When we can detect these blank frames, then we have additional information to aid in the segmentation of the news text. However, blank, or black frames also occur at other points during regular broadcasts. Because of the quality of the MPEG encoded analog signal, it may also not be possible to distinguish a very dark frame from a black frame. [Merlino97] also found black frames to be useful for story segmentation. Like so many of these cues, black frames are not by themselves a reliable indicator of segment boundaries. However, they provide added information to improve the segmentation process.

Frame Similarity. Another source of information is the similarity between different scenes. The anchor, especially, will reappear at intervals throughout a news program, and each appearance is likely to denote a segmentation boundary of some type. The notion of frame similarity across scenes is fundamental to both the Video Trails work [Kobla97] and to video storyboards [Yeung96, Yeung96b]. In the Informedia system we have two different measures of similarity available; each of these measures looks at key frames, a single chosen representative frame for each scene, and compares them for similarity throughout the news broadcast.

1. Color histogram similarity is computed based on the relative distribution of colors across sub-windows in the current keyframe. This color similarity is computed between all pairs of key frames in the show. The key frame that occurs most frequently in the top 25 similarity candidates and has the highest overall similarity score to others is used as the root frame. It and its closest matches will be used as the candidates for segmentation based on image similarity.

2. Face similarity is computed by first using CMU's face detection algorithm [Rowley95]. Any faces in all key frames are detected, and then these faces are compared using the eigenface technique developed at MIT [Pentland94]. Once a matrix of pair-wise similarity coefficients has been computed, we again select the most popular face and its closest matches as candidates for segmentation boundaries.

While each of these two methods is somewhat unreliable by itself, combining evidence from both color histogram and face similarity estimates gives a more reliable indication that we have detected a frame containing an anchor person.

MPEG optical flow for motion estimation: In the MPEG video stream there is a direct encoding of the optical flow within the current images [MPEG-ISO]. This encoded value can be extracted and used to provide information about whether there is camera motion or motion of objects within the scene. Scenes containing movement, for example, may be less likely to occur at story boundaries.

While we have experimented with all these cues for segmentation, in the Informedia News-On-Demand production system, we currently only exploit black frames and scene breaks. Figure 5 shows the various image features as well as corresponding manual segmentation. Not all features correlate well to the segments.

Closed-Captioned Transcripts
The closed caption transcript that is broadcast together with the video includes some useful time information. The transcript, while almost always in upper case, also contains syntactic markers, format changes between advertisements and the continuous news story flow. Finally, useful information can be derived from both the presence and absence of closed captioning text at certain times during the video. Note that this information almost always contains some errors.

For the production Informedia Digital Video Library system, we currently exploit all of these cues in the closed captioned transcript. A sample closed-caption transcript is given below in Figure 2.

Figure 2: A sample closed-caption transcript, showing the time in seconds from the start of the show at which each caption line was received.

1 >>> I'M HILARY BOWKER IN
1 LONDON.
2 CONFRONTING DISASTERS
2 AROUND THE GLOBE.
3 ASIAN SKIES WEIGHED DOWN
3 UNDER A TOXIC BLANKET OF
4 SMOKE.
4 IS IT THE CAUSE OF A PLANE
5 CRASH IN INDONESIA?
6 >> I'M SONIA RUSELER IN
7 WASHINGTON.
13 RARE DESERT FLOODING IN
14 ARIZONA THREATENS FAMILIES
15 AND THEIR HOMES.
20 [CLOSED CAPTIONING
21 PROVIDED BY BELL ATLANTIC,
21 THE HEART OF COMMUNICATION]
22
27 [CLOSED CAPTIONING PERFORMED
28 BY MEDIA CAPTIONING SERVICES
29 CARLSBAD, CA.]
44
44
45
46 >>> AN AIRLINE OFFICIAL IN
47 INDONESIA SAYS HAZE
47 PRODUCED BY RAMPANT FOREST
48 FIRES PROBABLY HAD A ROLE
48 IN AN AIRLINE CRASH THAT
49 KILLED 234 PEOPLE.
52 THE AIRBUS A-300, ON A
53 FLIGHT FROM JAKARTA,
55 CRASHED ON APPROACH TO
55 MEDON.
58 MEDON IS ON THE INDONESIAN
59 ISLAND, SUMATRA, CITE OF
60 MANY OF THE FIRES THAT HAVE
61 SENT A SMOKE CLOUD OVER SIX
62 COUNTRIES.
63 CNN'S MARIA RESSA IS IN THE
64 MALAYSIAN CAPITAL, KUALA
65 LUMPUR, AND FILED THIS
66 REPORT.
71 >>> RESCUERS FOUND ONLY
72 SCATTERED PIECES OF THE
73 AIRBUS 300.

This transcript shows the actual transmitted caption text, as well as the time in seconds from the start of the show when each caption line was received. The particular format is specific to CNN, but similar styles can be found for captioning from ABC, CBS and NBC as well.

There are several things to notice in this closed-captioned transcript. First of all, there are ">>>" markers that indicate topic changes. Speaker turns are marked with ">>" at the beginning of a line. These markers are relatively reliable, but like anything done by people, are subject to errors. Most of these are errors of omission, but occasionally there are also spurious insertions of these discourse markers.

Secondly, notice that the text transcript is incomplete and contains typing errors, although these are relatively few in this example. The transcript also contains text that was not actually spoken (e.g. "[CLOSED CAPTIONING PROVIDED BY BELL ATLANTIC, THE HEART OF COMMUNICATION]").

Figure 3: Sample of a speech recognizer generated transcript for the captions between 46 and 66 seconds in Figure 2. Times are in 10 millisecond frames.

There is a large gap in the text, starting after about 29 seconds and lasting until the 44-second mark. During this gap, the captioner transcribed no speech, and a short commercial was aired advertising CNN and highlights of the rest of the evening's shows. Although it is not visible in the transcript, closed-captioned transcripts lag behind the actually spoken words by an average of 8 seconds. This delay varies anywhere between 1 and 20 seconds across shows. Surprisingly, rebroadcasts of shows with previously transcribed closed-captioning have been observed at times to have the closed-captioning occur before the words were actually spoken on the broadcast video. Exceptional delays can also occur when there are unflushed words in the transcription buffer, which may only be displayed after a commercial, even though the words were spoken before the commercial.

Finally, in the captions shown in Figure 2, there are formatting cues in the form of blank lines after the break at 44 and 45 seconds into the program, before captioning resumes at 46 seconds.

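As a small illustration of how these caption cues can be read off the raw transcript, the sketch below parses time-stamped caption lines and flags the ">>>" topic markers and ">>" speaker turns described above. The exact field layout of the caption files is an assumption modeled on Figure 2; the real Informedia decoder output may differ.

```python
import re
from dataclasses import dataclass

@dataclass
class CaptionLine:
    seconds: int        # time the caption line was received, from show start
    text: str
    topic_change: bool  # line starts with ">>>"
    speaker_turn: bool  # line starts with ">>" but not ">>>"

def parse_caption_transcript(lines):
    """Parse time-stamped closed-caption lines like those shown in Figure 2."""
    parsed = []
    for raw in lines:
        m = re.match(r"\s*(\d+)\s*(.*)$", raw)
        if not m:
            continue                         # skip malformed lines
        seconds, text = int(m.group(1)), m.group(2).strip()
        parsed.append(CaptionLine(
            seconds=seconds,
            text=text,
            topic_change=text.startswith(">>>"),
            speaker_turn=text.startswith(">>") and not text.startswith(">>>"),
        ))
    return parsed

# Example with a few caption lines from Figure 2.
sample = ["1 >>> I'M HILARY BOWKER IN", "1 LONDON.", "6 >> I'M SONIA RUSELER IN"]
for line in parse_caption_transcript(sample):
    print(line.seconds, line.topic_change, line.speaker_turn, line.text)
```
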
AUDIO INFORMATION
Speech Recognition for Alignment
To make video documents retrievable with the type of search available for text documents, where one can jump directly to a particular word, one needs to know exactly when each word in the transcript is spoken. This information is not available from the closed captioning and must be derived by other means. A large vocabulary speech recognition system such as Sphinx-II can provide this information [Hwang94]. Given exact timings for a partially erroneous transcription output by the speech recognition system, one can align the transcript words to the precise location where each word was spoken, within 10 milliseconds.

To keep the processing speeds near real time for Informedia, the speech recognition is done with a narrow beam version of the recognizer which only considers a subset of the possible recognition hypotheses at any point in an utterance, resulting in less than optimal performance. This performance is, however, still sufficient for alignment with a closed-captioned transcript. Methods for improving the raw speech recognition accuracy when captioned transcripts are available before recognition are outlined in [Placeway96].

The basic problem for alignment is to take two strings (or streams) of data, where sections of the data match in both strings and other sections do not. The alignment process tries to find the best way of matching up the two strings, allowing for pieces of data to be inserted, deleted or substituted, such that the resulting paired string gives the best possible match between the two streams. The well-known Dynamic Time Warping procedure (DTW) [Nye84] will accomplish this with a guaranteed least cost distance for two text strings. Usually the cost is simply measured as the total number of insertions, deletions and substitutions required to make the strings identical.

In Informedia, using a good quality transcript and a speech recognition transcript, the words in both transcripts are aligned using this dynamic time warping procedure. The time stamps for the words in the speech recognition output are simply copied onto the clean transcript words with which they align. Since misrecognized word suffixes are a common source of recognition errors, the distance metric between words used in the alignment process is based on the degree of initial sub-string match. Even for very low recognition accuracy, this alignment with an existing transcript provides sufficiently accurate timing information.

The Informedia system uses this information to aid in segmentation, allowing more accurate segment boundary detection than would be possible merely by relying on the closed captioning text and timings.

The actual speech recognition output for a portion of the story in Figure 2 is shown in Figure 3. An example alignment of the closed-captions to the speech recognition transcript is shown in Figure 4.

Figure 4: Sample alignment of the closed captions and the speech recognition in the previous examples.

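The following is a minimal sketch of this caption-to-recognizer alignment: a standard edit-distance dynamic program whose substitution cost is based on the shared word prefix, with recognizer start times copied onto the caption words they align with. The insertion and deletion costs, the prefix normalization, and the toy data (loosely modeled on Figures 2 and 3) are assumptions, not the actual Informedia parameters.

```python
def word_distance(a, b):
    """Mismatch cost from the shared prefix, so misrecognized suffixes
    ('FIRE' vs 'FIRES') are penalized only lightly."""
    a, b = a.lower(), b.lower()
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    return 1.0 - prefix / max(len(a), len(b), 1)

def align_captions_to_recognizer(caption_words, recognized):
    """DTW/edit-distance alignment of caption words to recognizer output.

    caption_words -- list of caption word strings
    recognized    -- list of (word, start_time) pairs from the recognizer
    Returns one start time per caption word (None if it aligned to nothing).
    """
    n, m = len(caption_words), len(recognized)
    INS = DEL = 1.0                       # assumed gap costs
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * DEL, 'del'
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * INS, 'ins'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i-1][j-1] + word_distance(caption_words[i-1], recognized[j-1][0])
            cost[i][j], back[i][j] = min([(sub, 'sub'),
                                          (cost[i-1][j] + DEL, 'del'),
                                          (cost[i][j-1] + INS, 'ins')])
    # Trace back, copying recognizer start times onto matched caption words.
    times, i, j = [None] * n, n, m
    while i > 0 or j > 0:
        step = back[i][j]
        if step == 'sub':
            times[i-1] = recognized[j-1][1]
            i, j = i - 1, j - 1
        elif step == 'del':
            i -= 1
        else:
            j -= 1
    return times

# Toy example in the spirit of Figures 2-4 (times in 10 ms frames).
caps = "AN AIRLINE OFFICIAL IN INDONESIA SAYS HAZE".split()
rec = [("AN", 3914), ("AIRLINE", 3929), ("OFFICIALLY", 3971), ("ENDED", 4027),
       ("ASIA'S", 4056), ("SAYS", 4093), ("HEY", 4122)]
print(list(zip(caps, align_captions_to_recognizer(caps, rec))))
```
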
Figure 5: Image Features for Segmentation. The manually discovered segments (1) are at the top, and aligned underneath are MPEG optical flow (2), scene breaks (3), black frames (4), all detected faces (5), similar color image features (6), and similar faces (7).
OTHER AUDIO FEATURES
Amplitude. Looking at the maximal amplitude in the audio signal within a window of one second is a good predictor of changes in stories. In particular, quiet parts of the signal are correlated with new story segments.

Silences. There are several ways to detect silences in the audio track. In the amplitude signal described above, very small values of the maximum segment amplitude suggest a silence. Low values of the total power over the same one second window also indicate a silence. Alternatively, one can use the silences detected by the speech recognizer, which explicitly models and detects pauses in speech by using an acoustic model for a "silence phone" [Hwang94]. Silences are also conveniently detected by the CMUseg Audio Segmentation package [CMUseg97].

Acoustic Environment Change. Changes in background noise, recording channel (telephones and high quality microphones, for example, produce signals with distinct spectral qualities), or speaker changes can cause long term changes in the spectral composition of the signal. By classifying acoustically similar segments into a few basic types, the location of these changes can be identified. This segmentation based on acoustic similarity can also be performed by the CMUseg package.

Signal-to-Noise Ratio (SNR). The signal to noise ratio attempts to capture some of the effects of the acoustic environment by measuring the relative power in the two major spectral peaks in a signal. While there are a number of ways to compute the SNR of an acoustic signal, none of them perfect, we have used the approach to SNR computation described in [Hauptmann97] with a window size of .25 seconds.

To date we have only made informal attempts to include this audio signal data in our segmentation heuristics. We will report the results of this effort in a future paper.

Figure 6 shows these audio features as they relate to perfect (human) segmentation.

We fully exploit the speech recognition transcript and the silences identified in the transcript for segmentation. Some of the other features computed by CMUseg are also used for adapting the speech recognition (SR) to its acoustic environment, and for segmenting the raw signal into sections that the recognizer can accommodate. This segmentation is not topic or news story based, but instead simply identifies short "similar" regions as units for SR. These segments are indicative of short "phrase or breath groups", but are not yet used in the story segmentation processing.

METHOD
This section will describe the use of image, acoustic, text and timing features to segment news shows into stories and to prepare those stories for search.

DETECTING COMMERCIALS USING IMAGE FEATURES
Although there is no single image feature that allows one to tell commercials from the news content of a broadcast, we have found that a combination of simple features does a passable job. The two features used in the simplest version of the commercial detector are the presence of black frames, and the rate of scene changes.

Black frames are frequently broadcast for a fraction of a second before, after, and between commercials, and can be detected by looking for a sequence of MPEG frames with low brightness. Of course, black frames can occur for other reasons, including a directorial decision to fade to black or during video shot outdoors at night. For this reason, black frames are not reliably associated with commercials.

Because advertisers try to make commercials seem more interesting by rapidly cutting between different shots, sections of programming with commercials tend to have more scene breaks than are found in news stories. Scene breaks are computed by the Informedia system as a matter of course, to enable generation of the key frames that represent a salient section of a story when results are presented from a search, and for the film strip view [Figure 1] that visually summarizes a story. These scene changes are detected by hypothesizing a break when the color histogram changes rapidly over a few frames, and rejecting that hypothesis if the optical flow in the image does not show a random pattern. Pans, tilts and zooms, which are not shot breaks, cause color histogram changes, but have smooth optical flow fields associated with them.

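A minimal sketch of how these two raw signals, black-frame events and scene-change rate, might be extracted is given below; these are the inputs to the heuristic that follows. The brightness threshold, the minimum run length, and the windowing scheme are illustrative assumptions rather than the values used in the Informedia detector.

```python
import numpy as np

def black_frame_events(frames, brightness_threshold=20, min_run=2):
    """Return indices where a run of very dark frames starts.

    frames               -- list of decoded frames (H x W x 3 uint8 arrays)
    brightness_threshold -- assumed mean-pixel-value cutoff for "black"
    min_run              -- assumed minimum number of consecutive dark frames
    """
    events, run_start = [], None
    for i, frame in enumerate(frames):
        dark = frame.mean() < brightness_threshold
        if dark and run_start is None:
            run_start = i
        elif not dark and run_start is not None:
            if i - run_start >= min_run:
                events.append(run_start)
            run_start = None
    if run_start is not None and len(frames) - run_start >= min_run:
        events.append(run_start)
    return events

def scene_change_rate(scene_breaks, window_frames, num_frames):
    """Scene changes per fixed-size window, as a rough 'rapid cutting' signal.

    Windows whose rate is well above the show average are commercial
    candidates in the heuristic described below.
    """
    rate = np.zeros(num_frames // window_frames + 1)
    for b in scene_breaks:
        rate[b // window_frames] += 1
    return rate
```
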
Figure 6: Audio Features for Segmentation. At the top are the manually found segments (1), followed by silences based on spectral analysis (2), speech recognition segments (3), silences in speech (4), signal-to-noise ratio (5), and maximum amplitude (6).
These two sources of information are combined in the following, rather ad hoc heuristic:

1. Probable black frame events and scene change events are identified.

2. Sections of the show that are surrounded by black frames separated by less than 1.7 times the mean distance between black frames in the show, and that have sections meeting that criterion on either side, are marked as probably being commercials on the basis of black frames.

3. Sections of the show that are surrounded by shot changes separated by less than the mean distance between shot changes for the whole show, and are surrounded by two sections meeting the same criterion on either side, are marked as probably occurring in commercials on the basis of shot change rate.

4. Initially, a commercial is hypothesized over a period if either criterion 2 or 3 is met.

5. Short hypothesized stories, defined as non-commercial sections less than 35 seconds long, are merged with their following segment. Then short hypothesized ads, less than 28 seconds long, are merged into the following segment.

6. Because scene breaks are somewhat less reliably detected at the boundaries of advertising sections, black frame occurrences are used to "clean up" boundaries. Hypothesized starts of advertisements are moved to the time of any black frame occurring up to 4.5 seconds before. Hypothesized ends of advertisements are moved to the time of any black frame appearing up to 4.5 seconds after the hypothesized time.

7. Because scene changes can also occur rapidly in non-commercial content, after the merging steps described above, any "advertising" section containing no black frames at all is relabeled as news.

8. Finally, as a sort of inverse of step 6, because some sections of news reports, and in particular weather reports, have rapid scene changes, transitions into and out of advertisements may only be made on a black frame, if at all possible. If a commercial does not begin on a black frame, its start is adjusted to any black frame within 90 seconds after its start and preceding its end. Commercial ends are moved back to preceding black frames in a similar manner.

Figure 7 shows how this process is used to detect the commercials in an example CNN news broadcast.

Figure 7: The commercial detection code in Informedia combines image features to hypothesize commercial locations. The bottom two graphs show the raw signals for black frames and scene changes respectively. The graph at the top shows the hand-identified positions of the commercials. The graphs running from bottom to top show successive stages of processing. The numbers in parentheses correspond to the steps outlined in the text.

DETERMINING STORY BOUNDARIES
The story boundaries are found in a process of many steps, which include the commercial detection process outlined above:

1. The time-stamped lines of the closed captioned text are first normalized. This normalization removes all digit sequences in the text and maps them into typical number expressions, e.g. 8 becomes eight, 18 becomes eighteen, 1987 becomes nineteen eighty-seven, etc. In order to be able to better match the text to the speech, common abbreviations and acronyms are also transformed into a normalized form (Dr. becomes doctor, IBM becomes I. B. M., etc.) that corresponds to the output style of the speech recognition transcript. In the example in Figure 2, the number "234" at 49 seconds would be transformed into "two hundred and thirty-four", "A-300" at 52 seconds would be transformed into "A three hundred", and other conversions would be done similarly.

2. The captioned transcript is then examined for obvious pauses that would indicate uncaptioned commercials. If there is a time gap longer than a threshold value of 15 seconds in the closed caption transmission, this gap is labeled as a possible commercial and a definite story segmentation boundary. Similarly, if there are multiple blank lines (three or more in the current implementation), a story boundary is presumed at that location. In Figure 2 above, story segmentation boundaries would be hypothesized at 0, 29, and 44 seconds. (A small sketch of this test appears after this list.)

3. Next the image based commercial detection code described in the previous section is used to hypothesize additional commercial boundaries.

4. The words in the transcript are now aligned with the words from the speech recognition output, and timings transferred from the speech recognizer words to the transcript words. After this alignment each word in the transcript is assigned as accurate a start and end time as possible based on the timing found in the speech.

5. Speech recognition output is inserted into all regions where it is available, and where there were no captioned text words received for more than 7 seconds.

6. A story segment boundary is assumed at all previously determined boundaries as well as at the times of existing story break markers (">>>") inside the caption text.

7. Empty story segments without text are removed from the transcripts. Boundaries within commercials are also removed, creating a single commercial segment from multiple sequential commercials.

8. Each of the resulting segments is associated with a frame number range in the MPEG video, using the precise speech recognition time stamps, and the corresponding text, derived from both captions and inserted speech recognition words, is assigned to the segment for indexing.


RESULTS
The actual automatic segmentation results for the data presented above are shown in Figure 8, with the manually generated reference transcript shown at the top.

Figure 8: A comparison of manual segmentation with the automatic segmentation described in this paper shows very high correspondence between the segment markers.

One metric for segmentation proposed at the recent Topic Detection and Tracking Workshop [Yamron97, Beeferman97] is based on the probability of finding a segment boundary between two randomly chosen words. The error probability is a weighted combination of two parts, the probability of a false alarm and the probability of a missed boundary. The error probabilities are defined as:

P_{Miss} = \frac{\sum_{i=1}^{N-k} \delta_{hyp}(i, i+k) \cdot (1 - \delta_{ref}(i, i+k))}{\sum_{i=1}^{N-k} (1 - \delta_{ref}(i, i+k))}

P_{FalseAlarm} = \frac{\sum_{i=1}^{N-k} (1 - \delta_{hyp}(i, i+k)) \cdot \delta_{ref}(i, i+k)}{\sum_{i=1}^{N-k} \delta_{ref}(i, i+k)}

where the summations are over all the words in the broadcast and where:

\delta(i, j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are from the same story} \\ 0 & \text{otherwise} \end{cases}

The choice of k is a critical consideration in order to produce a meaningful and sensitive evaluation. Here it is set to half the average length of a story. (A computational sketch of this metric is given after Table 1 below.)

We asked three volunteers to manually split 13 TV broadcast news shows at the appropriate story boundaries according to the following instructions:

"For each story segment, write down the frame number of the segmentation break, as well as the type with which you would classify this segment. The types are:

P "Beginning of PREVIEW". The beginning of a news show, in which the news anchors introduce themselves and give the headline of one or more news stories. If each of 3 anchors had introduced 2 stories in sequence, there would be 6 beginning of preview markers.

T "Beginning of searchable content: TOPIC". This is the most typical segment boundary that we expect to be useful to an Informedia user. Every actual news story should be marked with such a boundary. Place the marker at the beginning of news stories, together with the frame number at that point. Do include directly preceding words, like "Thanks, Jack. And now for something completely different. In Bosnia today …"

S "Beginning of searchable content: SUBTOPIC". These subtopics mark the boundaries between different segments within the same news story. As a rule, whenever the anchorperson changes but talks about the same basic issue or topic as in the last segment, this is a subtopic. These subtopics are usually preceded by a phrase like "And now more from Bernard Shaw at the White House". Keep the "And now more from Bernard Shaw…" in the last segment.

C "Beginning of COMMERCIAL". This type of segment covers all major commercials with typical duration of 30 or 45 seconds up to one minute. The category also covers smaller promotional clips such as "World View is brought to you in part by Acme Punctuation Products" or "This is CNN!""

For evaluation purposes, we compared the set of 749 manually segmented stories from 13 CNN news broadcasts with the stories segmented from the same broadcasts by the Informedia segmentation process described above. The only modification to the manual segmentation done according to the instructions above was that multiple consecutive commercials were grouped together as one commercial block. On these 13 news broadcasts, the automatic segmentation system averaged 15.16% incorrect segmentation according to this metric. By comparison, human-human agreement between the 3 human segmenters averaged 10.68% error. The respective false alarm and miss rates are also shown in Table 1.

                                   P(err)    P(FalseAlarm)   P(Miss)
AutoSegment                        15.16%        7.96%        26.9%
Inter-Human Comparison             10.68%        8.26%        15.35%
No Segmentation                    36.91%        0.0%         99.78%
Segment every second               62.99%       99.99%         0.0%
Segment every 1180 frames          60.86%       91.08%         9.32%
(Average Story Length)

Table 1: Performance of the Automatic Segmentation Procedure on evening news shows. Average human segmentation performance is given for comparison. The results for no segmentation, segmentation every second and segmentation into fixed-width blocks corresponding to the average reference story length are given for reference.

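The two error components reported in Table 1 follow directly from the formulas above. The sketch below computes them for segmentations represented as a story label per word position, so that delta(i, j) is 1 exactly when positions i and j carry the same label; this representation, and the omission of the final weighted combination into a single P(err) (the weights are not given here), are assumptions of the illustration.

```python
def segmentation_error(ref_story_ids, hyp_story_ids, k):
    """P(Miss) and P(FalseAlarm) for the word-pair metric defined above.

    ref_story_ids, hyp_story_ids -- story label for each word position
    k -- comparison offset, typically half the average reference story length
    """
    n = len(ref_story_ids)
    assert len(hyp_story_ids) == n
    miss_num = miss_den = fa_num = fa_den = 0
    for i in range(n - k):
        same_ref = ref_story_ids[i] == ref_story_ids[i + k]
        same_hyp = hyp_story_ids[i] == hyp_story_ids[i + k]
        if not same_ref:
            miss_den += 1
            if same_hyp:
                miss_num += 1      # reference boundary missed by the hypothesis
        else:
            fa_den += 1
            if not same_hyp:
                fa_num += 1        # spurious hypothesized boundary
    p_miss = miss_num / miss_den if miss_den else 0.0
    p_false_alarm = fa_num / fa_den if fa_den else 0.0
    return p_miss, p_false_alarm

# Tiny illustration: two four-word stories, hypothesis splits one word too early.
ref = [0] * 4 + [1] * 4
hyp = [0] * 3 + [1] * 5
print(segmentation_error(ref, hyp, k=2))
```
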
DISCUSSION
Unfortunately, these results cannot be directly compared with either the results in [Brown95] or [Merlino97]. Brown et al. used a criterion of recall and precision for information retrieval. This was only possible with respect to a set of information retrieval queries, and given the existence of a human relevance judgement for every query against every document. In our study, the segmentation effectiveness was evaluated on its own, and we have yet to evaluate its effect on the effectiveness of information retrieval.

[Merlino97] reported a very limited experiment in story detection with the Broadcast News Navigator. In effect, they only reported whether the main news stories were segmented or not, ignoring all minor news stories, commercials, previews, etc. Thus their metrics reflect the ability of the Broadcast News Navigator System to detect core news stories (corresponding only to segments labeled T and S by our human judges). The metrics for the BNN also ignored the timing of the story. Thus they did not take into account whether the detected story began and ended at the right frame compared to the reference story.

We have achieved very promising results with automatic segmentation that relies on video, audio and closed-captioned transcript sources. The remaining challenges include:

• The full integration of all the available audio and image features in addition to the text features. While we have discussed how various features could be useful, we have not yet been able to fully integrate all of them.

• Gaining the ability to automatically train segmentation algorithms such as the one described here and to learn similar or improved segmentation strategies from a limited set of examples. As different types of broadcast are segmented, we would like the system to automatically determine relevant features and exploit them.

• Completely avoiding the use of the closed-captioned transcripts for segmentation. While the closed-captioned transcripts provide a good source of segmentation information, there is much data that is not captioned. We would like to adapt our approach to work without the captioned text, relying entirely on the speech recognizer transcription, the audio signal and the video images.

In the near term we plan to use the EM [Dempster77] algorithm to combine many features into one segmentation strategy, and to learn segmentation from data for which only a fraction has been hand-labeled. Work is also currently underway in the Informedia project to evaluate the effectiveness of the current segmentation approach when closed-captioning information is not available.

CONCLUSION
The current results provide a baseline performance figure, demonstrating what can be done with automatic methods when the full spectrum of information is available from speech, audio, image and closed-caption transcripts. The initial subjective reaction is that the system performs quite well in practice using the current approach. The future challenge lies in dealing with uncaptioned, speech-transcribed data, since the speech recognition generated transcript contains a significant word error rate.

The adequacy of segmentation depends on what you need to do with the segments. We are now in a position to evaluate the effectiveness of our segmentation process with respect to information retrieval, story tracking, or information extraction into semantic frames. Some approaches from the information retrieval literature [Kaszkiel97] claim that overlapping windows within an existing document can improve the accuracy of the information retrieval. It remains for future work to determine if a modification of this technique can circumvent the problem of static segmentation in the broadcast news video domain.

Segmentation is an important, integral part of the Informedia digital video library. The success of the Informedia project hinges on two critical assumptions: that we can extract sufficiently accurate speech recognition transcripts from the broadcast audio and that we can segment the broadcast into video paragraphs (stories) that are useful for information retrieval. In previous papers [Hauptmann97, Witbrock97, Witbrock98], we have shown that speech recognition is sufficient for information retrieval of pre-segmented video news stories. In this paper we have now addressed the issue of segmentation and demonstrated that a fully automatic system can successfully extract story boundaries using available audio, video and closed-captioning cues.

ACKNOWLEDGMENTS
This paper is based on work supported by the National Science Foundation, DARPA and NASA under NSF Cooperative agreement No. IRI-9411299. We thank Justsystem Corporation for supporting the preparation of the paper. Many thanks to Doug Beeferman, John Lafferty and Dimitris Margaritis for their manual segmentation and the use of their scoring program.

REFERENCES
[Beeferman97] Beeferman, D., Berger, A., and Lafferty, J., Text segmentation using exponential models. In Proc. Empirical Methods in Natural Language Processing 2 (AAAI) '97, Providence, RI, 1997.
[Brown95] Brown, M. G., Foote, J. T., Jones, G. J. F., Spärck-Jones, K. and Young, S. J., Automatic Content-Based Retrieval of Broadcast News, ACM Multimedia-95, p. 35-42, San Francisco, CA, 1995.
[CMUseg97] CMUseg, Carnegie Mellon University Audio Segmentation Package, ftp://jaguar.ncsl.nist.gov/pub/CMUseg_0.4a.tar.Z, 1997.
[Dempster77] Dempster, A., Laird, N., Rubin, D., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39, 1, pp. 1-38, 1977.
[Grice75] Grice, H. P. Logic and Conversation. In P. Cole (ed.) Syntax and Semantics. Vol. 3. New York: Academic Press, 41-58, 1975.
[Hampapur94] Hampapur, A., Jain, R., and Weymouth, T., Digital Video Segmentation, ACM Multimedia 94, pp. 357-364, ACM Int'l Conf on Multimedia, 15-20 Oct. 1994, San Francisco, CA.
[Hauptmann95] Hauptmann, A.G. and Smith, M.A. Text, Speech and Vision for Video Segmentation: the Informedia Project. AAAI Fall Symposium on Computational Models for Integrating Language and Vision, Boston MA, Nov 10-12, 1995.
[Hauptmann95b] Hauptmann, A.G., Witbrock, M.J., Rudnicky, A.I., and Reed, S. Speech for Multimedia Information Retrieval, UIST-95 Proceedings of the User Interface Software Technology Conference, Pittsburgh, November 1995.
[Hauptmann97] Hauptmann, A.G. and Witbrock, M.J., Informedia News on Demand: Multimedia Information Acquisition and Retrieval, in Maybury, M.T., Ed., Intelligent Multimedia Information Retrieval, AAAI Press/MIT Press, Menlo Park, 1997.
[Wactlar96] Wactlar, H.D., Kanade, T., Smith, M.A. and Stevens, S.M., Intelligent Access to Digital Video: Informedia Project. IEEE Computer, 29 (5), 46-52, May 1996. See also http://www.informedia.cs.cmu.edu/.
[Hauptmann97b] Hauptmann, A.G., Witbrock, M.J. and Christel, M.G. Artificial Intelligence Techniques in the Interface to a Digital Video Library, Proceedings of the CHI-97 Computer-Human Interface Conference, New Orleans LA, March 1997.
[Hearst93] Hearst, M.A. and Plaunt, C., Subtopic structuring for full-length document access, in Proc. ACM SIGIR-93 Int'l Conf. on Research and Development in Information Retrieval, pp. 59-68, Pittsburgh PA, 1993.
[Hwang94] Hwang, M., Rosenfeld, R., Thayer, E., Mosur, R., Chase, L., Weide, R., Huang, X., and Alleva, F., "Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in SPHINX-II." ICASSP-94, vol. I, pp. 549-552, 1994.
[Kaszkiel97] Kaszkiel, M. and Zobel, J., Passage Retrieval Revisited, pp. 178-185, SIGIR-97 Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, PA, July 27-31, 1997.
[Kobla97] Kobla, V., Doermann, D., and Faloutsos, D., Video Trails: Representing and Visualizing Structure in Video Sequences, ACM Multimedia 97, Seattle, WA, Nov. 1997.
[Mani96] Mani, I., House, D., Maybury, M. and Green, M. "Towards Content-Based Browsing of Broadcast News Video", in Maybury, M. T. (editor), Intelligent Multimedia Information Retrieval, 1997.
[Maybury96] Maybury, M., Merlino, A., and Rayson, J., "Segmentation, Content Extraction and Visualization of Broadcast News Video using Multistream Analysis", in Proceedings of the ACM International Conference on Multimedia, Boston, MA, 1996.
[Merlino97] Merlino, A., Morey, D., and Maybury, M., Broadcast News Navigation using Story Segmentation, ACM Multimedia 1997, November 1997.
[MPEG-ISO] International Standard ISO/IEC-CD-11172 Information Technology - Coding of Moving Pictures & Associated Audio for Digital Storage, International Standards Organization.
[Nye84] Nye, H. The Use of a One Stage Dynamic Programming Algorithm for Connected Word Recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. AASP-32, No. 2, pp. 262-271, April 1984.
[Pentland94] Pentland, A., Moghaddam, B., and Starner, T. View-Based and Modular Eigenspaces for Face Recognition, IEEE Conference on Computer Vision & Pattern Recognition, Seattle, WA, July 1994.
[Placeway96] Placeway, P. and Lafferty, J., "Cheating with Imperfect Transcripts", ICSLP-96 Proceedings of the 1996 International Conference on Spoken Language Processing, Philadelphia, PA, October 1996.
[Rowley95] Rowley, H., Baluja, S. and Kanade, T., Human Face Detection in Visual Scenes. Carnegie Mellon University, School of Computer Science Technical Report CMU-CS-95-158, Pittsburgh, PA, 1995.
[Taniguchi95] Taniguchi, Y., Akutsu, A., Tonomura, Y., and Hamada, H., An Intuitive and Efficient Access Interface to Real-time Incoming Video based on automatic indexing, ACM Multimedia-95, p. 25-33, San Francisco, CA, 1995.
[Witbrock97] Witbrock, M.J. and Hauptmann, A.G. Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents, DL97, The Second ACM International Conference on Digital Libraries, Philadelphia, July 23-26, 1997.
[Witbrock98] Witbrock, M.J., and Hauptmann, A.G., Speech Recognition in a Digital Video Library, Journal of the American Society for Information Science (JASIS), 1998, in press.
[Yamron97] Yamron, J. "Topic Detection and Tracking: Segmentation Task", Topic Detection and Tracking (TDT) Workshop, 27-28 October 1997, College Park, MD. Also in BNTUW-98, Proceedings of the Broadcast News Transcription and Understanding Workshop, Leesburg, VA, February 1998.
[Yeung96] Yeung, M., and Yeo, B.-L., "Time-constrained Clustering for Segmentation of Video into Story Units", in International Conference on Pattern Recognition, August 1996.
[Yeung96b] Yeung, M., Yeo, B.-L., and Liu, B., "Extracting Story Units from Long Programs for Video Browsing and Navigation", in International Conference on Multimedia Computing and Systems, June 1996.
[Zhang95] Zhang, H.J., Low, C.Y., Smoliar, S.W., and Wu, J.H., Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution, ACM Multimedia-95, p. 15-24, San Francisco, CA, 1995.
