0% found this document useful (0 votes)

38 views13 pages

Character Identification in Feature-Length Films Using Global Face-Name Matching

Character Identification in feature-length films poses a significant challenge to computer methods. We attempt to do a global matching between names and clustered face tracks. The proposed framework is able to create a new experience on character-centered film browsing.

Uploaded by

Aindrila Datta Srivastava

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

38 views13 pages

Character Identification in Feature-Length Films Using Global Face-Name Matching

Uploaded by

Aindrila Datta Srivastava

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

1276 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO.

7, NOVEMBER 2009

Character Identification in Feature-Length Films

Using Global Face-Name Matching
Yi-Fan Zhang, Student Member, IEEE, Changsheng Xu, Senior Member, IEEE, Hanqing Lu, Senior Member, IEEE,
and Yeh-Min Huang, Member, IEEE

Abstract—Identification of characters in films, although very in- Character identification in feature-length films, although very
tuitive to humans, still poses a significant challenge to computer intuitive to humans, still poses a significant challenge to com-
methods. In this paper, we investigate the problem of identifying puter methods. This is due to the fact that characters may show
characters in feature-length films using video and film script. Dif-
ferent from the state-of-the-art methods on naming faces in the variation of their appearance including scale, pose, illumination,
videos, most of which used the local matching between a visible expression and wearing in a film. People recognition based on
face and one of the names extracted from the temporally local video their faces is a well-known difficult problem [1]; meanwhile,
transcript, we attempt to do a global matching between names and giving identities to the recognized faces also needs to tackle the
clustered face tracks under the circumstances that there are not ambiguity of identities. The objective of this work is to iden-
enough local name cues that can be found. The contributions of
our work include: 1) A graph matching method is utilized to build tify the faces of characters in the film and label them with their
face-name association between a face affinity network and a name names. Based on our work, users can easily use the name as
affinity network which are, respectively, derived from their own a query to select the characters of interest and view the related
domains (video and script). 2) An effective measure of face track video clips. This character-centered browsing is able to not only
distance is presented for face track clustering. 3) As an application, bring us a new viewing experience, but also provide an alterna-
the relationship between characters is mined using social network
analysis. The proposed framework is able to create a new experi- tive for video summarization and digestion.
ence on character-centered film browsing. Experiments are con- In this paper, we present a novel approach for character iden-
ducted on ten feature-length films and give encouraging results. tification in feature-length films. In films, the names of charac-
Index Terms—Face identification, movie analysis, social network ters seldom directly appear in the subtitle, while the film script
analysis, video browsing. which contains names does not have time stamps to align with
the video. There are not enough temporally local name cues that
can be found for local face-name matching. Hence, we attempt
I. INTRODUCTION to do a global matching between the faces detected from the
video and the names extracted from the film script, which is dif-
ferent from the state-of-the-art methods on naming faces in the
ITH the flourishing development of the movie industry,
W a huge amount of movie data is being generated ev-
eryday. It becomes very important for a media creator or distrib-
videos. Based on the results of character identification, an ap-
plication for character-centered film browsing is also presented
which allows users to use the name as a query to search related
utor to provide better media content description, indexing and
video clips and digest the film content.
organization, so that users can easily browsing, skimming and
retrieving the content of interest. In a film, characters are the A. Related Work
focus center of interests from the audience. Their occurrences The crux of the problem on associating faces with names is
provide meaningful presentation of the video content. Hence, to exploit the relations between videos or images and the as-
characters are one of the most important content to be indexed, sociated texts in order to label the faces with names under less
and thus character identification becomes a critical step on film or even no manual intervention. Extensive research efforts have
semantic analysis. been concentrated on this problem. Name-it [2] is the first pro-
posal on face-name association in news videos based on the
co-occurrence between the detected faces and names extracted
Manuscript received August 04, 2008; revised April 12, 2009. First published from the video transcript. A face is labeled with the name which
August 18, 2009; current version published October 16, 2009. This work was
supported in part by the National Natural Science Foundation of China (Grant frequently co-occurs with it. Named Faces system [3] built a
No. 60833006) and in part by the Beijing Municipal Laboratory of Multimedia database of named faces by recognizing the people names over-
and Intelligent Software Technology. The associate editor coordinating the re-
laid on the video frames using video optical character recogni-
view of this manuscript and approving it for publication was Dr. Jie Yang.
Y.-F. Zhang, C. Xu, and H. Lu are with the National Lab of Pattern Recog- tion (VOCR). Yang et al. [4], [5] employed the closed caption
nition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, and speech transcript, and built models for predicting the prob-
China, and also with the China-Singapore Institute of Digital Media, Singapore,
ability that a name in the text matches to a face on the video
119615 (e-mail: [email protected]; [email protected]; [email protected].
cn). frame. They improved their methods in [6] by using multiple
Y.-M. Huang is with the National Cheng-Kung University, Tainan, Taiwan instance learning for partial labeled faces to reduce the effort of
(e-mail: [email protected]). collecting data by users. In [7], the speech transcript was also
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. used to find people frequently appearing in the news videos.
Digital Object Identifier 10.1109/TMM.2009.2030629 Similarly, for face identification in news images, the problem
1520-9210/$26.00 © 2009 IEEE
ZHANG et al.: CHARACTER IDENTIFICATION IN FEATURE-LENGTH FILMS USING GLOBAL FACE-NAME MATCHING 1277

Fig. 1. Framework of character identification using global face-name matching.

was also addressed as clustering or classifying the faces to the cross-linguistic problem and the inconsistencies between subti-
people specific appearance models, supervised by the name cues tles and scripts may bring more difficulties in alignment.
extracted from the image captions or associated news articles
[8]–[10]. B. Overview of Our Approach
Although some of the methods showed promising face iden- In a film, the interactions among the characters resemble them
tification results in the video, most of them are used in news into a relationship network, which makes a film be treated as a
videos which can easily get candidate names for the faces from small society [21]. Every character has his/her social position
the simultaneous appearing captions or temporally local tran- and keeps a certain relationship with others. In the video, faces
scripts. Unlike the news videos which are presented by the third can stand for characters and the co-occurrence of the faces in a
person such as the anchor or the reporter, in the films, the names scene can represent an interaction between characters. Hence,
of characters are seldom directly appearing in the subtitle, which the statistical properties of faces can preserve the mutual rela-
makes it difficult to get the local name cues. Hence, many ef- tionship in the character network. As the same way in the film
forts on film analysis were devoted to major characters detection script, the spoken lines of different characters appearing in the
or automatic cast listing but not assigning real names to them. same scene also represents an interaction. Thus, the names in
Arandjelovic and Zisserman [11] used face image as a query front of the spoken lines can also build a name affinity net-
to retrieve particular characters. Affine warping and illumina- work. Both the statistical properties of the faces and the names
tion correcting were utilized to alleviate the effects of pose and motivate us to seek a correspondence between the face affinity
illumination variations. In [12], multiple face exemplars were network and the name affinity network. The name affinity net-
obtained from face tracks to improve the face matching results. work can be straightforwardly built from the script. For the
For automatic cast listing, faces of the major characters in a fea- face affinity network, we first detect face tracks in the video
ture film can be generated automatically using clustering based and cluster them into groups corresponding to the characters.
on the appropriate invariance in the facial features [13]–[15]. During the clustering, the Earth Mover’s Distance (EMD) is uti-
Due to the uncontrolled conditions in films with a wide vari- lized to measure the face track distance. Since we try to keep as
ability on faces, approaches only depending on faces are not same as possible with the name statistics in the script, we se-
always reliable. Therefore, multi-modal approaches fused with lect the speaking face tracks to build the face affinity network,
facial features and speaker voice models were proposed [16], which is based on the co-occurrence of the speaking face tracks.
[17]. However, these approaches cannot automatically assign For name and face association, it is formulated as a problem
real names to the characters. To handle this, Everingham et al. of matching vertices between two graphs. A spectral method,
[18] proposed to employ a readily available textual source, the which has been used in 2-D/3-D registration and object recog-
film script, which contains the character names in front of their nition, is introduced here to build name-face association. Es-
spoken lines. However, the film script does not have time in- pecially, during the matching process, priors can be incorpo-
formation to achieve face name matching. Hence, they used the rated for improvement. After assigning names to faces, we also
film script together with the subtitle for text video alignment and determine the leading characters and find cliques based on the
thus obtained certain annotated face exemplars. The rest of the affinity network using social network analysis [22]. A platform
faces were then classified into these exemplars for identification. is presented for character-centered film browsing, which en-
Their approach was also followed by several works which aimed ables users to easily use the name as a query to search related
for video parsing [19] and human action annotation [20]. How- video clips and digest the film content. The whole framework of
ever, in their approach [18], the subtitle text and time-stamps our proposed approach is shown in Fig. 1.
were extracted by OCR, which required extra computation cost Compared with the previous work, the contributions of our
on spelling error correction and text verification. Sometimes, the work include: 1) A graph matching method is introduced to
1278 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 7, NOVEMBER 2009

Fig. 2. Face images mapped into the embedding space described by the first two components of (a) PCA and (b) LLE. The same color points are the faces of the
same character. In panel (b), the ellipses label three face tracks belonging to two characters.

build face-name association. 2) An EMD-based measure of face B. Face Representation by LLE

track distance is presented for face track clustering. 3) Based
on character identification, the relationship between characters After face detection, each face is geometrically aligned into
is mined and a platform is provided for character-centered film a standard form to remove the variation in translation, scale and
browsing. Although the graph matching method and the EMD in-plane rotation, and normalized into a 64 64 gray-scale
measure are derived from the existing work, to the best of our image. Hence, each face can be represented as a 64 64 di-
knowledge, they have not been applied in the face naming mensional gray-scale feature vector. In our work, as multi-view
problem and face track distance measurement yet. We choose or varied expressional faces of the same person should be con-
them reasonably and integrate them in a novel solution to sidered as similar (i.e., to be treated as the neighboring points
address the existing challenge problem: people identification in an intrinsic manifold), it is required to employ a compact
in real-world videos. The global matching solution is different representation which can characterize this neighborhood rela-
from the previous methods which are based on local matching. tionship. Locally linear embedding (LLE) is such a nonlinear
dimensionality reduction technique proposed by Roweis and
II. FACE CLUSTERING Saul [25] which can map high dimensional data that are pre-
We first use a multi-view face tracker [23] to detect and track sumed to lie on a nonlinear manifold, onto a single global coor-
faces on each frame of the video. One feature-length film may dinate system of lower dimensionality, while still preserving the
contain up to 100 000 faces detected on all the frames, which neighborhood relationship. LLE succeeds in recovering the un-
are derived from a few thousands of tracks. Using face track as derlying manifold, whereas linear embedding methods, such as
the granularity can reduce the volume of data to be processed PCA or multi-dimensional scaling (MDS), would map faraway
and preserve multi-view face exemplars in a track. The detected data points to nearby points in their spaces. Hence, LLE is em-
face tracks are stored with the information of the face position, ployed here to project high dimensional face features into the
scale and the start and end frame number of the track. Then, we embedding space which can still preserve their intrinsic struc-
detect speaking face tracks among the face tracks. Finally, we tures. The two panels of Fig. 2 show the first two components
cluster them into groups corresponding to characters and build discovered by PCA and LLE. Note that while the linear projec-
the face affinity network. tion by PCA has a somewhat uniform distribution, the LLE has
a distinctly spiny structure.
A. Speaking Face Track Detection
For speaking face track detection, we first determine the C. Distance Measure Between Face Tracks
speaking face on the frame level in each face track. On each
frame of the face track, the mouth region-of-interest (ROI) is One face track is a set which may contain about 20–500 faces.
located according to the face region. SIFT points are extracted Due to the variance of pose and expression, a face track may
and matched between the current face image and the previous present multiple face exemplars. Matching different face tracks
one. Then, we use the matched SIFT points to calculate the from the same person, just requires that certain faces of the two
transformation model to align the current face to the previous sets can be matched, while others are not necessarily as near
face image plane. The change in the aligned mouth ROI can as possible, but should also not be too far away in the feature
be used to judge whether the face is speaking. Here, we use space. Hence, it is necessary to find a distance measure which
normalized sum of absolute difference (NSAD) [24] to describe has the following properties: 1) it allows for partial matches,
the change in the mouth ROI. Thus, we get a vector of NSAD which is especially important in measuring between two sets
for each face track and use it to label the frame whether the face with different sizes; 2) the severe dissimilarity will be punished.
is speaking. If a face track has more than 10% frames labeled as The face tracks from different persons may also have a few faces
speaking, it will be determined as a speaking face track. More that look like the same due to the angle of view, illumination
technical details can be found in [24]. or image resolution. These partial similarities of the two sets
ZHANG et al.: CHARACTER IDENTIFICATION IN FEATURE-LENGTH FILMS USING GLOBAL FACE-NAME MATCHING 1279

Fig. 3. Distance matrices of face tracks from six characters measured by the (a) minimum distance and the (b) EMD. In the figure, the lower the intensity, the
smaller the distance.

should be excluded by punishing the more severe dissimilarities clustering by spectral clustering and K-Means. The precision of
of the other faces. spectral clustering is 69.6%, and the precision of K-Means is
In most of the previous work [15], [18], minimum distances 58.1%. Since in building signatures, the dominant clusters are
were employed to evaluate the dissimilarity of face tracks: represented by their centers, it requires each cluster to be com-
pact enough. We also calculate the mean intra-cluster distance
(1) of the clustering results. For spectral clustering, it is 3.41; for
K-Means, it is 4.42. The experiments showed that spectral clus-
tering is more suitable in our work. After spectral clustering,
where denotes the Euclidean distance; and are
we can extract dominant clusters from all the detected faces.
two face tracks; and and are the features of two faces
Here it may be argued why we do not use majority voting within
belonging to them, respectively. The problem brought by min-
each face track to determine its cluster label. The reason is that
imum distance is that it only cares about the partial matching
majority voting cannot punish the dissimilarity either.
but does not punish the dissimilarities. In addition, the Eu-
The face tracks can be represented as follows: Let
clidean distance used may not fit in certain situations. From
be the signature of the
panel (b) in Fig. 2, we can see that, although the LLE preserves
first face track with clusters, where is the
the neighborhood relationship in the high dimensional space
cluster center and is the number of faces belonging to this
and has a distinctly spiny structure, their similarities still cannot
cluster; be the signature of the
be simply measured by Euclidean distance. For example, if
second face track with clusters. The EMD between
using Euclidean norm, the distance between face track B and
two face tracks is defined as follows:
face track C is smaller than A and B, while actually A and B
belong to the same person.
The EMD is a metric to evaluate the dissimilarity between (2)
two distributions [26]. It reflects the minimal amount of work
that must be performed to transform one distribution into the where is the ground distance between cluster centers and
other by moving “distribution mass” around. The EMD pun- . Note that the distance is calculated on the feature vectors
ishes the dissimilarity by increasing the amount of transporta- of points derived from spectral clustering. The is the flow
tion work. It is represented as a distribution on certain dominant between and . The denominator of the equation is the
clusters, which are called signature. The signatures do not nec- normalization factor which is the total weight of the smaller
essarily have the same mass, thus it allows for partial matches. signature. Calculation of the EMD is based on the solution of
Hence, the EMD is adequate to measure face track distance. a linear programming problem subjecting to four constraints
Here we come to face the key problem on extracting the dom- which can be found in [26]. To compare with the EMD, we
inant clusters. How can those varied faces which actually belong also use the minimum distance to measure the distance between
to one person be clustered into the same or near ground distance face tracks. The two panels of Fig. 3 illustrate the distance ma-
clusters? In panel (b) of Fig. 2, it is obvious that straightfor- trices of the face tracks measured by the (a) minimum distance
wardly using the K-Means algorithm, whose resulting clusters and the (b) EMD, respectively. The face tracks are collected
are always convex sets, cannot work well. Here we employ spec- from one episode which contains six characters, and sorted by
tral clustering [27] to do clustering on all the faces in the LLE these characters. In panel (b), we can find six distinguished clus-
space. The reason we choose spectral clustering is that it can pre- ters. The intra-cluster distance is significantly smaller than the
serve the local neighborhoods and solve very general problems inter-cluster distance. In panel (a), although the intra-cluster dis-
like interwinds spirals. The number of clusters is set by prior tance is small, some face tracks from different characters also
knowledge derived from the film script. This will be described have small distances. The minimum distance makes them be
in Section II-D in detail. We have compared the results of face treated as the same person due to the partial similarities.
1280 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 7, NOVEMBER 2009

D. Constrained K-Means Clustering nition application. A Gaussian mixture density is a weighted

sum of component densities given by
After computing the EMD between face tracks, a constrained
K-Means clustering is performed to group the scattered face
tracks which belong to the same character. Here, to exploit the (4)
properties of video, the temporal overlapping of the face tracks
is implemented as a “cannot link” constraint when clustering:
where is the th unimodal Gaussian densities; the mixture
the two face tracks which share the common frames cannot be
weights satisfy the constraint ; and is a 24-di-
clustered together. The target number of clusters on face tracks
mensional Mel-frequency cepstral coefficients (MFCCs) feature
is the same as we set in spectral clustering on the faces.
vector. We remove DC-mean of the features and normalize the
is determined as follows: based on observation, we found that
features by their cepstral mean in considering the background
most of the speeches accompany the appearances of the faces
noise in the film. Under the assumption of independent feature
in the video. Hence, we count the number of distinct speaker
vectors, the likelihood of a model for a sequence of feature
names appearing in the script and set as this number. Here
vectors is computed as follows:
the “voice-over” or “off screen voice” in the films is not con-
sidered because they are labeled as “V. O.” or “O. S.” in the (5)
script and can be excluded by preprocess. We also ignore those
characters whose spoken lines are less than three in the script,
because these minor supporting roles appear in limited time and where is computed as in (4). The speaker whose model
their face tracks can be considered as noise contrasting to the gives the maximum likelihood is determined as the target
huge face track amounts of the others. To clean the noise from speaker.
the clustering results, a pruning method is employed in the next For classification, as we have learned discriminative func-
step. tions from face and voice features, we adopt late fusion [28] to
combine these results and yield a final classification score. Let
E. Cluster Pruning be the th face track cluster whose cluster center is , and
the corresponding speech voice model is . Let be the face
In this step, we refine the clustering results by pruning the track to be classified and be the feature vector of the corre-
marginal points which have low confidence belonging to the sponding audio clip. The final discriminative function is defined
current cluster. The confidence is calculated as follows: as follows:

(6)
(3)
where is the EMD between and the cluster center
where is the EMD between the face track and its ; is the likelihood of ’s voice model for ;
cluster center ; is the number of -nearest neighbors of ; and and are set as 0.4 and 0.6, respectively. The face track
and is the number of -nearest neighbors which belong to will be classified into the cluster whose function score
the same cluster with . The point whose confidence is lower is maximal. As described in Section II-D, there exist some face
than a threshold is regarded as the marginal point. tracks belonging to the characters we have ignored. To clean
We collect all the marginal points pruned from the clusters these noises, we set a threshold . If the function score is
and do a re-classification which incorporates the speaker voice lower than , the face track is refused to classify to any
features for enhancement. The reason that we do not combine of the clusters and will be left unlabeled.
the speaker voice features with the face features earlier in the
face track clustering is that sometimes the environment or back- III. FACE-NAME ASSOCIATION
ground sounds in the films are noisy. Directly fusing the speaker
voice feature and the facial feature may affect the clustering re- In the video and in the film script, the faces and the names
sult. For testing, in the face track clustering step, we concate- can both stand for the characters. By treating all the characters
nate the speaker voice feature and the facial feature into one as a small society, we can, respectively, build a name affinity
feature vector to generate face track clusters. The precision of network and a face affinity network in their own domains
the clustering result is 64.3%, while the result of using facial (script and video). For face-name association, we want to seek
feature only is 72.1%. It showed that the early feature fusion de- a matching between the two networks. From the social network
grades the clustering result. Hence, we employ the speaker voice point of view, our work is to find the structural equivalence
features to only reclassify the marginal points which are not actors between the two networks. Two actors, respectively,
confident by the facial features. As the marginal points are all from the two networks are defined to be structural equivalence
speaking face tracks and the speaking frames have been detected if they have the same profile of relationship to other actors in
in Section II-A, we can obtain speech data of each face track by their own networks.
segmenting corresponding clips from the film audio track. For
each face track cluster, 30-s speech data are collected to train A. Name Affinity Network Building
a speaker voice model. Gaussian mixture models (GMMs) are A film script contains the spoken lines of characters together
employed here as it has been proved successful in speaker recog- with the scene information and some brief descriptions. Fig. 4
ZHANG et al.: CHARACTER IDENTIFICATION IN FEATURE-LENGTH FILMS USING GLOBAL FACE-NAME MATCHING 1281

TABLE I
EXAMPLE OF NAME AFFINITY MATRIX IN “NOTTING HILL”

TABLE II
EXAMPLE OF FACE AFFINITY MATRIX IN “NOTTING HILL”

Fig. 4. Script examples of the film “Notting Hill”.

B. Face Affinity Network Building

shows a part of the script of the film “Notting Hill”. The first Supposing that we have obtained face track clusters, we
line is the scene title (i.e., “INT. KITCHEN—DAY”), which give each cluster a numerical label from 1 to corresponding
has a standard format in the text. “INT.” means interior, while to the unnamed characters. Same as in Section III-A, the
its opposite “EXT.” means exterior. “DAY” and “NIGHT” (in face affinity network is also built based on face co-occurrence.
other scene titles) indicate the scene time. Following the scene Hence, we also need to get the face occurrence status of every
title, there are some descriptions on the environment and the cluster in each scene of the video.
actions of characters. In front of each spoken line, there is the Here we briefly introduce the video scene segmentation em-
speaker name. Hence, we can parse the script text and use a ployed in our work. First, we detect the interlaced repetitive
name entity recognition software1 to extract every name in front pattern of shots in the film. This pattern often occurs in the
people conversation, in which the camera repetitively shoots
of the spoken lines. By recognizing the scene titles, we can com-
from one speaker to the other. These interlaced repetitive shots
pute the name occurrence counts of every character within each
are grouped into one shot. Then among the rest of the shots,
scene. The status of name occurrence in every scene can be for- the most visually similar adjacent ones are gradually merged
mulated as an occurrence matrix , where together. The merging order of the shots is recorded as the dis-
is the number of names and is the number of scenes. The entry continuity degree between the shots. The later the two shots are
of the matrix denotes the name count of the th character in merged, the higher the degree of discontinuity they are. Hence,
the th scene. The th row vector de- the scene segmentation points can be inserted in the boundary
notes the occurrence status of the th character in all the scenes. between two shots which have the high degree of discontinuity.
Based on the name occurrence matrix , the The technical details can be found in [29]. By setting a discon-
name affinity network which is presented by a matrix tinuity degree threshold , we can obtain a case of scene par-
can be constructed. The affinity value tition. To align with the scene partition in the film script, we
between two names is represented by their co-occurrence. The change to get the same number of scenes in the video with
co-occurrence between name and in the th scene is defined the script.
as follows: Based on the scene segmentation results, we can compute the
face occurrence matrix on each scene, where
(7) is the number of faces, and is the number of scenes. Here
has the same size with because the number of
Hence, the entry in the matrix which represents the face clusters is set the same as the number of distinct names
affinity value between name and in the entire script can be in the script. Finally, the face affinity network which is repre-
computed as follows: sented by a matrix can also be constructed
by following (8). Table II demonstrates the face affinity matrix
of some face clusters derived from the video of the film “Notting
(8) Hill”. All the values are normalized into the interval [0,1].

C. Vertices Matching Between Two Graphs

The diagonal value of the matrix is the occurrence count of
the th name in the entire script. Table I demonstrates the name We have obtained the name affinity network and the
affinity matrix of some true names choosing from the script of face affinity network . They both can be represented as an
the film “Notting Hill”. All the values are normalized into the undirected, weighted graph, respectively:
interval [0,1].
1http://www.alias-i.com/lingpipe (9)
1282 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 7, NOVEMBER 2009

In , the vertices represent can be seen as the pairwise feature of two as-
names, the edges denote the rela- signments. A pair of correct assignments are probably to
tionship between vertices, the weights denote agree with each other and get high value of . Based
the strength of the edge , and the weights on this definition, is nonnegative and symmetric
denote the self-feature of the vertex (i.e., its occurrence fea- . If two assignments are incompatible
ture). In , the vertices represent to the one-to-one matching constraint [i.e., ,
faces, and the edges and the weights also represent the ], we set . Now the correspondence
relationship between faces. Therefore, the face-name associa- problem is reduced to finding a cluster of assignments
tion problem can be formulated as a graph matching problem, that maximizes the intra-cluster score while meeting
which targets on finding the correct correspondence between the one-to-one matching constraint. The intra-cluster score is
the vertices of the two graphs. Note that this matching process given as follows:
should subject to the one-to-one constrain, due to the reason that
one name can match at most one face and vice-versa. (12)
1) Problem Formulation: Given two graphs , con-
taining vertices, and , also containing vertices, there
are possible correspondence pairs (or assignment) 2) Spectral Matching Method: Spectral methods are com-
, where and . Our aim is monly used for finding the main clusters of a graph. A spec-
to find a correct correspondence mapping . tral technique was introduced by Leordeanu and Hebert [30] for
In the graph, each vertex has a weight which can be seen as its correspondence problem using pairwise constraints. They build
self-feature. In our case, it is the face or name occurrence feature. an affinity matrix of a graph whose vertices represent
However, the occurrence feature is not discriminative enough the potential correspondences and the weights on the edges rep-
to build correct correspondence between the vertices of the two resent pairwise agreements between potential correspondences.
graphs. From Tables I and II, we can see that some diagonal To find the cluster which has the maximal inter-cluster score
values are similar. Consequently, the relationship with other ver- [see (12)], they define an indicator vector , where its el-
tices should be taken into account and treated as the features of ement is the confidence of the th assignment belonging
the current vertex. In an ideal situation, if the scene segmentation to cluster . The norm of is fixed to 1. They aim to get the
in the video is as exact as the film script and the speaking face optimal solution , where . As we
track clustering can achieve 100% precision, the name graph and know, is a symmetric and nonnegative matrix. By the
the face graph should be exactly the same. The face graph can be Rayleigh quotient theorem, will be maximized when
seen as a transform from the name graph by adding noise which is the principal eigenvector of . Since has nonnega-
is introduced by speaking face track clustering and scene seg- tive elements, by the Perron–Frobenius theorem, the elements of
mentation. Nevertheless, those two still reserve the relationship will be in the interval [0,1]. Hence, we can calculate the prin-
and the statistic properties of the characters, such as A has more cipal eigenvector to determine the correct correspondences.
affinities with B than with C, B never has co-occurrence with D, Inspiring from the method in [30], we first initialize the list
etc. Hence, we need to find a method using the relationship and with the set of possible assignments. Then we use the in-
statistic properties to build the correct correspondence which can dividual feature and pairwise feature defined
accommodate a certain noise. above to build the affinity matrix which contains
We store the candidate assignments in a list all possible assignments and is symmetric and nonnegative.
. For each assignment , we can find a measure- From , the principal eigenvector can be calculated. We
ment on how well matching : start by first accepting the most correct assignment whose
eigenvector value is maximum. Next we reject all other
assignments which are in conflict with subjecting to the
(10) one-to-one matching constrain. Then we accept the next most
correct assignment and reject the ones in conflict with it. This
procedure will be repeated until all assignments are either
where is the sensitivity parameter for accommodating noise accepted or rejected. The accepted assignments are the final
or we can say the deformation between the two graphs. results of name-face association.
can be seen as the individual feature of an assignment. A correct 3) Spectral Matching With Priors: The method introduced
assignment often gets high value of . above is conducted in a totally unsupervised fashion. However,
For each pair of assignments , where , sometimes we can have certain prior knowledge such as we have
, we can also find a measurement on how known a correct assignment of a name and a face beforehand.
compatible the two assignments are. For example, on one hand, The question is whether we can get benefit from such kind of
and have an affinity ; on the other hand, and priors on spectral matching. The known assignment can be ob-
also have an affinity . If the pair of assignments are tained by the alignment of the film script and the subtitle using
both correct, the affinity values and should be sim- the method described in [18]. Although the global matching
ilar. Hence, is defined as follows: method we proposed in this paper does not need the timing in-
formation from the subtitle to generate local name cues, we want
to investigate, if they are available, whether the local name cues
can improve the global matching result. We first obtain the sub-
(11) title text from the video by OCR and then align the film script
with the subtitle text by a dynamic time warping algorithm [31].
ZHANG et al.: CHARACTER IDENTIFICATION IN FEATURE-LENGTH FILMS USING GLOBAL FACE-NAME MATCHING 1283

The result is that each script line is tagged with time stamps from A. Character Relationship Mining
the subtitle. Then the speaking face tracks are labeled with the
names which have the corresponding time stamps. In each face To facilitate the character-centered browsing, the character
track cluster which we have built in Section II, certain tracks are relationship is mined first, which includes the determination of
labeled with a name. Due to the errors of the two text sources leading characters and cliques. Since the name affinity network
alignment and the imprecise of speaking face track detection, and the face affinity network can both describe the relationship
some tracks may be mislabeled with wrong names. However, of characters and the name affinity network is more accurate, the
there is no mechanism for error correction in [18]. The face relationship mining is conducted on the name affinity network.
tracks from the same character may be labeled with different From the social network analysis point of view, the leading
names. Hence, for each face track cluster, it should be assigned character can be considered as the one who has high centrality
the majority name. The probability that the cluster is as- in the name affinity network . The centrality
signed the majority name is defined as of a character is defined as . Then the leading
characters can be determined using the method of detecting the
(13) centrality gap among the characters [21]. We sort the centralities
of characters in a descending order: .
where is the number of face tracks which are assigned the Then we calculate the centrality difference between two
name in the cluster , and is the total number of face adjacent ones. The maximum difference will be set as
tracks in the cluster . To obtain the most believable assign- the centrality gap which can distinguish the leading characters
ment of face and name, we select the one which has the highest and the others. The ones whose centrality is are
probability value, and consider it as the known assignment. determined as the leading ones.
Given a known assignment , where Clique is a subset of a network in which the actors are more
, the relationship of other assignments closely and intensely tied to one another than they are to other
with [i.e., ] is therefore more reliable. For members of the network. For clique detection, we use agglomer-
a name , it has a set of possible as- ative hierarchical clustering [32]. The individuals are first
signments: , initialized as cliques. An empty clique List is also located.
and thus has a set of pairwise features with : The major steps are contained in the following procedure:
. Among , we use a
“The Best Takes All” operation: ALGORITHM (Agglomerative Hierarchical Clustering)
if
(14) 1) Begin initialize: , , ,
otherwise.
2) ,
This operation will be done for all the names .
According to the symmetry characteristic, will 3) do
be set as the same value as . In addition, since
4) find nearest cliques, say and
is the known correct assignment, subjecting
to the one-to-one matching constraint, those assignments of 5) if
the form or are in conflict
with it. In the matrix , the entries or 6) then ,
corresponding to these conflicting assignments are all set to 7) else break
0. After all of these operations, the affinity matrix will
incorporate the prior knowledge and be transformed to be more 8) until
sparse. Experimental results showed that incorporating priors
9) return L
can facilitate the graph matching.
where is defined as follows:
IV. APPLICATIONS
Until now, we have associated a name to each speaking face (15)
track cluster; thus, all the speaking face tracks can be identified.
For the rest of the non-speaking face tracks we have detected and are the numbers of the characters in and
before, we can also classify them into the nearest speaking face . In each step, the new merged clique is saved into the list .
track clusters depending on the EMD defined in (2), and asso- We classify the result cliques into dyad which has two members,
ciate names to them. Note that this classification process is only triad which has three members and the large clique. They will be
relied on facial features. listed in a summary of the film for character-centered browsing.
Based on the result of character identification, there are many
applications, such as character-based video retrieval, personal-
B. Character-Centered Browsing
ized video summarization, intelligent playback and video se-
mantic mining, etc. Here we provide a platform for character- Now we will provide a platform to support character-cen-
centered film browsing on which users can use character names tered film browsing. We have identified the faces of characters.
to search and digest the film content. Hence, we can use character names to annotate the scenes in
1284 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 7, NOVEMBER 2009

TABLE III TABLE V

EXAMPLES OF FILM CHARACTER SUMMARY FILM INFORMATION

TABLE VI
SPEAKING FACE TRACK DETECTION

TABLE IV
QUERY EXAMPLES

A. Face Clustering
As the preliminary, we start with the speaking face track de-
tection. To assess its accuracy, we segment three 30-min clips,
respectively, from the films F1, F3, and F8. The statistics of the
experiment are shown in Table VI, where the columns “Face trk”
and “Sp. trk” are the total numbers of face tracks and speaking
the video. The annotation structure of one scene is defined as
face tracks contained in the clips.
follows:
After speaking face track detection, we cluster them into
groups corresponding to the characters. For each cluster, the
cluster pruning mechanism is then used to refine the results.
The point whose confidence [see (3)] is lower than the
Users can use the names of characters or cliques in the query to confidence threshold is determined as the marginal
view the related video scenes. For the convenience of users, a point and pruned. Hence, is the parameter to control
summary on characters of the film is listed automatically which the purity of each cluster. The higher the value of is,
contains the characters (lead and others) and cliques. Taking the the more points will be pruned. To demonstrate the results of
film “Notting Hill” as an example, the summary is shown in our method on face track clustering, we change the value of
Table III. from 0.5 to 0.1 and obtain a clustering precision/recall
Based on the summary, the query can be represented by using curve (see Fig. 5). The term “recall” is used here to indicate
a short sentence, e.g., “1st scene of Anna”, or “All scenes of the proportion of the points not pruned against the total points.
William”. For each query, we need to extract the keywords to The calculation of precision and recall are given as follows. To
infer the intents of the users. The query is formulated as avoid possible confusion with the traditional definition, we use
follows: “ ” for distinguishing:

(16) (17)

where , (18)
. If there is no ordinal number
in the query, the default value is . Some examples of queries For comparison, we also use the minimum distance measure
are given in Table IV. instead of our method during clustering. The result is shown in
Fig. 5. The result of spectral clustering on faces before applying
EMD in Section II-C is also shown as the baseline. It can be seen
V. EXPERIMENT
that our method is more effective to characterize the similarity
To evaluate our character identification approach, the experi- between face tracks and get better clustering results.
ments are conducted on ten feature-length films: “Notting Hill”, After the cluster pruning, we collect the pruned marginal
“Pretty Woman”, “Sleepless in Seattle”, “You’ve Got Mail”, points and do a re-classification. As these are all speaking
“Devil Wears Prada”, “Legally Blond”, “Revolutionary Road”, face tracks, the speaker voice features are fused with the facial
“The Shawshank Redemption”, “Léon”, and “Mission: Impos- features for classification. In Section II-E, we have set a score
sible”. The information of these films are shown in Table V. threshold to discard noise. By changing the value of
ZHANG et al.: CHARACTER IDENTIFICATION IN FEATURE-LENGTH FILMS USING GLOBAL FACE-NAME MATCHING 1285

TABLE VII
NAME-FACE ASSOCIATION

TABLE VIII
NAME-FACE ASSOCIATION WITH PRIOR

Fig. 5. Precision/recall curves of face track clustering.

B. Face-Name Association
We have obtained different clusters of face tracks corre-
sponding to different characters. For assigning names to these
clusters, a spectral matching method is employed to achieve
vertices matching between name and face networks. The results
on the ten films are shown in Table VII. It can be seen that
the accuracy of the thriller and action film (F9 and F10) is
lower than others. It is due to the more severe variation of the
face pose and the illumination in the thriller and action films.
In F9, the characters sometimes wear masks. In a scene of
Fig. 6. Precision/recall curves of face track classification. F10, the hero even disguises his face as the other character.
These matters affect the face clustering and bring noise in
the face affinity matrix. Thus, more errors occur in face-name
from 0.75 to 0.1, we can also get a precision/recall association. Since the proposed method can incorporate priors
curve for face track classification (see Fig. 6). Similarly, here to improve the matching, we give one known assignment of
the term “recall” means the proportion of face tracks which are a name and a face, which is generated in Section III-C3, as a
classified. The calculation of the precision and recall are given prior for each film. The results (see Table VIII) validate the
as follows. Also, we use “ ” to distinguish from the traditional effectiveness of adding priors in the matching process.
definition: A comparison with the existing local matching approach
was carried out. The approach [18] proposed by Everingham
et al. was evaluated on the same dataset. We implemented
(19)
the approach strictly obeying the original description in [18].
The alignment of the film script and the subtitle had been
(20) implemented in Section III-C3 to obtain local name cues. The
speaking face tracks were then labeled with a temporally local
Before face-name association stage, our work is only conducted name and set as exemplars. Other face tracks were classified
on speaking face tracks. Thus, after face-name association, we to these exemplars for labeling. A precision/recall curve was
also classify the non-speaking face tracks into the clusters we obtained to demonstrate the performances. The term “recall”
have built. The classification of non-speaking face tracks is means the proportion of tracks which are assigned a name, and
based on facial features only. The results are also demonstrated “precision” is the proportion of correctly labeled tracks [18]. To
in Fig. 6. As expected, the performance of multi-modal features compare with this approach, we use the names assigned to the
(face + voice) is better than single feature (face) on the marginal clusters to label the face tracks in the clusters. Similarly, we also
points. obtain a precision/recall curve by changing the score threshold
1286 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 7, NOVEMBER 2009

TABLE IX
RELATIONSHIP MINING

TABLE X
USER EVALUATION OF CHARACTER-CENTERED BROWSING

Fig. 7. Precision/recall curves of character identification.

defined in Section II-E to discard some face tracks the test. They are postgraduate students and research staff from
without labeling. To demonstrate the improvement brought by 24 to 40 years old. They each were asked to use five queries in
the prior knowledge introduced in face name matching, the the form of the examples shown in Table IV to browse the related
character identification results with the prior are also illustrated clips in the films. Then they each gave a score to the browsing re-
as a precision/recall curve. The three curves are shown in Fig. 7. sults on three attributes: completeness, acceptance, and novelty.
It can be seen that the global matching method we proposed is Completeness is to measure whether the user has watched what
comparable to the local matching method [18], while using less he/she wants. Acceptance is to measure how much the browsing
information source (film script) than it (film script + subtitle). style is accepted. Novelty is to measure whether it exceeds the
At the high levels of recall, our method even performs better. browsing expectation of the user and brings him/her new expe-
This mainly relies on the effectiveness of the face track distance rience. The score is based on the following scale: 5-very good,
measure in clustering and the employment of the multi-modal 4-good, 3-neutral, 2-bad, 1-very bad. The scores from all sub-
features in cluster pruning. From the curve of “Global matching jects are given in Table X. The results indicate that most of the
+ prior”, we can also find that incorporating certain local users are interested in the character-centered browsing and ac-
name cues as the prior knowledge does improve the character cept this new browsing style. From the automatically generated
identification results, though our method actually does not rely character summary of the film, they can grasp the structure of
on it. Since our method only needs the film script as the text characters in the film, and use it to select and digest the char-
information source, it can be applied under the circumstances acter-related contents. This provides a new alternative for film
that not enough time information can be found. In addition, the contents organization and summarization. One user suggested
method in [18] was restricted to the frontal faces, while our that the video annotation may be extended from scene level to
method deals with the multi-view faces. shot level, which can make it more accurate and complete.

C. Relationship Mining VI. CONCLUSIONS

The performance of relationship mining is shown in Table IX. In this paper, we have proposed a novel framework for char-
The columns “No. of leads” and “No. of cliques” are manually acter identification in feature-length films. Different from the
labeled ground truth. As this process is conducted on the name previous work on naming faces in the videos, most of which
affinity network which is derived from the film script, the social relied on local matching, we have presented a global matching
network analysis on this clean data performs well. A few cliques method. A graph matching method has been utilized to build
are not detected due to the reason that besides the co-occurrence, name-face association between the name affinity network and
more semantical information is needed to detect them. For ex- the face affinity network which are, respectively, derived from
ample, in F3, the hero and the heroine never meet each other their own domains (script and video). As an application, we have
until the last, but semantically they are considered as a clique as mined the relationship between characters and provided a plat-
they fall in love with each other at last. form for character-centered film browsing.
In the future, we will improve our current work along three
D. Character-Centered Browsing directions. 1) In face-name association, some useful information
Based on the results of character identification and relation- such as gender and context information will be integrated to
ship mining, we annotate the scenes with character names and refine the matching result. 2) Currently film content search and
clique names in the format we defined in Section IV-B. To eval- browsing is on the scene level to keep the integrity of the story.
uate the performance of the character-centered browsing, we in- We will extend video annotation and organization on the shot
vited ten subjects (six males and four females) to participate in level to achieve better accuracy and completeness in responding
ZHANG et al.: CHARACTER IDENTIFICATION IN FEATURE-LENGTH FILMS USING GLOBAL FACE-NAME MATCHING 1287

to the query of the user. 3) We will explore to generate a movie [21] C.-Y. Weng, W.-T. Chu, and J.-L. Wu, “Rolenet: Treat a movie as
trailer related to a certain character or a group of characters. a small society,” in Proc. Int. Workshop Multimedia Information Re-
trieval, 2007, pp. 51–60.
[22] J. Scott, Social Network Analysis: A Handbook. Newbury Park, CA:
Sage, 1991.
ACKNOWLEDGMENT [23] Y. Li, H. Z. Ai, C. Huang, and S. H. Lao, “Robust head tracking with
particles based on multiple cues fusion,” in Proc. HCI/ECCV, 2006, pp.
The authors would like to thank S. Chen, Y. Wu, and C. Zang 29–39.
for a number of helpful discussions and sharing necessary codes. [24] Y. Wu, W. Hu, T. Wang, Y. Zhang, J. Cheng, and H. Lu, “Robust
The authors are also grateful to X.-Y. Chen for experimental speaking face identification for video analysis,” in Proc. Pacific Rim
Conf. Multimedia, 2007, pp. 665–674.
data preparation and labeling. [25] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally
linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.
[26] Y. Rubner, C. Tomasi, and L. J. Guibas, “A metric for distributions with
REFERENCES applications to image databases,” in Proc. IEEE Int. Conf. Computer
Vision, 1998, pp. 59–66.
[1] W. Zhao, R. Chelappa, P. J. Phillips, and A. Rosenfeld, “Face recog- [27] A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and
nition: A literature survey,” ACM Compu. Surv., vol. 35, no. 4, pp. an algorithm,” Adv. Neural Inf. Process. Syst. 14, pp. 849–856, 2001.
399–458, 2003. [28] C. Snoek, M. Worring, and A. Smeulders, “Early versus late fusion in
[2] S. Satoh and T. Kanade, “Name-it: Association of face and name in semantic video analysis,” in Proc. 13th Annu. ACM Int. Conf. Multi-
video,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recog- media, 2005, pp. 399–402.
nition, 1997, pp. 368–373. [29] T. Mei, X.-S. Hua, L. Yang, and S. Li, “Videosense—towards effective
[3] R. Houghton, “Named faces: Putting names to faces,” IEEE Intell. online video advertising,” in Proc. 15th Annu. ACM Int. Conf. Multi-
Syst., vol. 14, no. 5, pp. 45–50, 1999. media, 2007, pp. 1075–1084.
[4] J. Yang and A. G. Hauptmann, “Naming every individual in news video [30] M. Leordeanu and M. Hebert, “A spectral technique for correspon-
monologues,” in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, dence problems using pairwise constraints,” in Proc. 10th IEEE Int.
pp. 580–587. Conf. Computer Vision, 2005, pp. 1482–1489.
[5] J. Yang, A. Hauptmann, and M.-Y. Chen, “Finding person x: Corre- [31] C. S. Myers and L. R. Rabiner, “A comparative study of several dy-
lating names with visual appearances,” in Proc. Int. Conf. Image and namic time-warping algorithms for connected word recognition,” Bell
Video Retrieval, 2004, pp. 270–278. Syst. Tech. J., vol. 60, pp. 1389–1409, 1981.
[6] J. Yang, R. Yan, and A. G. Hauptmann, “Multiple instance learning for [32] R. Duda, P. Hart, and D. Stork, Pattern Classification. New York:
labeling faces in broadcasting news video,” in Proc. 13th Annu. ACM Wiley, 2000.
Int. Conf. Multimedia, 2005, pp. 31–40.
[7] D. Ozkan and P. Duygulu, “Finding people frequently appearing
in news,” in Proc. Int. Conf. Image and Video Retrieval, 2006, pp.
173–182.
[8] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Miller,
and D. Foryth, “Names and faces in the news,” in Proc. IEEE Int. Conf.
Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 848–854. Yi-Fan Zhang (S’09) received the B.E. degree from
[9] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, “Automatic Southeast University, Nanjing, China, in 2004. He is
face naming with caption-based supervision,” in Proc. IEEE Int. Conf. currently pursuing the Ph.D. degree at National Lab-
Computer Vision and Pattern Recognition, 2008. oratory of Pattern Recognition, Institute of Automa-
[10] V. Jain, E. Learned-Miller, and A. McCallum, “People-LDA: An- tion, Chinese Academy of Sciences, Beijing, China.
choring topics to people using face recognition,” in Proc. IEEE Int. In 2007, he was an intern student in the Institute
for Infocomm Research, Singapore. Currently he is
Conf. Computer Vision, 2007.
an intern student in China-Singapore Institute of Dig-
[11] O. Arandjelovic and A. Zisserman, “Automatic face recognition for
ital Media. His research interests include multimedia,
film character retrieval in feature-length films,” in Proc. IEEE Int. Conf.
video analysis, and pattern recognition.
Computer Vision and Pattern Recognition, 2005, pp. 860–867.
[12] J. Sivic, M. Everingham, and A. Zisserman, “Person spotting: Video
shot retrieval for face sets,” in Proc. Int. Conf. Image and Video Re-
trieval, 2005, pp. 226–236.
[13] A. W. Fitzgibbon and A. Zisserman, “On affine invariant clustering and Changsheng Xu (M’97–SM’99) is a Professor in
automatic cast listing in movies,” in Proc. Eur. Conf. Computer Vision, the Institute of Automation, Chinese Academy of
2002, vol. 3, pp. 304–320. Sciences, and Executive Director of China-Sin-
[14] O. Arandjelovic and R. Cipolla, “Automatic cast listing in feature- gapore Institute of Digital Media. He was with
length films with anisotropic manifold space,” in Proc. IEEE Int. Conf. Institute for Infocomm Research, Singapore, from
Computer Vision and Pattern Recognition, 2006, pp. 1513–1520. 1998 to 2008. He was with the National Lab
[15] Y. Gao et al., “Cast indexing for videos by ncuts and page ranking,” in of Pattern Recognition, Institute of Automation,
Proc. Int. Conf. Image and Video Retrieval, 2007, pp. 441–447. Chinese Academy of Sciences, Beijing, China,
[16] Z. Liu and Y. Wang, “Major cast detection in video using both speaker from 1996 to 1998. His research interests include
and face information,” IEEE Trans. Multimedia, vol. 9, no. 1, pp. multimedia content analysis, indexing and retrieval,
digital watermarking, computer vision, and pattern
89–101, 2007.
recognition. He published over 170 papers in those areas.
[17] Y. Li, S. Narayanan, and C.-C. J. Kuo, “Content-based movie analysis
Dr. Xu is an Associate Editor of ACM/Springer Multimedia Systems Journal.
and indexing based on audiovisual cues,” IEEE Trans. Circuits Syst. He served as Program Co-Chair of 2009 ACM Multimedia Conference,
Video Technol., vol. 14, no. 8, pp. 1073–1085, 2004. Short Paper Co-Chair of ACM Multimedia 2008, General Co-Chair of 2008
[18] M. Everingham, J. Sivic, and A. Zisserman, ““Hello! My name is . . . Pacific-Rim Conference on Multimedia and 2007 Asia-Pacific Workshop on
Buffy” automatic naming of characters in TV video,” in Proc. British Visual Information Processing (VIP2007), Program Co-Chair of VIP2006,
Machine Vision Conf., 2006, pp. 889–908. Industry Track Chair, and Area Chair of 2007 International Conference on
[19] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, “Movie/script: Align- Multimedia Modeling. He also served as Technical Program Committee
ment and parsing of video and text transcription,” in Proc. 10th Eur. Member of major international multimedia conferences, including ACM
Conf. Computer Vision, 2008, pp. 158–171. Multimedia Conference, International Conference on Multimedia & Expo,
[20] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning real- Pacific-Rim Conference on Multimedia, and International Conference on
istic human actions from movies,” in Proc. IEEE Int. Conf. Computer Multimedia Modeling. He received the 2008 Best Editorial Member Award of
Vision and Pattern Recognition, 2008. ACM/Springer Multimedia Systems Journal. He is a member of ACM.
1288 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 7, NOVEMBER 2009

Hanqing Lu (M’05–SM’06) received the Ph.D. Yueh-Min Huang (M’98) received the M.S. and
degree from Huazhong University of Sciences and Ph.D. degrees in electrical engineering from the
Technology, Wuhan, China, in 1992. University of Arizona, Tucson, in 1988 and 1991,
Currently, he is a Professor in the Institute of respectively.
Automation, Chinese Academy of Sciences, Beijing, He is a Professor and Chairman of the Department
China. His research interests include image simi- of Engineering Science, National Cheng-Kung Uni-
larity measure, video analysis, object recognition, versity, Tainan, Taiwan. His research interests include
and tracking. He has published more than 100 papers multimedia communications, wireless networks, arti-
in those areas. ficial intelligence, and e-Learning. He has coauthored
two books and has published about 200 refereed pro-
fessional research papers.
Dr. Huang has received many research awards, such as the Best Paper Award
of 2007 IEA/AIE Conference; the Awards of Acer Long-Term Prize in 1996,
1998, and 1999; and Excellent Research Awards of National Microcomputer and
Communication Contests in 2006. He has been invited to give talks or served
frequently in the program committee at national and international conferences.
He is in the editorial board of the Journal of Wireless Communications and Mo-
bile Computing, Journal of Security and Communication Networks, and Inter-
national Journal of Communication Systems. He is a member of the IEEE Com-
munication, Computer, and Circuits and Systems Societies.

Electrical and Electronics Measurements and Instrumentation by Prithwiraj Purkait PDF
83% (6)
Electrical and Electronics Measurements and Instrumentation by Prithwiraj Purkait PDF
651 pages
Face Recognition Using Opencv
100% (5)
Face Recognition Using Opencv
70 pages
tcsvt19 Survey
No ratings yet
tcsvt19 Survey
18 pages
Speaker Naming in Movies
No ratings yet
Speaker Naming in Movies
11 pages
Respicio ICIDS09 Unpaginated
No ratings yet
Respicio ICIDS09 Unpaginated
13 pages
Entropy 26 00149
No ratings yet
Entropy 26 00149
33 pages
Face Recognition in Movie Trailers Via Mean Sequence Sparse Representation-Based Classification
No ratings yet
Face Recognition in Movie Trailers Via Mean Sequence Sparse Representation-Based Classification
8 pages
Video-Based Face Recognition and Face-Tracking Using Sparse
No ratings yet
Video-Based Face Recognition and Face-Tracking Using Sparse
10 pages
76.deep Cross-Modal Face Naming For People News Retrieval
No ratings yet
76.deep Cross-Modal Face Naming For People News Retrieval
5 pages
Advances in Face Detection Techniques in Video
No ratings yet
Advances in Face Detection Techniques in Video
6 pages
Rashomon 2 PDF
No ratings yet
Rashomon 2 PDF
10 pages
Where's Waldo: Matching People in Images of Crowds
No ratings yet
Where's Waldo: Matching People in Images of Crowds
8 pages
A Very Fast Sparse Clustering To Cluster Faces
No ratings yet
A Very Fast Sparse Clustering To Cluster Faces
14 pages
Specific Comic Character Detection Using Local Feature Matching
No ratings yet
Specific Comic Character Detection Using Local Feature Matching
5 pages
Cll3v9zhn00eq07zgh71gbz3c - Af014643 f993 6d32 Dfa1 Ba28614f1330 Project Quickwater Scene Boundary Detection Guidelines
No ratings yet
Cll3v9zhn00eq07zgh71gbz3c - Af014643 f993 6d32 Dfa1 Ba28614f1330 Project Quickwater Scene Boundary Detection Guidelines
4 pages
An Effective Approach For Video Copy Detectionand Identification of Misbehaving
No ratings yet
An Effective Approach For Video Copy Detectionand Identification of Misbehaving
9 pages
1 - 4. An Approach For Video Summarization Based On Unsupervised Learning Using Deep Semantic Features and Keyframe Extraction
No ratings yet
1 - 4. An Approach For Video Summarization Based On Unsupervised Learning Using Deep Semantic Features and Keyframe Extraction
8 pages
A Philosopher's Understanding of Quantum Mechanics (Vermaas)
100% (1)
A Philosopher's Understanding of Quantum Mechanics (Vermaas)
308 pages
Efficient Video Indexing For Fast-Motion Video
No ratings yet
Efficient Video Indexing For Fast-Motion Video
16 pages
Proposal and Implementation of A Novel Scheme For Image and Emotion Recognition Using Hadoop
No ratings yet
Proposal and Implementation of A Novel Scheme For Image and Emotion Recognition Using Hadoop
6 pages
Human Face Detection and Recognition Using Web-Cam
No ratings yet
Human Face Detection and Recognition Using Web-Cam
9 pages
Face Recognition System Using Self Organizing Feature Map and Appearance Based Approach
No ratings yet
Face Recognition System Using Self Organizing Feature Map and Appearance Based Approach
6 pages
Face Recognition Ieee Paper Based On Image Processing
100% (1)
Face Recognition Ieee Paper Based On Image Processing
9 pages
A Survey On The Automatic Indexing of Video Data 1999 Journal of Visual Communication and Image Representation
No ratings yet
A Survey On The Automatic Indexing of Video Data 1999 Journal of Visual Communication and Image Representation
35 pages
Face Recognition of Cartoon Characters
No ratings yet
Face Recognition of Cartoon Characters
5 pages
2005 - Designing Multiple Classi Er Systems
No ratings yet
2005 - Designing Multiple Classi Er Systems
10 pages
Introduction To Programming Using Python 1st Edition Ebook
50% (2)
Introduction To Programming Using Python 1st Edition Ebook
13 pages
Implementation of Video Tagging
No ratings yet
Implementation of Video Tagging
3 pages
A Survey On Video-Based Face Recognition Approaches
No ratings yet
A Survey On Video-Based Face Recognition Approaches
8 pages
Dynamic Video Narratives: Carlos D. Correa University of California, Davis Kwan-Liu Ma University of California, Davis
No ratings yet
Dynamic Video Narratives: Carlos D. Correa University of California, Davis Kwan-Liu Ma University of California, Davis
9 pages
Efficient and Robust Detection of Duplicate Videos in A Database
No ratings yet
Efficient and Robust Detection of Duplicate Videos in A Database
4 pages
ASET Abstract Reasoning Sample Test2
100% (1)
ASET Abstract Reasoning Sample Test2
15 pages
Review On Label Refinement of Web Facial Images
No ratings yet
Review On Label Refinement of Web Facial Images
4 pages
Indexing Narrative Structure and Semantics in Motion Pictures WLTH A Probabilistic Framework
No ratings yet
Indexing Narrative Structure and Semantics in Motion Pictures WLTH A Probabilistic Framework
4 pages
Attendance System Based On The Face Recognition of Webcam's Image of The Classroom
No ratings yet
Attendance System Based On The Face Recognition of Webcam's Image of The Classroom
11 pages
Human Detection in Videos: Muhammad Usman Ghani Khan, Atif Saeed
No ratings yet
Human Detection in Videos: Muhammad Usman Ghani Khan, Atif Saeed
9 pages
Phase-II - PPT - Template-Updated by Riza
No ratings yet
Phase-II - PPT - Template-Updated by Riza
18 pages
MPEG-4 Based Animation With Face Feature Tracking
No ratings yet
MPEG-4 Based Animation With Face Feature Tracking
10 pages
IJCRT2006237
No ratings yet
IJCRT2006237
5 pages
Review of Identification of Face-Name in Videos: Abstract
No ratings yet
Review of Identification of Face-Name in Videos: Abstract
4 pages
CO Distribution Cycle
No ratings yet
CO Distribution Cycle
10 pages
From Still Image To Video-Based Face Recognition: An Experimental Analysis
No ratings yet
From Still Image To Video-Based Face Recognition: An Experimental Analysis
6 pages
A Survey On Efficient Facial Feature Extraction Technique For Image Annotation
No ratings yet
A Survey On Efficient Facial Feature Extraction Technique For Image Annotation
4 pages
Face Identification
No ratings yet
Face Identification
62 pages
Cspa Face Recognition
No ratings yet
Cspa Face Recognition
6 pages
1.1) Background
No ratings yet
1.1) Background
3 pages
Logit Model For Binary Data
No ratings yet
Logit Model For Binary Data
50 pages
Hybrid Features Extraction For Adaptive Face Images Retrieval
No ratings yet
Hybrid Features Extraction For Adaptive Face Images Retrieval
10 pages
Enhanced Face Recognition Using Euclidean Distance Classification and PCA
No ratings yet
Enhanced Face Recognition Using Euclidean Distance Classification and PCA
5 pages
A Robust, Low-Cost Approach To Face Detection and Face Recognition
No ratings yet
A Robust, Low-Cost Approach To Face Detection and Face Recognition
6 pages
Research Proposal: Author: Muhammad Ghifary Proposed Research Topic
No ratings yet
Research Proposal: Author: Muhammad Ghifary Proposed Research Topic
3 pages
A. 2. Benefits of The Project: 1.3 Problem Statement
No ratings yet
A. 2. Benefits of The Project: 1.3 Problem Statement
2 pages
Face Recognition Technology
No ratings yet
Face Recognition Technology
27 pages
Chap 6
No ratings yet
Chap 6
24 pages
Face Recognition Report
No ratings yet
Face Recognition Report
3 pages
Fast Face Recognition Using Eigen Faces
No ratings yet
Fast Face Recognition Using Eigen Faces
4 pages
Bio Fasokey: Keywords: Forged, Biometry
No ratings yet
Bio Fasokey: Keywords: Forged, Biometry
6 pages
MS Excel 280 Short Keys Guide Book
No ratings yet
MS Excel 280 Short Keys Guide Book
36 pages
Face Detection Tracking Opencv
No ratings yet
Face Detection Tracking Opencv
6 pages
Applied Ai Book Preview 2018
No ratings yet
Applied Ai Book Preview 2018
68 pages
Mohr Circle
No ratings yet
Mohr Circle
14 pages
Calculation Cover Sheet Date: Author: Project: Calc No: Title
No ratings yet
Calculation Cover Sheet Date: Author: Project: Calc No: Title
6 pages
ECCS P123 Preface Table of Contents
No ratings yet
ECCS P123 Preface Table of Contents
7 pages
Face Recognition 1
No ratings yet
Face Recognition 1
6 pages
6.ABC Analysis
No ratings yet
6.ABC Analysis
32 pages
Explanatory Research Design Handout Prof - Panke
No ratings yet
Explanatory Research Design Handout Prof - Panke
1 page
Chapter 1
No ratings yet
Chapter 1
20 pages
41 50
No ratings yet
41 50
18 pages
Traupal Notes
No ratings yet
Traupal Notes
41 pages
Peter Brass-Advanced Data Structures-Cambridge University Press (2008) - Removed
No ratings yet
Peter Brass-Advanced Data Structures-Cambridge University Press (2008) - Removed
5 pages
Estimating On A Number Line To 1000 - Horizontal
No ratings yet
Estimating On A Number Line To 1000 - Horizontal
7 pages
Schenk 2010
No ratings yet
Schenk 2010
16 pages
Test of Arithmetic Progression
No ratings yet
Test of Arithmetic Progression
2 pages
Osl Languagespec
No ratings yet
Osl Languagespec
101 pages
T4 ENG Questions
No ratings yet
T4 ENG Questions
6 pages
GAN-based Synthetic Medical Image Augmentation
No ratings yet
GAN-based Synthetic Medical Image Augmentation
10 pages
Non Homogenous BVP Notes
No ratings yet
Non Homogenous BVP Notes
25 pages
Newtons Laws of Motion PDF
No ratings yet
Newtons Laws of Motion PDF
43 pages
Vibhor Kumar CV
No ratings yet
Vibhor Kumar CV
2 pages
Maths Assignment
No ratings yet
Maths Assignment
7 pages
DBM 20023 Engineering Mathematics 2: Application of Differentiation
No ratings yet
DBM 20023 Engineering Mathematics 2: Application of Differentiation
7 pages
Iccs Brochure Vagai
No ratings yet
Iccs Brochure Vagai
2 pages
Natural Language Processing with Python and spaCy: A Practical Introduction
From Everand
Natural Language Processing with Python and spaCy: A Practical Introduction
Yuli Vasiliev
No ratings yet
A Simplified Approach to It Architecture with Bpmn: A Coherent Methodology for Modeling Every Level of the Enterprise
From Everand
A Simplified Approach to It Architecture with Bpmn: A Coherent Methodology for Modeling Every Level of the Enterprise
David W. Enstrom
No ratings yet
Animation at Work
From Everand
Animation at Work
Rachel Nabors
No ratings yet
Video Modeling: Visual-Based Strategies to Help People on the Autism Spectrum
From Everand
Video Modeling: Visual-Based Strategies to Help People on the Autism Spectrum
Steve Lockwood
4/5 (18)
An Executive Guide Biometrics
From Everand
An Executive Guide Biometrics
alasdair gilchrist
No ratings yet
Artificial Intelligence Frame: Fundamentals and Applications
From Everand
Artificial Intelligence Frame: Fundamentals and Applications
Fouad Sabry
No ratings yet
Speech Recognition: Fundamentals and Applications
From Everand
Speech Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet

Character Identification in Feature-Length Films Using Global Face-Name Matching

Uploaded by

Character Identification in Feature-Length Films Using Global Face-Name Matching

Uploaded by

1276 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO.

Character Identification in Feature-Length Films

Fig. 1. Framework of character identification using global face-name matching.

build face-name association. 2) An EMD-based measure of face B. Face Representation by LLE

D. Constrained K-Means Clustering nition application. A Gaussian mixture density is a weighted

Fig. 4. Script examples of the film “Notting Hill”.

B. Face Affinity Network Building

C. Vertices Matching Between Two Graphs

TABLE III TABLE V

Fig. 5. Precision/recall curves of face track clustering.

Fig. 7. Precision/recall curves of character identification.

C. Relationship Mining VI. CONCLUSIONS

You might also like