Character Identification in Feature-Length Films Using Global Face-Name Matching
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 11, NO. 7, NOVEMBER 2009
Abstract—Identification of characters in films, although very intuitive to humans, still poses a significant challenge to computer methods. In this paper, we investigate the problem of identifying characters in feature-length films using video and film script. Different from the state-of-the-art methods on naming faces in videos, most of which use local matching between a visible face and one of the names extracted from the temporally local video transcript, we attempt a global matching between names and clustered face tracks under the circumstance that not enough local name cues can be found. The contributions of our work include: 1) a graph matching method is utilized to build face-name association between a face affinity network and a name affinity network which are, respectively, derived from their own domains (video and script); 2) an effective measure of face track distance is presented for face track clustering; and 3) as an application, the relationship between characters is mined using social network analysis. The proposed framework is able to create a new experience of character-centered film browsing. Experiments are conducted on ten feature-length films and give encouraging results.

Index Terms—Face identification, movie analysis, social network analysis, video browsing.

Manuscript received August 04, 2008; revised April 12, 2009. First published August 18, 2009; current version published October 16, 2009. This work was supported in part by the National Natural Science Foundation of China (Grant No. 60833006) and in part by the Beijing Municipal Laboratory of Multimedia and Intelligent Software Technology. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jie Yang. Y.-F. Zhang, C. Xu, and H. Lu are with the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the China-Singapore Institute of Digital Media, Singapore, 119615 (e-mail: [email protected]; [email protected]; [email protected]). Y.-M. Huang is with the National Cheng-Kung University, Tainan, Taiwan (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2009.2030629

I. INTRODUCTION

WITH the flourishing development of the movie industry, a huge amount of movie data is being generated every day. It has become very important for a media creator or distributor to provide better media content description, indexing and organization, so that users can easily browse, skim and retrieve the content of interest. In a film, characters are the focus of the audience's interest. Their occurrences provide a meaningful presentation of the video content. Hence, characters are among the most important content to be indexed, and thus character identification becomes a critical step in film semantic analysis.

Character identification in feature-length films, although very intuitive to humans, still poses a significant challenge to computer methods. This is because characters may show large variation in appearance, including scale, pose, illumination, expression and clothing. Recognizing people by their faces is a well-known difficult problem [1]; meanwhile, giving identities to the recognized faces also requires tackling the ambiguity of identities. The objective of this work is to identify the faces of characters in a film and label them with their names. Based on our work, users can easily use a name as a query to select the characters of interest and view the related video clips. This character-centered browsing not only brings a new viewing experience, but also provides an alternative for video summarization and digestion.

In this paper, we present a novel approach for character identification in feature-length films. In films, the names of characters seldom appear directly in the subtitle, while the film script, which contains the names, does not have time stamps to align with the video. Not enough temporally local name cues can be found for local face-name matching. Hence, we attempt a global matching between the faces detected from the video and the names extracted from the film script, which differs from the state-of-the-art methods on naming faces in videos. Based on the results of character identification, an application for character-centered film browsing is also presented which allows users to use a name as a query to search related video clips and digest the film content.

A. Related Work

The crux of the problem of associating faces with names is to exploit the relations between videos or images and the associated texts in order to label the faces with names under little or even no manual intervention. Extensive research efforts have been concentrated on this problem. Name-it [2] is the first proposal on face-name association in news videos, based on the co-occurrence between the detected faces and the names extracted from the video transcript. A face is labeled with the name that frequently co-occurs with it. The Named Faces system [3] built a database of named faces by recognizing the people's names overlaid on the video frames using video optical character recognition (VOCR). Yang et al. [4], [5] employed the closed caption and speech transcript, and built models for predicting the probability that a name in the text matches a face in the video frame. They improved their methods in [6] by using multiple instance learning for partially labeled faces to reduce the effort of collecting data by users. In [7], the speech transcript was also used to find people frequently appearing in news videos. Similarly, for face identification in news images, the problem
was also addressed as clustering or classifying the faces into person-specific appearance models, supervised by the name cues extracted from the image captions or associated news articles [8]–[10].

Although some of these methods showed promising face identification results on video, most of them are applied to news videos, where candidate names for the faces can easily be obtained from simultaneously appearing captions or temporally local transcripts. Unlike news videos, which are presented by a third person such as the anchor or a reporter, in films the names of characters seldom appear directly in the subtitle, which makes it difficult to get local name cues. Hence, many efforts in film analysis were devoted to major character detection or automatic cast listing, but not to assigning real names to them. Arandjelovic and Zisserman [11] used a face image as a query to retrieve particular characters. Affine warping and illumination correction were utilized to alleviate the effects of pose and illumination variations. In [12], multiple face exemplars were obtained from face tracks to improve the face matching results. For automatic cast listing, faces of the major characters in a feature film can be generated automatically using clustering based on appropriate invariance in the facial features [13]–[15]. Due to the uncontrolled conditions in films, with wide variability in faces, approaches depending only on faces are not always reliable. Therefore, multi-modal approaches fusing facial features and speaker voice models were proposed [16], [17]. However, these approaches cannot automatically assign real names to the characters. To handle this, Everingham et al. [18] proposed to employ a readily available textual source, the film script, which contains the character names in front of their spoken lines. However, the film script does not have time information to achieve face-name matching. Hence, they used the film script together with the subtitle for text-video alignment and thus obtained certain annotated face exemplars. The rest of the faces were then classified into these exemplars for identification. Their approach was also followed by several works which aimed at video parsing [19] and human action annotation [20]. However, in their approach [18], the subtitle text and time stamps were extracted by OCR, which required extra computation cost for spelling error correction and text verification. Sometimes, the cross-linguistic problem and the inconsistencies between subtitles and scripts may bring more difficulties in alignment.

B. Overview of Our Approach

In a film, the interactions among the characters assemble them into a relationship network, which allows a film to be treated as a small society [21]. Every character has his or her social position and keeps a certain relationship with the others. In the video, faces can stand for characters, and the co-occurrence of faces in a scene can represent an interaction between characters. Hence, the statistical properties of the faces can preserve the mutual relationships in the character network. In the same way, in the film script, the spoken lines of different characters appearing in the same scene also represent an interaction. Thus, the names in front of the spoken lines can also build a name affinity network. Both the statistical properties of the faces and of the names motivate us to seek a correspondence between the face affinity network and the name affinity network. The name affinity network can be straightforwardly built from the script. For the face affinity network, we first detect face tracks in the video and cluster them into groups corresponding to the characters. During the clustering, the Earth Mover's Distance (EMD) is utilized to measure the face track distance. Since we try to stay as close as possible to the name statistics in the script, we select the speaking face tracks to build the face affinity network, which is based on the co-occurrence of the speaking face tracks. Name-face association is then formulated as a problem of matching vertices between two graphs. A spectral method, which has been used in 2-D/3-D registration and object recognition, is introduced here to build the name-face association. In particular, during the matching process, priors can be incorporated for improvement. After assigning names to faces, we also determine the leading characters and find cliques based on the affinity network using social network analysis [22]. A platform is presented for character-centered film browsing, which enables users to easily use a name as a query to search related video clips and digest the film content. The whole framework of our proposed approach is shown in Fig. 1.

Compared with the previous work, the contributions of our work include: 1) a graph matching method is introduced to build face-name association between a face affinity network and a name affinity network which are, respectively, derived from their own domains (video and script); 2) an effective measure of face track distance is presented for face track clustering; and 3) as an application, the relationship between characters is mined using social network analysis.
Fig. 2. Face images mapped into the embedding space described by the first two components of (a) PCA and (b) LLE. Points of the same color are faces of the same character. In panel (b), the ellipses label three face tracks belonging to two characters.

Fig. 3. Distance matrices of face tracks from six characters, measured by (a) the minimum distance and (b) the EMD. The lower the intensity, the smaller the distance.
should be excluded by punishing the more severe dissimilarities of the other faces.

In most of the previous work [15], [18], the minimum distance was employed to evaluate the dissimilarity of face tracks:

$D_{\min}(T_1, T_2) = \min_{i,j}\, d(f_i^{1}, f_j^{2})$   (1)

where $d(\cdot,\cdot)$ denotes the Euclidean distance, $T_1$ and $T_2$ are two face tracks, and $f_i^{1}$ and $f_j^{2}$ are the features of two faces belonging to them, respectively. The problem brought by the minimum distance is that it only cares about the partial matching but does not punish the dissimilarities. In addition, the Euclidean distance may not fit in certain situations. From panel (b) in Fig. 2, we can see that, although LLE preserves the neighborhood relationship of the high-dimensional space and has a distinctly spiny structure, the similarities still cannot simply be measured by the Euclidean distance. For example, using the Euclidean norm, the distance between face track B and face track C is smaller than that between A and B, while actually A and B belong to the same person.

The EMD is a metric for evaluating the dissimilarity between two distributions [26]. It reflects the minimal amount of work that must be performed to transform one distribution into the other by moving "distribution mass" around. The EMD punishes dissimilarity by increasing the amount of transportation work. Each face track is represented as a distribution over certain dominant clusters, which is called its signature. The signatures do not necessarily have the same mass; thus the EMD allows for partial matches. Hence, the EMD is adequate for measuring face track distance.

Here we face the key problem of extracting the dominant clusters. How can those varied faces which actually belong to one person be clustered into the same or nearby clusters in the ground distance? In panel (b) of Fig. 2, it is obvious that straightforwardly using the K-Means algorithm, whose resulting clusters are always convex sets, cannot work well. Here we employ spectral clustering [27] on all the faces in the LLE space. The reason we choose spectral clustering is that it can preserve local neighborhoods and solve very general problems like intertwined spirals. The number of clusters is set by prior knowledge derived from the film script; this will be described in Section II-D in detail. We have compared the results of face clustering by spectral clustering and K-Means. The precision of spectral clustering is 69.6%, and the precision of K-Means is 58.1%. Since, in building signatures, the dominant clusters are represented by their centers, each cluster is required to be compact enough. We also calculated the mean intra-cluster distance of the clustering results: for spectral clustering it is 3.41, and for K-Means it is 4.42. The experiments showed that spectral clustering is more suitable for our work. After spectral clustering, we can extract dominant clusters from all the detected faces. It may be asked why we do not use majority voting within each face track to determine its cluster label. The reason is that majority voting cannot punish the dissimilarity either.

The face tracks can be represented as follows. Let $P = \{(p_1, w_1), \ldots, (p_m, w_m)\}$ be the signature of the first face track with $m$ clusters, where $p_i$ is a cluster center and $w_i$ is the number of faces belonging to this cluster, and let $Q = \{(q_1, u_1), \ldots, (q_n, u_n)\}$ be the signature of the second face track with $n$ clusters. The EMD between the two face tracks is defined as follows:

$\mathrm{EMD}(P, Q) = \dfrac{\sum_{i=1}^{m}\sum_{j=1}^{n} g_{ij}\, d_{ij}}{\min\!\left(\sum_{i=1}^{m} w_i,\; \sum_{j=1}^{n} u_j\right)}$   (2)

where $d_{ij}$ is the ground distance between cluster centers $p_i$ and $q_j$. Note that the distance is calculated on the feature vectors of points derived from spectral clustering. The $g_{ij}$ is the flow between $p_i$ and $q_j$ that minimizes the overall transportation cost. The denominator of the equation is the normalization factor, which is the total weight of the smaller signature. Calculation of the EMD is based on the solution of a linear programming problem subject to four constraints, which can be found in [26]. To compare with the EMD, we also use the minimum distance to measure the distance between face tracks. The two panels of Fig. 3 illustrate the distance matrices of the face tracks measured by (a) the minimum distance and (b) the EMD, respectively. The face tracks are collected from one episode which contains six characters, and are sorted by these characters. In panel (b), we can find six distinct clusters: the intra-cluster distance is significantly smaller than the inter-cluster distance. In panel (a), although the intra-cluster distance is small, some face tracks from different characters also have small distances. The minimum distance makes them be treated as the same person due to partial similarities.
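Since the EMD in (2) is the optimum of a small transportation problem, it can be computed with a generic linear-programming solver. The following is a minimal sketch (not the authors' implementation) using SciPy, with each signature given as cluster centers plus per-cluster face counts:

```python
import numpy as np
from scipy.optimize import linprog

def emd(centers_p, weights_p, centers_q, weights_q):
    """Earth Mover's Distance between two signatures, as in (2).

    A signature is a set of cluster centers with associated weights
    (the number of faces per dominant cluster, in the paper's setting).
    """
    P = np.asarray(centers_p, float)
    Q = np.asarray(centers_q, float)
    wp = np.asarray(weights_p, float)
    wq = np.asarray(weights_q, float)
    m, n = len(wp), len(wq)
    # Ground distances between cluster centers (Euclidean).
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).ravel()
    # Row constraints: sum_j g_ij <= w_i; column constraints: sum_i g_ij <= u_j.
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([wp, wq])
    # Total flow equals the weight of the smaller signature (partial matching).
    total_flow = min(wp.sum(), wq.sum())
    res = linprog(d, A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, m * n)), b_eq=[total_flow],
                  bounds=(0, None))
    return res.fun / total_flow  # normalize by the smaller total weight
```

For identical signatures the distance is zero; for two single-cluster signatures it reduces to the ground distance between the two centers.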
The confidence of a point is computed as in (3), where $\mathrm{EMD}(T, C)$ is the EMD between the face track $T$ and its cluster center $C$, $K$ is the number of $K$-nearest neighbors of $T$, and $K_s$ is the number of those $K$-nearest neighbors which belong to the same cluster as $T$. A point whose confidence is lower than a threshold is regarded as a marginal point.

We collect all the marginal points pruned from the clusters and do a re-classification which incorporates the speaker voice features for enhancement. The reason we do not combine the speaker voice features with the facial features earlier, in the face track clustering, is that the environmental or background sounds in films are sometimes noisy, and directly fusing the speaker voice feature with the facial feature may affect the clustering result. As a test, in the face track clustering step we concatenated the speaker voice feature and the facial feature into one feature vector to generate the face track clusters. The precision of the clustering result was 64.3%, while the result using the facial feature alone was 72.1%. This showed that early feature fusion degrades the clustering result. Hence, we employ the speaker voice features only to reclassify the marginal points on which the facial features are not confident. As the marginal points are all speaking face tracks and the speaking frames have been detected in Section II-A, we can obtain speech data for each face track by segmenting the corresponding clips from the film's audio track. For each face track cluster, 30 s of speech data are collected to train a speaker voice model. Gaussian mixture models (GMMs) are employed here as they have proved successful in speaker recognition.

The reclassification score of a marginal face track is given by (6), where $\mathrm{EMD}(T, C_k)$ is the EMD between the track $T$ and the cluster center $C_k$, the second term is the likelihood of cluster $C_k$'s voice model for $T$, and the two weights are set as 0.4 and 0.6, respectively. The face track is classified into the cluster whose score in (6) is maximal. As described in Section II-D, there exist some face tracks belonging to characters we have ignored. To remove this noise, we set a threshold: if the maximal score is lower than the threshold, the face track is refused by all of the clusters and is left unlabeled.

III. FACE-NAME ASSOCIATION

In the video and in the film script, the faces and the names can both stand for the characters. By treating all the characters as a small society, we can build a name affinity network and a face affinity network, respectively, in their own domains (script and video). For face-name association, we seek a matching between the two networks. From the social network point of view, our work is to find the structurally equivalent actors between the two networks. Two actors, respectively from the two networks, are defined to be structurally equivalent if they have the same profile of relationships to the other actors in their own networks.

A. Name Affinity Network Building

A film script contains the spoken lines of the characters together with the scene information and some brief descriptions. Fig. 4
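The reclassification step can be sketched as a weighted fusion of a face-distance term and a voice-model likelihood. The exact functional form of (6) is not recoverable from this excerpt, so the `exp(-distance)` similarity, the threshold value, and the function name below are assumptions; only the weights 0.4/0.6 and the reject-below-threshold rule come from the text:

```python
import numpy as np

ALPHA, BETA = 0.4, 0.6   # fusion weights reported in the paper
SCORE_THRESHOLD = 0.5    # assumed value of the rejection threshold

def classify_marginal_track(face_dists, voice_likelihoods):
    """Return the best cluster index, or None if every score is too low.

    face_dists[k]        -- EMD between the track and cluster k's center
    voice_likelihoods[k] -- likelihood of cluster k's voice model (in [0, 1])
    """
    face_dists = np.asarray(face_dists, float)
    voice_likelihoods = np.asarray(voice_likelihoods, float)
    # Assumed fusion: turn the face distance into a similarity, then mix.
    scores = ALPHA * np.exp(-face_dists) + BETA * voice_likelihoods
    best = int(np.argmax(scores))
    # Tracks of ignored characters score low everywhere; leave them unlabeled.
    return best if scores[best] >= SCORE_THRESHOLD else None
```

A track close to cluster 0 in both modalities is accepted there, while a track far from every cluster falls below the threshold and stays unlabeled.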
TABLE I
EXAMPLE OF NAME AFFINITY MATRIX IN “NOTTING HILL”
TABLE II
EXAMPLE OF FACE AFFINITY MATRIX IN “NOTTING HILL”
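As an illustration of how a name affinity matrix such as Table I can be derived from a script, here is a minimal sketch. It assumes the script has already been parsed into per-scene speaker lists, and the function name is ours; following the paper, diagonal entries record occurrences and off-diagonal entries record scene-level co-occurrences:

```python
from itertools import combinations

def build_name_affinity(scenes):
    """Build a name affinity matrix from a parsed film script.

    `scenes` is a list of scenes, each a list of speaker names taken
    from in front of the spoken lines.  The diagonal counts a
    character's occurrences (the vertex self-feature); off-diagonal
    entries count scene-level co-occurrences (the edge weights).
    """
    names = sorted({n for scene in scenes for n in scene})
    index = {n: i for i, n in enumerate(names)}
    size = len(names)
    affinity = [[0] * size for _ in range(size)]
    for scene in scenes:
        present = sorted(set(scene))        # each scene counted once per name
        for n in present:
            affinity[index[n]][index[n]] += 1
        for a, b in combinations(present, 2):
            affinity[index[a]][index[b]] += 1
            affinity[index[b]][index[a]] += 1
    return names, affinity
```

The resulting matrix is symmetric and nonnegative, which is what the graph matching in Section III-B expects.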
In the name graph, the vertices represent names, the edges denote the relationships between vertices, the edge weights denote the strength of the edges, and the vertex weights denote the self-feature of each vertex (i.e., its occurrence feature). In the face graph, the vertices represent faces, and the edges and weights likewise represent the relationships between faces. Therefore, the face-name association problem can be formulated as a graph matching problem, which targets finding the correct correspondence between the vertices of the two graphs. Note that this matching process is subject to the one-to-one constraint, because one name can match at most one face and vice versa.

1) Problem Formulation: Given two graphs, $G_1$ containing $n$ vertices and $G_2$ also containing $n$ vertices, there are $n^2$ possible correspondence pairs (or assignments) $a = (i, i')$, where $i \in G_1$ and $i' \in G_2$. Our aim is to find a correct correspondence mapping. In each graph, every vertex has a weight which can be seen as its self-feature; in our case, it is the face or name occurrence feature. However, the occurrence feature is not discriminative enough to build a correct correspondence between the vertices of the two graphs. From Tables I and II, we can see that some diagonal values are similar. Consequently, the relationships with the other vertices should be taken into account and treated as features of the current vertex. In an ideal situation, if the scene segmentation of the video were as exact as the film script and the speaking face track clustering could achieve 100% precision, the name graph and the face graph would be exactly the same. The face graph can be seen as a transform of the name graph with added noise, which is introduced by the speaking face track clustering and the scene segmentation. Nevertheless, the two graphs still preserve the relationships and the statistical properties of the characters, such as "A has more affinity with B than with C" or "B never co-occurs with D". Hence, we need a method that uses these relationships and statistical properties to build the correct correspondence while accommodating a certain amount of noise.

We store the candidate assignments in a list $L$. For each assignment $a = (i, i')$, we can define a measurement of how well $i$ matches $i'$:

$V(a) = \exp\!\left(-\,(w(i,i) - w'(i',i'))^2 / \sigma^2\right)$   (10)

where $w(i,i)$ and $w'(i',i')$ are the vertex self-weights of $i$ and $i'$, and $\sigma$ is the sensitivity parameter for accommodating noise, or, one can say, the deformation between the two graphs. $V(a)$ can be seen as the individual feature of an assignment; a correct assignment usually gets a high value of $V(a)$.

For each pair of assignments $(a, b)$, where $a = (i, i')$ and $b = (j, j')$, we can also define a measurement of how compatible the two assignments are. On one hand, $i$ and $j$ have an affinity $w(i,j)$; on the other hand, $i'$ and $j'$ have an affinity $w'(i',j')$. If the pair of assignments are both correct, the affinity values $w(i,j)$ and $w'(i',j')$ should be similar. Hence, $M(a,b)$ is defined as follows:

$M(a,b) = \exp\!\left(-\,(w(i,j) - w'(i',j'))^2 / \sigma^2\right)$   (11)

$M(a,b)$ can be seen as the pairwise feature of two assignments. A pair of correct assignments is likely to agree with each other and get a high value of $M(a,b)$. By this definition, $M(a,b)$ is nonnegative and symmetric, $M(a,b) = M(b,a)$. If two assignments are incompatible with the one-to-one matching constraint [i.e., $i = j$ and $i' \neq j'$, or $i \neq j$ and $i' = j'$], we set $M(a,b) = 0$. Now the correspondence problem is reduced to finding a cluster $C$ of assignments that maximizes the intra-cluster score while meeting the one-to-one matching constraint. The intra-cluster score is given as follows:

$S(C) = \sum_{a, b \in C} M(a, b)$   (12)

2) Spectral Matching Method: Spectral methods are commonly used for finding the main clusters of a graph. A spectral technique was introduced by Leordeanu and Hebert [30] for the correspondence problem using pairwise constraints. They build an affinity matrix $M$ of a graph whose vertices represent the potential correspondences and whose edge weights represent pairwise agreements between potential correspondences. To find the cluster $C$ which has the maximal intra-cluster score [see (12)], they define an indicator vector $x$ whose element $x(a)$ is the confidence of assignment $a$ belonging to cluster $C$; the norm of $x$ is fixed to 1. They then seek the optimal solution $x^* = \arg\max_x x^{\top} M x$. As we know, $M$ is a symmetric and nonnegative matrix. By the Rayleigh quotient theorem, $x^{\top} M x$ is maximized when $x$ is the principal eigenvector of $M$. Since $M$ has nonnegative elements, by the Perron–Frobenius theorem, the elements of $x^*$ will be in the interval [0, 1]. Hence, we can calculate the principal eigenvector $x^*$ to determine the correct correspondences.

Inspired by the method in [30], we first initialize the list $L$ with the set of possible assignments. Then we use the individual feature $V(a)$ and the pairwise feature $M(a,b)$ defined above to build the affinity matrix $M$, which covers all possible assignments and is symmetric and nonnegative. From $M$, the principal eigenvector $x^*$ can be calculated. We start by accepting the assignment $a^*$ whose eigenvector value is maximal. Next, we reject all other assignments in conflict with $a^*$ under the one-to-one matching constraint. Then we accept the next most confident assignment and reject the ones in conflict with it. This procedure is repeated until all assignments are either accepted or rejected. The accepted assignments are the final results of the name-face association.

3) Spectral Matching With Priors: The method introduced above is conducted in a totally unsupervised fashion. However, sometimes we have certain prior knowledge, such as a correct assignment of a name and a face known beforehand. The question is whether we can benefit from such priors in spectral matching. A known assignment can be obtained by the alignment of the film script and the subtitle using the method described in [18]. Although the global matching method we propose in this paper does not need the timing information from the subtitle to generate local name cues, we want to investigate whether the local name cues, if available, can improve the global matching result. We first obtain the subtitle text from the video by OCR and then align the film script with the subtitle text by a dynamic time warping algorithm [31].
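Before turning to priors, the basic spectral matching procedure can be sketched compactly. The Gaussian agreement forms for the individual and pairwise features are taken from the reconstruction of (10) and (11) above and may differ from the authors' exact formulas; the pipeline itself (build the assignment affinity matrix, take its principal eigenvector, then greedily accept assignments under the one-to-one constraint) follows Leordeanu and Hebert [30]:

```python
import numpy as np

def spectral_match(W_name, W_face, sigma=1.0):
    """Match vertices of two affinity networks by spectral matching.

    W_name, W_face -- n x n affinity matrices (diagonal = occurrence
    feature, off-diagonal = co-occurrence strength).  Returns a sorted
    list of (name_index, face_index) assignments.
    """
    n = len(W_name)
    pairs = [(i, a) for i in range(n) for a in range(n)]
    m = len(pairs)
    M = np.zeros((m, m))
    for x, (i, a) in enumerate(pairs):
        # Individual feature: agreement of the two self-weights.
        M[x, x] = np.exp(-(W_name[i][i] - W_face[a][a]) ** 2 / sigma ** 2)
        for y, (j, b) in enumerate(pairs):
            if x == y or i == j or a == b:   # same pair, or one-to-one conflict
                continue
            # Pairwise feature: agreement of the two edge weights.
            M[x, y] = np.exp(-(W_name[i][j] - W_face[a][b]) ** 2 / sigma ** 2)
    # Principal eigenvector of the symmetric nonnegative matrix M.
    vals, vecs = np.linalg.eigh(M)
    v = np.abs(vecs[:, np.argmax(vals)])
    # Greedy accept/reject under the one-to-one matching constraint.
    accepted, used_names, used_faces = [], set(), set()
    for x in np.argsort(-v):
        i, a = pairs[x]
        if i in used_names or a in used_faces:
            continue
        accepted.append((i, a))
        used_names.add(i)
        used_faces.add(a)
    return sorted(accepted)
```

On a toy pair of networks where the face affinity matrix is a row/column permutation of the name affinity matrix, the greedy discretization recovers that permutation.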
The result is that each script line is tagged with time stamps from the subtitle. The speaking face tracks are then labeled with the names which have the corresponding time stamps. In each face track cluster which we built in Section II, certain tracks are thus labeled with a name. Due to errors in the alignment of the two text sources and the imprecision of speaking face track detection, some tracks may be mislabeled with wrong names; however, there is no mechanism for error correction in [18]. Face tracks from the same character may therefore be labeled with different names. Hence, each face track cluster should be assigned its majority name. The probability that the cluster $C_k$ is assigned the majority name $n^*$ is defined as

$P(n^* \mid C_k) = N_{n^*} / N_k$   (13)

where $N_{n^*}$ is the number of face tracks which are assigned the name $n^*$ in the cluster $C_k$, and $N_k$ is the total number of face tracks in the cluster $C_k$. To obtain the most believable assignment of face and name, we select the one which has the highest probability value and consider it the known assignment.

Given a known assignment $a^* = (p, p')$, the relationship of any other assignment $a$ with $a^*$ [i.e., $M(a, a^*)$] is therefore more reliable. For a name $i$, there is a set $A_i$ of possible assignments, and thus a set of pairwise features with $a^*$. Among them, we use a "The Best Takes All" operation:

$M(a, a^*) = \begin{cases} M(a, a^*) & \text{if } M(a, a^*) = \max_{b \in A_i} M(b, a^*) \\ 0 & \text{otherwise} \end{cases}$   (14)

This operation is done for all the names. By the symmetry property, $M(a^*, a)$ is set to the same value as $M(a, a^*)$. In addition, since $a^* = (p, p')$ is the known correct assignment, under the one-to-one matching constraint the assignments of the form $(p, q')$, $q' \neq p'$, or $(q, p')$, $q \neq p$, are in conflict with it. In the matrix $M$, the entries corresponding to these conflicting assignments are all set to 0. After all of these operations, the affinity matrix incorporates the prior knowledge and becomes more sparse. Experimental results showed that incorporating priors can facilitate the graph matching.

IV. APPLICATIONS

Up to now, we have associated a name with each speaking face track cluster; thus, all the speaking face tracks can be identified. The remaining non-speaking face tracks detected earlier can also be classified into the nearest speaking face track clusters, based on the EMD defined in (2), and names can be associated with them. Note that this classification relies only on facial features.

Based on the result of character identification, there are many possible applications, such as character-based video retrieval, personalized video summarization, intelligent playback, video semantic mining, etc. Here we provide a platform for character-centered film browsing, on which users can use character names to search and digest the film content.

A. Character Relationship Mining

To facilitate character-centered browsing, the character relationships are mined first, which includes the determination of the leading characters and of cliques. Since the name affinity network and the face affinity network both describe the relationships of the characters, and the name affinity network is more accurate, the relationship mining is conducted on the name affinity network.

From the social network analysis point of view, the leading character can be considered the one who has high centrality in the name affinity network. The centrality of a character is defined from its connection strengths in the network. The leading characters can then be determined using the method of detecting the centrality gap among the characters [21]. We sort the centralities of the characters in descending order, calculate the centrality difference between each two adjacent ones, and set the maximum difference as the centrality gap, which distinguishes the leading characters from the others. The characters whose centrality lies above the gap are determined to be the leading ones.

A clique is a subset of a network in which the actors are more closely and intensely tied to one another than to the other members of the network. For clique detection, we use agglomerative hierarchical clustering [32]. The individuals are first initialized as cliques, and an empty clique list $L$ is allocated. The major steps are as follows:

ALGORITHM (Agglomerative Hierarchical Clustering)

1) Begin: initialize each individual as a clique
2) initialize the empty clique list $L$
3) do
4) find the nearest cliques, say $\mathcal{C}_i$ and $\mathcal{C}_j$
5) if their distance $d(\mathcal{C}_i, \mathcal{C}_j)$ is small enough to merge
6) then merge $\mathcal{C}_i$ and $\mathcal{C}_j$ and save the new clique into $L$
7) else break
8) until only one clique remains
9) return $L$

where the distance $d$ between two cliques is defined as follows:

$d(\mathcal{C}_i, \mathcal{C}_j) = \dfrac{1}{n_i n_j} \sum_{a \in \mathcal{C}_i} \sum_{b \in \mathcal{C}_j} d(a, b)$   (15)

and $n_i$ and $n_j$ are the numbers of characters in $\mathcal{C}_i$ and $\mathcal{C}_j$. In each step, the newly merged clique is saved into the list $L$. We classify the resulting cliques into dyads, which have two members, triads, which have three members, and larger cliques. They are listed in a summary of the film for character-centered browsing.

B. Character-Centered Browsing

Now we provide a platform to support character-centered film browsing. We have identified the faces of the characters; hence, we can use character names to annotate the scenes in
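The centrality-gap rule for leading-character detection can be sketched as follows. Degree-style centrality (summing a character's edge weights in the affinity network) is an assumption here, since the paper's exact centrality formula is not reproduced in this excerpt, and the function name is ours:

```python
def leading_characters(names, affinity):
    """Detect leading characters via the largest centrality gap.

    `affinity` is the name affinity matrix; centrality is taken as the
    sum of a character's off-diagonal edge weights (one common
    definition, assumed here).
    """
    centrality = {
        n: sum(w for j, w in enumerate(affinity[i]) if j != i)
        for i, n in enumerate(names)
    }
    ranked = sorted(names, key=lambda n: -centrality[n])
    # Largest difference between adjacent sorted centralities is the gap.
    diffs = [centrality[ranked[k]] - centrality[ranked[k + 1]]
             for k in range(len(ranked) - 1)]
    gap_at = diffs.index(max(diffs))
    return ranked[:gap_at + 1]   # everyone above the gap leads
```

With two strongly connected characters and two peripheral ones, the gap falls after the second character, so exactly the two leads are returned.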
TABLE VI
SPEAKING FACE TRACK DETECTION
TABLE IV
QUERY EXAMPLES
A. Face Clustering
As the preliminary, we start with the speaking face track de-
tection. To assess its accuracy, we segment three 30-min clips,
respectively, from the films F1, F3, and F8. The statistics of the
experiment are shown in Table VI, where the columns “Face trk”
and “Sp. trk” are the total numbers of face tracks and speaking
the video. The annotation structure of one scene is defined as
face tracks contained in the clips.
follows:
After speaking face track detection, we cluster them into
groups corresponding to the characters. For each cluster, the
cluster pruning mechanism is then used to refine the results.
The point whose confidence [see (3)] is lower than the
Users can use the names of characters or cliques in the query to confidence threshold is determined as the marginal
view the related video scenes. For the convenience of users, a point and pruned. Hence, is the parameter to control
summary on characters of the film is listed automatically which the purity of each cluster. The higher the value of is,
contains the characters (lead and others) and cliques. Taking the the more points will be pruned. To demonstrate the results of
film “Notting Hill” as an example, the summary is shown in our method on face track clustering, we change the value of
Table III. from 0.5 to 0.1 and obtain a clustering precision/recall
Based on the summary, the query can be represented by using curve (see Fig. 5). The term “recall” is used here to indicate
a short sentence, e.g., “1st scene of Anna”, or “All scenes of the proportion of the points not pruned against the total points.
William”. For each query, we need to extract the keywords to The calculation of precision and recall are given as follows. To
infer the intents of the users. The query is formulated as avoid possible confusion with the traditional definition, we use
follows: “ ” for distinguishing:
(16) (17)
where , (18)
. If there is no ordinal number
in the query, the default value is . Some examples of queries For comparison, we also use the minimum distance measure
are given in Table IV. instead of our method during clustering. The result is shown in
Fig. 5. The result of spectral clustering on faces before applying
EMD in Section II-C is also shown as the baseline. It can be seen
V. EXPERIMENT
that our method is more effective to characterize the similarity
To evaluate our character identification approach, the experi- between face tracks and get better clustering results.
ments are conducted on ten feature-length films: “Notting Hill”, After the cluster pruning, we collect the pruned marginal
“Pretty Woman”, “Sleepless in Seattle”, “You’ve Got Mail”, points and do a re-classification. As these are all speaking
“Devil Wears Prada”, “Legally Blond”, “Revolutionary Road”, face tracks, the speaker voice features are fused with the facial
“The Shawshank Redemption”, “Léon”, and “Mission: Impos- features for classification. In Section II-E, we have set a score
sible”. The information of these films are shown in Table V. threshold to discard noise. By changing the value of
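The primed precision/recall pair defined above can be sketched in a few lines. The snippet below uses hypothetical array names and simplifies by assuming that cluster labels have already been mapped onto ground-truth character identities; it is an illustration of the measurement, not the authors' implementation.

```python
import numpy as np

def pruned_precision_recall(pred_ids, true_ids, confidence, threshold):
    """Primed precision/recall sketch: prune face tracks whose cluster
    confidence falls below `threshold`, then score only the retained
    tracks (pred_ids assumed pre-mapped to ground-truth character ids)."""
    keep = confidence >= threshold
    n_kept = int(keep.sum())
    recall = n_kept / len(confidence)             # retained / total
    if n_kept == 0:
        return 0.0, recall
    correct = int((pred_ids[keep] == true_ids[keep]).sum())
    precision = correct / n_kept                  # correct / retained
    return precision, recall

# Sweeping the threshold traces out a precision/recall curve:
pred = np.array([0, 0, 1, 1, 2])
true = np.array([0, 1, 1, 1, 2])
conf = np.array([0.9, 0.2, 0.8, 0.7, 0.95])
p, r = pruned_precision_recall(pred, true, conf, 0.5)
```

Lowering the threshold keeps more marginal points, which raises recall' while typically lowering precision'; that trade-off is what the curves in Figs. 5 and 6 visualize.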
By changing the value of this threshold from 0.75 to 0.1, we can also get a precision/recall curve for face track classification (see Fig. 6). Similarly, here the term "recall" means the proportion of face tracks which are classified. The calculation of precision and recall is given as follows; again, a prime distinguishes them from the traditional definitions:

    precision' = (number of correctly classified face tracks) / (number of classified face tracks)   (19)

    recall' = (number of classified face tracks) / (number of total face tracks)   (20)

Before the face-name association stage, our work is conducted on speaking face tracks only. Thus, after face-name association, we also classify the non-speaking face tracks into the clusters we have built. The classification of non-speaking face tracks is based on facial features only. The results are also demonstrated in Fig. 6. As expected, the performance of the multi-modal features (face + voice) is better than that of the single feature (face) on the marginal points.

Fig. 6. Precision/recall curves of face track classification.

TABLE VII
NAME-FACE ASSOCIATION

TABLE VIII
NAME-FACE ASSOCIATION WITH PRIOR

B. Face-Name Association

We have obtained different clusters of face tracks corresponding to different characters. To assign names to these clusters, a spectral matching method is employed to achieve vertex matching between the name and face networks. The results on the ten films are shown in Table VII. It can be seen that the accuracy on the thriller and action films (F9 and F10) is lower than on the others. This is due to the more severe variation of face pose and illumination in thriller and action films. In F9, the characters sometimes wear masks. In a scene of F10, the hero even disguises his face as another character. These matters affect the face clustering and bring noise into the face affinity matrix. Thus, more errors occur in face-name association. Since the proposed method can incorporate priors to improve the matching, we give one known assignment of a name and a face, which is generated in Section III-C3, as a prior for each film. The results (see Table VIII) validate the effectiveness of adding priors in the matching process.

A comparison with the existing local matching approach was carried out. The approach [18] proposed by Everingham et al. was evaluated on the same dataset. We implemented the approach strictly obeying the original description in [18]. The alignment of the film script and the subtitles had been implemented in Section III-C3 to obtain local name cues. The speaking face tracks were then labeled with a temporally local name and set as exemplars. Other face tracks were classified to these exemplars for labeling. A precision/recall curve was obtained to demonstrate the performances. The term "recall" means the proportion of tracks which are assigned a name, and "precision" is the proportion of correctly labeled tracks [18]. To compare with this approach, we use the names assigned to the clusters to label the face tracks in the clusters. Similarly, we also obtain a precision/recall curve by changing the score threshold
defined in Section II-E to discard some face tracks without labeling. To demonstrate the improvement brought by the prior knowledge introduced in face-name matching, the character identification results with the prior are also illustrated as a precision/recall curve. The three curves are shown in Fig. 7. It can be seen that the global matching method we propose is comparable to the local matching method [18], while using a smaller information source (film script) than it does (film script + subtitles). At high levels of recall, our method even performs better. This mainly relies on the effectiveness of the face track distance measure in clustering and the employment of the multi-modal features in cluster pruning. From the curve of "Global matching + prior", we can also find that incorporating certain local name cues as prior knowledge does improve the character identification results, though our method does not actually rely on them. Since our method only needs the film script as the text information source, it can be applied under circumstances where not enough timing information can be found. In addition, the method in [18] was restricted to frontal faces, while our method deals with multi-view faces.

TABLE IX
RELATIONSHIP MINING

TABLE X
USER EVALUATION OF CHARACTER-CENTERED BROWSING

... the test. They are postgraduate students and research staff from 24 to 40 years old. They were each asked to use five queries in the form of the examples shown in Table IV to browse the related clips in the films. Then they each gave a score to the browsing results on three attributes: completeness, acceptance, and novelty. Completeness measures whether the users watched what they wanted. Acceptance measures how much the browsing style is accepted. Novelty measures whether it exceeds the browsing expectations of the users and brings them a new experience. The score is based on the following scale: 5-very good, 4-good, 3-neutral, 2-bad, 1-very bad. The scores from all subjects are given in Table X. The results indicate that most of the users are interested in the character-centered browsing and accept this new browsing style. From the automatically generated character summary of the film, they can grasp the structure of the characters in the film and use it to select and digest the character-related content. This provides a new alternative for film content organization and summarization. One user suggested that the video annotation could be extended from the scene level to the shot level, which would make it more accurate and complete.
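To make the global face-name association step concrete, the following is a small, generic sketch of spectral graph matching in the spirit of [30], not the exact formulation used in this paper: candidate name-to-cluster assignments are scored by how well the face affinity and name affinity matrices agree, and the principal eigenvector of the pairwise compatibility matrix is discretized greedily under one-to-one constraints. All names and parameters (e.g., `gamma`) are illustrative.

```python
import numpy as np

def spectral_match(A_face, A_name, gamma=5.0, iters=100):
    """Generic spectral-matching sketch: assign each of n face clusters
    to one of n names so that the two affinity networks agree."""
    n = A_face.shape[0]
    # Compatibility between candidate assignments (i -> a) and (j -> b):
    # high when affinity A_face[i, j] resembles affinity A_name[a, b].
    M = np.zeros((n * n, n * n))
    for i in range(n):
        for a in range(n):
            for j in range(n):
                for b in range(n):
                    if (i == j) != (a == b):
                        continue  # rule out conflicting pairings
                    M[i * n + a, j * n + b] = np.exp(
                        -gamma * abs(A_face[i, j] - A_name[a, b]))
    # Principal eigenvector of M via power iteration.
    v = np.ones(n * n)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    # Greedy discretization under one-to-one constraints.
    assignment, used = {}, set()
    for idx in np.argsort(-v):
        i, a = divmod(idx, n)
        if i not in assignment and a not in used:
            assignment[i] = a
            used.add(a)
    return assignment

# Toy check: a name network that is a permuted copy of the face network
# should be matched by recovering the permutation.
A_face = np.array([[1.0, 0.9, 0.1],
                   [0.9, 1.0, 0.5],
                   [0.1, 0.5, 1.0]])
perm = [1, 2, 0]                    # face i corresponds to name perm[i]
A_name = np.empty_like(A_face)
for i in range(3):
    for j in range(3):
        A_name[perm[i], perm[j]] = A_face[i, j]
match = spectral_match(A_face, A_name)
```

A known name-face assignment, like the prior used in Table VIII, can be imposed in such a scheme by zeroing the compatibility of every candidate assignment that conflicts with it before extracting the eigenvector.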
to the query of the user. 3) We will explore generating a movie trailer related to a certain character or a group of characters.

ACKNOWLEDGMENT

The authors would like to thank S. Chen, Y. Wu, and C. Zang for a number of helpful discussions and for sharing necessary code. The authors are also grateful to X.-Y. Chen for experimental data preparation and labeling.

REFERENCES

[1] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Comput. Surv., vol. 35, no. 4, pp. 399–458, 2003.
[2] S. Satoh and T. Kanade, "Name-it: Association of face and name in video," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 1997, pp. 368–373.
[3] R. Houghton, "Named faces: Putting names to faces," IEEE Intell. Syst., vol. 14, no. 5, pp. 45–50, 1999.
[4] J. Yang and A. G. Hauptmann, "Naming every individual in news video monologues," in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, pp. 580–587.
[5] J. Yang, A. Hauptmann, and M.-Y. Chen, "Finding person x: Correlating names with visual appearances," in Proc. Int. Conf. Image and Video Retrieval, 2004, pp. 270–278.
[6] J. Yang, R. Yan, and A. G. Hauptmann, "Multiple instance learning for labeling faces in broadcasting news video," in Proc. 13th Annu. ACM Int. Conf. Multimedia, 2005, pp. 31–40.
[7] D. Ozkan and P. Duygulu, "Finding people frequently appearing in news," in Proc. Int. Conf. Image and Video Retrieval, 2006, pp. 173–182.
[8] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Miller, and D. Forsyth, "Names and faces in the news," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 848–854.
[9] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, "Automatic face naming with caption-based supervision," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2008.
[10] V. Jain, E. Learned-Miller, and A. McCallum, "People-LDA: Anchoring topics to people using face recognition," in Proc. IEEE Int. Conf. Computer Vision, 2007.
[11] O. Arandjelovic and A. Zisserman, "Automatic face recognition for film character retrieval in feature-length films," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2005, pp. 860–867.
[12] J. Sivic, M. Everingham, and A. Zisserman, "Person spotting: Video shot retrieval for face sets," in Proc. Int. Conf. Image and Video Retrieval, 2005, pp. 226–236.
[13] A. W. Fitzgibbon and A. Zisserman, "On affine invariant clustering and automatic cast listing in movies," in Proc. Eur. Conf. Computer Vision, 2002, vol. 3, pp. 304–320.
[14] O. Arandjelovic and R. Cipolla, "Automatic cast listing in feature-length films with anisotropic manifold space," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2006, pp. 1513–1520.
[15] Y. Gao et al., "Cast indexing for videos by ncuts and page ranking," in Proc. Int. Conf. Image and Video Retrieval, 2007, pp. 441–447.
[16] Z. Liu and Y. Wang, "Major cast detection in video using both speaker and face information," IEEE Trans. Multimedia, vol. 9, no. 1, pp. 89–101, 2007.
[17] Y. Li, S. Narayanan, and C.-C. J. Kuo, "Content-based movie analysis and indexing based on audiovisual cues," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 8, pp. 1073–1085, 2004.
[18] M. Everingham, J. Sivic, and A. Zisserman, ""Hello! My name is . . . Buffy" automatic naming of characters in TV video," in Proc. British Machine Vision Conf., 2006, pp. 889–908.
[19] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, "Movie/script: Alignment and parsing of video and text transcription," in Proc. 10th Eur. Conf. Computer Vision, 2008, pp. 158–171.
[20] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2008.
[21] C.-Y. Weng, W.-T. Chu, and J.-L. Wu, "Rolenet: Treat a movie as a small society," in Proc. Int. Workshop Multimedia Information Retrieval, 2007, pp. 51–60.
[22] J. Scott, Social Network Analysis: A Handbook. Newbury Park, CA: Sage, 1991.
[23] Y. Li, H. Z. Ai, C. Huang, and S. H. Lao, "Robust head tracking with particles based on multiple cues fusion," in Proc. HCI/ECCV, 2006, pp. 29–39.
[24] Y. Wu, W. Hu, T. Wang, Y. Zhang, J. Cheng, and H. Lu, "Robust speaking face identification for video analysis," in Proc. Pacific Rim Conf. Multimedia, 2007, pp. 665–674.
[25] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.
[26] Y. Rubner, C. Tomasi, and L. J. Guibas, "A metric for distributions with applications to image databases," in Proc. IEEE Int. Conf. Computer Vision, 1998, pp. 59–66.
[27] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Adv. Neural Inf. Process. Syst. 14, pp. 849–856, 2001.
[28] C. Snoek, M. Worring, and A. Smeulders, "Early versus late fusion in semantic video analysis," in Proc. 13th Annu. ACM Int. Conf. Multimedia, 2005, pp. 399–402.
[29] T. Mei, X.-S. Hua, L. Yang, and S. Li, "Videosense—towards effective online video advertising," in Proc. 15th Annu. ACM Int. Conf. Multimedia, 2007, pp. 1075–1084.
[30] M. Leordeanu and M. Hebert, "A spectral technique for correspondence problems using pairwise constraints," in Proc. 10th IEEE Int. Conf. Computer Vision, 2005, pp. 1482–1489.
[31] C. S. Myers and L. R. Rabiner, "A comparative study of several dynamic time-warping algorithms for connected word recognition," Bell Syst. Tech. J., vol. 60, pp. 1389–1409, 1981.
[32] R. Duda, P. Hart, and D. Stork, Pattern Classification. New York: Wiley, 2000.

Yi-Fan Zhang (S'09) received the B.E. degree from Southeast University, Nanjing, China, in 2004. He is currently pursuing the Ph.D. degree at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
In 2007, he was an intern student in the Institute for Infocomm Research, Singapore. Currently he is an intern student in the China-Singapore Institute of Digital Media. His research interests include multimedia, video analysis, and pattern recognition.

Changsheng Xu (M'97–SM'99) is a Professor in the Institute of Automation, Chinese Academy of Sciences, and Executive Director of the China-Singapore Institute of Digital Media. He was with the Institute for Infocomm Research, Singapore, from 1998 to 2008. He was with the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, from 1996 to 1998. His research interests include multimedia content analysis, indexing and retrieval, digital watermarking, computer vision, and pattern recognition. He has published over 170 papers in those areas.
Dr. Xu is an Associate Editor of the ACM/Springer Multimedia Systems Journal. He served as Program Co-Chair of the 2009 ACM Multimedia Conference, Short Paper Co-Chair of ACM Multimedia 2008, General Co-Chair of the 2008 Pacific-Rim Conference on Multimedia and the 2007 Asia-Pacific Workshop on Visual Information Processing (VIP2007), Program Co-Chair of VIP2006, and Industry Track Chair and Area Chair of the 2007 International Conference on Multimedia Modeling. He also served as a Technical Program Committee Member of major international multimedia conferences, including the ACM Multimedia Conference, the International Conference on Multimedia & Expo, the Pacific-Rim Conference on Multimedia, and the International Conference on Multimedia Modeling. He received the 2008 Best Editorial Member Award of the ACM/Springer Multimedia Systems Journal. He is a member of ACM.
Hanqing Lu (M'05–SM'06) received the Ph.D. degree from Huazhong University of Science and Technology, Wuhan, China, in 1992.
Currently, he is a Professor in the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include image similarity measures, video analysis, object recognition, and tracking. He has published more than 100 papers in those areas.

Yueh-Min Huang (M'98) received the M.S. and Ph.D. degrees in electrical engineering from the University of Arizona, Tucson, in 1988 and 1991, respectively.
He is a Professor and Chairman of the Department of Engineering Science, National Cheng-Kung University, Tainan, Taiwan. His research interests include multimedia communications, wireless networks, artificial intelligence, and e-Learning. He has coauthored two books and has published about 200 refereed professional research papers.
Dr. Huang has received many research awards, such as the Best Paper Award of the 2007 IEA/AIE Conference; the Awards of the Acer Long-Term Prize in 1996, 1998, and 1999; and Excellent Research Awards of the National Microcomputer and Communication Contests in 2006. He has been invited to give talks at, or has served frequently on the program committees of, national and international conferences. He is on the editorial boards of the Journal of Wireless Communications and Mobile Computing, the Journal of Security and Communication Networks, and the International Journal of Communication Systems. He is a member of the IEEE Communication, Computer, and Circuits and Systems Societies.