3. Octopus Aggressive Search of Multi-Modality Data Using Multifaceted Knowledge Base

The document presents Octopus, an aggressive search mechanism for multimedia retrieval that utilizes a multifaceted knowledge base to enhance the retrieval of multi-modality data. Unlike traditional content-based retrieval systems, Octopus allows users to input seed objects of any modality as hints, facilitating the retrieval of semantically relevant results without the need for highly representative samples. The system employs a layered graph model to capture the relationships between media objects, improving retrieval performance through user feedback and interaction.

Octopus: Aggressive Search of Multi-Modality Data Using Multifaceted Knowledge Base


Jun Yang (1,2), Qing Li (1), Yueting Zhuang (2)
(1) Department of Computer Engineering and Information Technology, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, HKSAR, China. Tel. 852-27889695. {itjyang, itqli}@cityu.edu.hk
(2) Department of Computer Science, Zhejiang University, Hangzhou, China, 310027. Tel. 86-571-87951903. [email protected]

ABSTRACT
An important trend in Web information processing is the support of multimedia retrieval. However, the most prevailing paradigm for multimedia retrieval, content-based retrieval (CBR), is a rather conservative one whose performance depends on a set of specifically defined low-level features and a carefully chosen sample object. In this paper, an aggressive search mechanism called Octopus is proposed, which addresses the retrieval of multi-modality data using multifaceted knowledge. In particular, Octopus promotes a novel scenario in which the user supplies seed objects of arbitrary modality as the hint of his information need, and receives a set of multi-modality objects satisfying his need. The foundation of Octopus is a multifaceted knowledge base constructed on a layered graph model (LGM), which describes the relevance between media objects from various perspectives. A link analysis based retrieval algorithm is proposed based on the LGM. A unique relevance feedback technique is developed to update the knowledge base by learning from user behaviors, and to enhance the retrieval performance in a progressive manner. A prototype implementing the proposed approach has been developed to demonstrate its feasibility and capability through illustrative examples.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – query formulation, relevance feedback, query models, search process.

General Terms
Algorithms, Management, Design.

Keywords
Multi-modality, multimedia retrieval, multifaceted knowledge base, layered graph model, link analysis, relevance feedback.

1. INTRODUCTION
A close examination of content-based multimedia retrieval (CBR) systems reveals one of their common implications: the sample object used to formulate a query is virtually an eligible result of the query, usually the most relevant one. This observation leads to the following paradox. Suppose the user needs only one result. If he is able to find a good sample, he need not bother to input it into the retrieval system, because the sample is exactly what he is looking for. If that is not the case, the users of CBR systems are still plagued by the task of finding representative samples to formulate effective queries. Quite often, the user has only a vague idea about some details of the desired results. On the other hand, even if the user has a clear idea of what he would like to find, he may not be able to convey it to the system due to the lack of a “right-to-target” sample object at hand.

The difficulty of finding good samples reveals a recognized problem in CBR systems: the lack of data semantics, which is of essential importance in judging the quality of retrieval results. However, what most CBR systems use are low-level features of media objects [footnote 1], such as color histograms for images and motion vectors for videos. Although these features reflect the data semantics to a certain degree, they are without doubt inadequate to capture the semantics of media objects precisely. Providing good samples is a natural requirement of using low-level features: the system relies on the representative features of the sample to approximate the underlying semantics desired by the user. (There are also CBIR systems that use sketches or templates to formulate queries [5], which can be generally regarded as samples.) Moreover, since the low-level features are also media-specific, the sample object must be of the same modality as the desired results. The media objects retrieved by CBR systems are perceptually similar to the sample (they look like or sound like it), but may not satisfy the requirement of the user, who judges the relevance of an object at the semantic level.

Therefore, we regard CBR systems as conservative systems, whose performance depends on a set of specifically defined features and a carefully chosen sample object. Table 1 provides a summary of the CBR approaches vis-à-vis their drawbacks. To remedy these drawbacks, we propose a more aggressive mechanism, Octopus, for the search of multi-modality data. It is characterized as aggressive based on the following two properties:

1. It exploits the knowledge on multiple aspects regarding the relevance between media objects. Based on such multifaceted knowledge, the retrieval results are not necessarily similar to the sample perceptually, but related to it in a more sophisticated and semantics-flavored manner.

Footnote 1: In this paper, a media object is an object of any modality, such as an image, video, text, etc. Meanwhile, if not indicated explicitly, we use “object” and “media object” interchangeably.

Copyright is held by the author/owner(s).
WWW 2002, May 7-11, 2002, Honolulu, Hawaii, USA.
ACM 1-58113-449-5/02/0005.

Table 1: CBR paradigm, drawbacks, and suggested remedies to multimedia retrieval

            | CBR paradigm | Drawbacks | Octopus
Interaction | highly representative sample objects | difficulty of finding suitable samples | multi-modality objects serving as hints
Data index  | low-level features | inadequate to capture semantics | multifaceted knowledge (user behaviors, structure, content)
Results     | single-modality objects that are perceptually similar to the sample | no semantically relevant results | multi-modality, semantically related objects

2. It explores the relationships between media objects of different modalities, such that it becomes possible, for example, that an audio clip is retrieved from a sample image.

The objective of Octopus is to promote a novel scenario for multimedia retrieval: The user starts the search by supplying a set of seed objects as the hints of his information need, which can be of any modality (even different from that of the desired objects), and which are not necessarily eligible results by themselves. From the seeds, the system figures out the user’s need and returns a set of multi-modality objects that potentially satisfy his need. The user can give further hints by identifying the results (of any modality) that are close to his need, based on which the system improves the estimation of his need and refines the results accordingly. Therefore, the most prominent advantage of Octopus over traditional CBR systems is that externally it relieves the users from the task of providing highly representative samples, and internally it employs a broader range of knowledge to retrieve semantically relevant results.

[Figure 1: Overview of Octopus Mechanism. Components: multi-modality search, link analysis, and relevance feedback, built on the multifaceted KB (layered graph model).]

To support all the functionalities required by such a scenario, a suite of unique models, algorithms, and strategies is developed in Octopus. As shown in Figure 1, the foundation of the whole mechanism is a multifaceted knowledge base describing the relevance between multi-modality objects. The kernel of the knowledge base is a layered graph model (LGM), which characterizes the inter-object relevance estimated from three perspectives: (1) history of user behaviors, (2) structural relationships between media objects, and (3) content of media objects. Link structure analysis, an established technique in web-based applications, is adapted for the retrieval of multi-modality data based on the LGM. The relevance feedback method used in Octopus can enrich the knowledge stored in the LGM by learning from user-system interactions, such that Octopus has a hill-climbing nature (indicated by the loop in Figure 1) that allows its performance to be progressively enhanced based on the knowledge learned from previous queries and feedback.

We do not provide any quantitative performance evaluation in this paper, mainly due to the lack of a benchmark for such multi-modality search. In fact, the main contribution of this paper is not a performance improvement, but a novel retrieval scenario that is not even possible with previous retrieval approaches. Some characteristic queries and their results obtained using our prototype system are displayed to demonstrate the variety and flexibility of search in this scenario.

The rest of this paper is organized as follows. In Section 2, we present a formal description of the layered graph model as the core of the multifaceted knowledge base. The link analysis based algorithms for multi-modality data retrieval and relevance feedback are elaborated in Section 3. In Section 4, we introduce a prototype implementing the proposed approach and demonstrate its retrieval capability by illustrative examples. In Section 5, we discuss how our approach relates to previous work on multimedia retrieval and link structure analysis. Finally, we give the conclusion and suggest future work in Section 6.

2. MULTIFACETED KNOWLEDGE BASE
In this section, we introduce a layered graph model as the core of the multifaceted knowledge base, along with a description of the knowledge acquisition process.

2.1 Layered Graph Model (LGM)
As the foundation of the retrieval functionality, the multifaceted knowledge base accommodates a broad range of knowledge regarding the relevance between media objects. In this paper, we use the term “media object” to refer to an object of various modalities, such as an image, a video clip, and a textual document. Some media objects can be regarded as composite objects that are composed from many “primitive” objects, e.g., a video clip is essentially a sequence of images.

In our approach, the relevance between two media objects can be evaluated mainly from three different perspectives: (1) Users’ interpretation of the two objects, which can be deduced from user interactions, e.g., designating them as the positive examples of the same query. (2) Structural relationships between them, e.g., there is a hyperlink between them. (3) The similarity between two objects in terms of their content, which can be estimated based on their low-level features. To accommodate the knowledge on the three aspects, we develop a layered graph model (LGM) as the core of the multifaceted knowledge base, with each layer

modeling knowledge on one aspect. The formal definition of LGM is given as follows.

Definition. The layered graph model (LGM) consists of three superimposed knowledge layers, which from top to bottom are the user layer, the structure layer, and the content layer. A knowledge layer is an undirected graph G=(V, E), where V is a finite set of vertices and E is a finite set of edges. Each element in V corresponds to a media object O_i ∈ O, where O is the collection of media objects in the database. E is a ternary relation defined on V×V×R, where R represents the real numbers. Each edge in E has the form <O_i, O_j, r>, denoting a link between O_i and O_j with r as the weight of the link. The graph corresponds to a |V|×|V| adjacency matrix [footnote 2] M=[m_ij], where each element m_ij=r if there is an edge <O_i, O_j, r> between O_i and O_j, and m_ij=0 if there is no edge between them. Obviously, M is a symmetric matrix (m_ij=m_ji), and its elements on the diagonal are set to zero (m_ii=0). The vertices of the three layers correspond to the same set of media objects, while the links in each layer denote the relevance between two media objects defined from one of the three perspectives mentioned above.

Figure 2 illustrates the LGM. Note that the order of the three layers is fixed, which reflects the degree of reliability of the inter-object relevance suggested by the links in each layer. The user layer is on the top, because user judgment is very reliable (though not always, considering subjective errors and biases) in suggesting the relevance between media objects. A structure link is also a strong indicator of the relevance between objects, but is not as reliable as a user link. The content layer is at the bottom, since the similarity calculated based on low-level features does not have any well-defined mapping to object relevance perceived at the semantic level.

[Figure 2: The layered graph model (LGM). Three stacked layers (user layer, structure layer, content layer) over the same set of objects. Legend: text, video, image, audio. The same legend is used for all the other figures.]

Different from the convention of storing the index of each object with the object itself, the LGM stores the knowledge as the links between media objects. An advantage of such link-based knowledge representation is that the retrieval can be restricted to a relatively small locality connected via links instead of the whole database, and therefore it can effectively reduce the search space and afford more sophisticated retrieval algorithms. However, the graph structure is also expensive in computation and storage, especially when the number of nodes and links gets large.

2.2 Knowledge Acquisition
In the following, we describe the knowledge acquisition process on each knowledge layer, i.e., how to construct the three types of links in the LGM.

• User Layer. A user link reflects the user belief that two media objects are relevant in some sense, and the weight of a user link indicates the degree of confidence of such belief. A straightforward way of obtaining user links is to let the user create all the links manually, which is nevertheless a time-consuming and labor-intensive process. Alternatively, the links can be acquired implicitly by learning from user-system interactions in the retrieval process, specifically, relevance feedback. Consider a typical scenario in CBR systems: a user starts a query with object A as the sample object, and among the results returned by the system he designates objects B and C as relevant examples to the query. In this case, we may create new links between A and B, A and C, or even B and C. As the user interactions proceed, the coverage and the quality of user links are progressively improved. The advantage of this strategy is that it exploits the interactions of the entire population of users for knowledge acquisition, and thereby saves significant human labor. A detailed algorithm for the updates of user links using the above strategy is presented in Section 3.4.

• Structure Layer. Structure links can be interpreted as spatial neighborhood, hyperlink, or composition relationships between two objects, depending on the physical environment where the data are collected. For example, for a typical organization of web pages in Figure 3(a), we can create the structure links as shown in (b). The textual content of a page is regarded as a single text object. An image or a video clip is regarded as being in the page either if it is embedded in the page or if it is pointed to by a hyperlink on it. All the media objects within a page are interconnected by structure links (e.g., objects A, B, and C are connected to each other). A hyperlink is mapped to structure links from the source object to all the objects in the destination page (e.g., A is linked with D and E, while E is linked with A, B, and C). For simplicity, the weights of all structure links are set to 1. The same strategy for structure link construction can be applied to other forms of hypermedia (e.g., a digital encyclopedia). Further, it can even be adapted to non-hypermedia data collections (e.g., e-books), by interpreting spatial vicinity as a hyperlink. Note that, compared with previous link analysis approaches, here we adopt the simplification of representing all the structure links as undirected links, in order to be consistent with user links and content links.

[Figure 3: Structure links construction in a web environment. (a) Web pages with hyperlinks: Page 1 holds text A, video B, and image C; Page 2 holds text D and image E. (b) The resulting structure links among A, B, C, D, and E.]

Footnote 2: The adjacency matrix defined here is slightly different from its mathematical definition, in which each component is a binary value indicating the existence of the corresponding edge.
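The definition of the LGM can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class name LayeredGraph, its method names, and the choice to treat a non-positive weight as "no link" are assumptions made for illustration. It keeps one symmetric, zero-diagonal weighted adjacency per layer over a shared object set.

```python
LAYERS = ("user", "structure", "content")  # top to bottom, as in the definition


class LayeredGraph:
    """Three undirected weighted graphs over one shared set of media objects.

    Hypothetical helper for illustration; names are not from the paper.
    """

    def __init__(self):
        self.objects = set()
        # layer name -> {object -> {neighbor -> weight}}
        self.links = {layer: {} for layer in LAYERS}

    def set_link(self, layer, a, b, weight):
        """Store <a, b, weight> symmetrically (m_ij = m_ji).

        Assumption: a weight <= 0 means the link does not exist.
        """
        if a == b:
            return  # diagonal elements stay zero (m_ii = 0)
        self.objects.update((a, b))
        for x, y in ((a, b), (b, a)):
            if weight > 0:
                self.links[layer].setdefault(x, {})[y] = weight
            else:
                self.links[layer].get(x, {}).pop(y, None)

    def weight(self, layer, a, b):
        """Return m_ab for the given layer, or 0.0 if there is no edge."""
        return self.links[layer].get(a, {}).get(b, 0.0)

    def adjacency_matrix(self, layer, order):
        """Build M = [m_ij] for the objects listed in `order`."""
        return [[self.weight(layer, a, b) for b in order] for a in order]


lgm = LayeredGraph()
lgm.set_link("user", "A", "B", 0.5)
lgm.set_link("structure", "A", "C", 1.0)
user_matrix = lgm.adjacency_matrix("user", ["A", "B", "C"])
```

Storing neighbors per object, rather than a dense matrix, reflects the remark above that retrieval can be restricted to a small locality connected via links.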

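The structure link rules of Section 2.2 can be sketched in a few lines. This is an illustrative sketch only: the function name structure_links and the input format are assumptions, and the hyperlink list below is one reading of the Figure 3 example as described in the text (a hyperlink from A to Page 2 and one from E to Page 1).

```python
from itertools import combinations


def structure_links(pages, hyperlinks):
    """Derive undirected structure links (all of weight 1).

    pages:      {page name -> list of objects in that page}
    hyperlinks: list of (source object, destination page)
    Both formats are hypothetical, chosen for this sketch.
    """
    links = set()
    # Rule 1: all media objects within a page are interconnected.
    for objs in pages.values():
        for a, b in combinations(objs, 2):
            links.add(frozenset((a, b)))
    # Rule 2: a hyperlink connects its source object to every object
    # in the destination page.
    for source, destination in hyperlinks:
        for obj in pages[destination]:
            if obj != source:
                links.add(frozenset((source, obj)))
    return links


pages = {"page1": ["A", "B", "C"], "page2": ["D", "E"]}
hyperlinks = [("A", "page2"), ("E", "page1")]
links = structure_links(pages, hyperlinks)
```

Using frozenset pairs keeps the links undirected, in line with the simplification adopted for the structure layer.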
• Content Layer. A content link reveals the similarity between the content of two objects, defined on primitive [footnote 3] and media-specific features, such as color histograms for images and motion vectors for video clips, with a weight indicating the degree of similarity. Obviously, content links only exist between objects of the same modality, and if no restriction is imposed, they can exist between any pair of such objects, which are then interconnected into several complete sub-graphs (one for each modality). However, since content links with low similarity are unreliable and noisy, we apply a cut-off threshold on the link weights to remove the low-weighted links. In practice, when a new object is registered into the database, it is compared with all other objects of the same modality, and links are created between it and those whose content similarity with it is above the threshold.

3. LINK ANALYSIS BASED RETRIEVAL AND RELEVANCE FEEDBACK
As illustrated in Figure 4, the retrieval process of Octopus can be described as a circle: the desired objects are retrieved through the upper half-circle, and the user evaluations are collected and incorporated into the knowledge base through the lower half-circle, which initiates a new circle to refine the previously retrieved results based on the updated knowledge. Consequently, this process has a hill-climbing nature in the sense that the retrieval performance is enhanced incrementally as the loop is repeated.

[Figure 4: Overview of link analysis based retrieval algorithm. Search half-circle: (1) generation of seeds, (2) spanning to candidates, (3) distillation to results. Feedback half-circle: user evaluation of the results yields feedback examples, (4) update of the layered graph model, (5) refinement.]

In this section, we describe the whole retrieval process in five steps (see Figure 4): (1) generating the seed objects as the hints of the user’s information need, (2) spanning the seeds to a collection of candidate objects via the links in the LGM, (3) distilling the results by ranking the candidates based on link structure analysis, (4) updating the LGM by incorporating the user evaluations on the current results, and (5) refining the retrieval results based on the updated LGM and the user evaluations.

3.1 Seed Generation
Seed objects play a role similar to that of query examples in the CBR paradigm: formulating user queries. Nevertheless, the differences between them are fundamental. On one hand, seed objects are not necessarily eligible results of the query, and therefore they need not be highly representative; on the other hand, seed objects can be of any modality, which may not be the same as that of the desired objects.

The user generates the seed objects either by selecting them from the database or by introducing (creating) new objects. In the latter case, the new object is automatically registered into the database, and its content links and structure links (if any) with existing objects are created (see Section 2.2). Obviously, there are no user links connected to the new object before it is involved in any user interactions. Note that this query formulation paradigm naturally subsumes the query-by-example and query-by-keyword paradigms, since the seed can be a media object (e.g., an image) or a piece of text consisting of several keywords.

3.2 Candidates Spanning
Since the seed objects provide the hints to the user’s need, it is reasonable to assume that the desired objects are related to the seeds in a certain manner, specifically, through a path in the LGM. The path can be made up of links belonging to different layers in the LGM. Based on this assumption, we identify a collection of candidate objects by spanning from the seed objects through the links in the LGM. This operation is equivalent to the construction of a small sub-graph around the seeds in the LGM. The candidate set C must satisfy the following two criteria:

(1) C must be rich in objects that are highly relevant to the seed objects.

(2) C must be relatively small, so that the computational cost of the distillation and feedback algorithms subsequently applied on it is affordable.

Both requirements favor the use of short paths in spanning, since short paths imply high relevance between the seeds and the candidates, and are less likely to produce a large candidate set. Consequently, we place a threshold on the maximum length of the path (viz. number of links) between a seed and a candidate. The threshold is usually very small (e.g., 2 or 3), depending on the scale of the data collection and the density of links. Only the objects that are reachable from the seeds through paths no longer than the threshold are identified as the candidates for further processing.

However, even after this threshold is applied, the size of the candidate set is still very unpredictable, mainly because of the varying number of links each object has, especially the structure links and content links. Some web pages may have hundreds or even thousands of hyperlinks pointing to them (e.g., the official site of ACM), which may result in a high density of structure links. Moreover, the number of content links is likely to be high when the corresponding object has many similar objects, and vice versa. Sometimes the number of candidates is so large that the subsequent processing is unaffordable and meaningless due to the low quality of the candidates.

Footnote 3: We use the term “primitive” instead of “low-level”, since the primitive feature for a text object is the keyword, which is not traditionally considered a low-level feature.
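The registration step for content links described in Section 2.2 can be sketched as follows. The paper does not fix a similarity measure, so cosine similarity over feature vectors is used here purely as an assumption, and the function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (an assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_links_for(new_obj, features, modality, threshold):
    """Return {existing object -> similarity} for the links to create.

    Only objects of the same modality as `new_obj` are compared, and
    only similarities above the cut-off threshold become links.
    """
    links = {}
    for other, vector in features.items():
        if other == new_obj or modality[other] != modality[new_obj]:
            continue
        similarity = cosine(features[new_obj], vector)
        if similarity > threshold:
            links[other] = similarity
    return links


# Toy feature vectors; real systems would use e.g. color histograms.
features = {
    "img1": [1.0, 0.0],
    "img2": [0.9, 0.1],
    "img3": [0.0, 1.0],
    "txt1": [1.0, 0.0],
}
modality = {"img1": "image", "img2": "image", "img3": "image", "txt1": "text"}
new_links = content_links_for("img1", features, modality, threshold=0.5)
```

In this toy data, txt1 has the same vector as img1 but is skipped because its modality differs, which mirrors the rule that content links exist only within one modality.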

[Figure 5: Candidates spanning. (a) Vertical perspective; (b) horizontal perspective. Paths a to f lead from the seed set to the candidates set across the user, structure, and content layers.]

Consequently, we put a second threshold on the total number of candidates. If the candidates generated by spanning go beyond the threshold, the exceeding ones are discarded. But what are the criteria for choosing the appropriate victims? In other words, how should the candidates be ranked so that the most promising ones will not be discarded? In our algorithm, the ranking of candidates is determined by the shortest path through which the candidate is reached from the seed. In particular, two factors of the path are considered: the length of the path and the types of links that constitute the path. The first factor captures the intuition that the closer two objects are, the more relevant they are to each other. The second factor takes into account the priorities of the three types of links. Consider two paths of the same length. If one path goes through the user layer while the other is at the content layer, it is natural to conclude that the two objects connected by the first path are more relevant than those connected by the second path. From this observation, we formulate the following three heuristic rules for ranking:

(1) A candidate c1 reached through a path shorter than that of another candidate c2 is ranked higher than c2.

(2) If two candidates are reached through two paths of the same length, they are ranked according to the lexicographic order of the paths.

(3) The candidates whose relative order cannot be decided by (1) and (2) are ranked randomly.

Suppose we use U, S, and C to denote user link, structure link, and content link respectively (with the order that U precedes S which, in turn, precedes C), and represent a path by the types of its links. Then, the rank of paths determined by the above heuristic rules is as follows:

U, S, C, UU, US, UC, SU, …, CS, CC, UUU, UUS, …

Figure 5 gives a vertical view and a horizontal view of the candidates spanning process. From the horizontal view, one can see how the spanning goes through different paths and jumps between layers: path a is of pattern ‘U’, path b is ‘C’, path c is ‘UU’, path d is ‘US’, path e is ‘SU’, and path f is ‘SC’. (The objects shown in a column represent the same object at different layers.)

The algorithm for candidates spanning together with the ranking is shown as follows:

Spanning(S, l, t)
  S: the seed set
  l: the maximum length of the path
  t: the upper bound on the number of candidates
  p: the string representing the pattern of a path
  nextpath(p): subroutine that returns the path that lexicographically succeeds the path p, e.g., nextpath(‘US’) = ‘UC’.
  return: the candidate set

  Set C to the empty set
  For i = 1 to l
    Set p to the first path of length i in lexicographic order
    While p is still a path of length i
      Let L be the set of objects reachable from the objects in S through path p
      If |C ∪ L| < t Then
        C := C ∪ L
      Else
        Randomly select (t − |C|) objects from L and add them into C
        Return C
      End If
      p := nextpath(p)
    End While
  Next
  Return C

Algorithm 1. Spanning from seed set to candidate set

Although the candidates are ranked by the heuristic rules, the ranking is rather tentative and rough. For example, it is arguable whether the candidates with path ‘C’ should rank higher than the candidates with path ‘UU’. Moreover, the weights of links are not considered in ranking these candidate objects. In the distillation process, this tentative ranking is discarded and all the candidates are re-sorted by analyzing the link structure using a more sophisticated algorithm.

3.3 Results Distillation
In this phase, the link structure of the sub-graph that corresponds to the candidate objects is analyzed, in order to determine the relevance of each candidate object to the query. Based on our basic premise that a link conveys relevance between two objects, we make a further assumption that a candidate object is more relevant to the query, if (1) it connects with a larger number of relevant candidates, or (2) it connects with relevant candidates

through links of higher weights, and (3) it connects to candidates Distillation(C)
that are more relevant to the query. C: the candidate set
r = [ri]: the overall relevance vector with each element ri being
Since the LGM has three layers, the distillation is performed in
the overall relevance score of object Oi in C
two steps: firstly, the candidates are ranked by analyzing the link
structure at each single layer, and then, the ranking of different wU, wS, wC: the weight for the user layer, the structure layer, and
layers are merged to give the final ranking. The single-layer the content layer
ranking algorithm works iteratively. Suppose each candidate return: the overall relevance vector for C
object Oi has a relevance score ri, which is initialized to 1.0. In
rU := Rank (C, “user”)
each round, we update ri by setting it to the sum of the product of
rS := Rank (C, “structure”)
the link weight and the relevance scores of the objects linking rC := Rank (C, “content”)
with Oi and then normalizing it. Note that such an update nicely For each object Oi in C
captures our assumption—the object with a large number of links, ri := wU·riU+ wS ·riS + wC ·riC
high-weighted links, and links with relevant objects will get a Next
high relevance score. The process repeats until every ri converges Return r
to a fixed value, which gives the final relevance scores of the
corresponding object. The detailed algorithm is shown as follows: Algorithm 3. Ranking candidates by combing multiple layers

Rank (C, s) 3.4 Knowledge Update


C: the candidate set If the user is not fully satisfied with the results generated in the
s={“user”, “structure”, “content”}: the knowledge layer distillation phase, he can give further hints by labeling the current
r=[ri]: the relevance vector with each element ri as the relevance results as either relevant or irrelevant examples. Upon the
score of object Oi in C acceptance of such user evaluations, the system initiates a two-
stage process: firstly, it incorporates the knowledge deduced from
M=[mij]: the adjacency matrix of the sub-graph corresponding to
user evaluations into the LGM; and then, it refines the previous
C at the layer s results based on the updated LGM and the user evaluations. The
return: the relevance vector for C first stage has a long-term influence since it updates the
knowledge base, while the second stage focuses on short-term
Initialize all the elements of vector r to 1.0
effect as the user satisfaction in the current retrieval session.

    While the vector r has not converged
        For each object Oi in C
            ri := Σ_{j=1,…,|C|} (rj · mij)
        Next
        Normalize r such that Σ ri² = 1
    Return r

Algorithm 2. Ranking candidates at a single layer

The above algorithm updates the vector r by repeating the operation M×r→r until it converges. At that point, the elements of r give the final relevance score of each object to the query, according to which the candidates can be sorted. Previous works on link analysis [13][19] have proved the convergence of r (i.e., termination of the algorithm), and r is in fact the principal eigenvector of the matrix M.

After applying the above ranking algorithm on each of the three layers in the LGM, we need to merge the three ranking lists into a uniform one. However, since the three layers deal with knowledge on different aspects, it is nearly impossible to design a "fair" strategy for combining the results. We suggest a heuristic strategy: linearly combine the relevance scores (of a candidate) obtained from the different layers to compute the overall relevance score, as shown in Algorithm 3. Intuitively, the three weights used in the algorithm satisfy wU > wS > wC, which reflects the priorities of the three layers. The candidate objects are ranked according to their overall relevance scores generated by this algorithm before they are presented to the user.

The user evaluations are incorporated into the LGM by updating the user links. The underlying principle of link update is rather intuitive: for a relevant example, we link it with every seed object, or increase the weight of the existing link between them; for irrelevant examples, we take the opposite action. The algorithm for user link update is presented as follows:

Update (S, F+, F–)
    S: the original seed set
    F+: the set of relevant examples
    F–: the set of irrelevant examples
    MU = [mij]: the adjacency matrix of the user layer
    s, t: positive real numbers

    For each object Oi in S
        For each object Oj in F+
            mij := mij + s
        Next
        For each object Ok in F–
            mik := mik – t
            If mik < 0, then mik := 0
        Next
    Next
    Return

Algorithm 4. Update knowledge base from user evaluations

Note that mij not only defines the weight of a link, but also governs the existence of the link. When mij is increased from zero to a positive value, a link between Oi and Oj is created; when mij is decreased to zero, the link is removed. The parameter t is usually set to a value larger than s, so that a link on which users have contradictory opinions will not receive a confidence weight.
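
The single-layer ranking of Algorithm 2 and the linear merging of per-layer scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's implementation: the function names, the uniform starting vector, the convergence tolerance, and the default weights (chosen only to satisfy wU > wS > wC) are all hypothetical.

```python
import math


def rank_layer(M, tol=1e-9, max_iter=1000):
    """Repeat r <- M r with L2 normalization until r stops changing.

    M is a square adjacency matrix given as a list of rows.
    The uniform start vector, tol and max_iter are assumptions.
    """
    n = len(M)
    if n == 0:
        return []
    r = [1.0 / math.sqrt(n)] * n
    for _ in range(max_iter):
        # ri := sum over j of (rj * mij), computed for all i at once
        nxt = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return nxt  # no links at all: every score is zero
        nxt = [x / norm for x in nxt]  # enforce sum of ri^2 = 1
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r  # max_iter guards against inputs that do not settle


def combine_layers(r_user, r_struct, r_content,
                   w_user=0.5, w_struct=0.3, w_content=0.2):
    """Linearly combine per-layer scores; weights are hypothetical values
    that only respect the ordering wU > wS > wC."""
    return [w_user * u + w_struct * s + w_content * c
            for u, s, c in zip(r_user, r_struct, r_content)]
```

For example, with M = [[2, 0], [0, 1]] the iteration approaches (1, 0), the eigenvector belonging to the larger eigenvalue.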

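
The user link update of Algorithm 4 can be sketched as follows. This is an illustrative sketch, not the authors' code: the dict-of-dicts representation of the user layer, the function name, and the default values of s and t (chosen only so that t > s) are assumptions. An absent entry stands for a weight of zero, i.e., no link. The update is applied in one direction only, as in the pseudocode; whether the user layer is kept symmetric is not stated in the text.

```python
def update_user_links(m_user, seeds, relevant, irrelevant, s=1.0, t=2.0):
    """Apply one round of feedback to the user-layer weights.

    m_user: {object_id: {object_id: weight}}, missing entry == weight 0
    seeds, relevant, irrelevant: iterables of object ids (S, F+, F-)
    s, t: positive increments; the defaults are hypothetical
    """
    for i in seeds:
        row = m_user.setdefault(i, {})
        for j in relevant:
            # create the link or strengthen an existing one
            row[j] = row.get(j, 0.0) + s
        for k in irrelevant:
            w = row.get(k, 0.0) - t
            if w > 0.0:
                row[k] = w
            else:
                # weight clipped at zero, which removes the link
                row.pop(k, None)
    return m_user
```

With these defaults, one positive vote followed by one negative vote on the same pair removes the link again, which mirrors the role of t > s described in the text.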
By incorporating the up-to-date user evaluations into the LGM as user links, Octopus allows future queries to benefit from previously conducted user interactions, so that retrieval performance can be progressively improved. Compared with the evolving user layer, the structure and content layers of the LGM are passive: they do not change after their initial construction.

3.5 Result Refinement

The objective of the refinement process is to refine the retrieval results based on the user's evaluation of the previous results. As shown in Figure 6, the refinement process goes through the same three steps (seed generation, spanning, and distilling) at two levels (positive and negative) in parallel, with the results finally merged. First, the original set of seed objects is combined with the relevant examples, resulting in a set of positive seeds S+; meanwhile, the irrelevant examples are regarded as negative seeds S–. Then, the positive and negative seeds are spanned into two groups of candidate objects, called positive candidates C+ and negative candidates C–, respectively. Finally, both groups of candidates are ranked using link analysis in the distillation process, and the results are merged to give the final ranking list. (By merge, we mean the integration of the relevance scores rather than the combination of objects.)

[Figure 6: Result refinement process. Positive seeds (relevant examples + original seeds) are spanned into positive candidates and distilled; negative seeds (irrelevant examples) are spanned into negative candidates and distilled; the two distilled results are merged into the refined results.]

The algorithm for the refinement process is presented below. The idea behind this algorithm is very intuitive: the refined results should be closely linked with the relevant examples and at the same time far away from the irrelevant ones. Again, the refined results are ranked according to the relevance vector returned as the outcome of this algorithm.

Refinement (S, F+, F–)
    S: the seed set
    F+: the set of relevant examples
    F–: the set of irrelevant examples
    return: the overall relevance vector for the refined results

    S+ := (S ∪ F+) – F–
    S– := F–
    C+ := Spanning (S+, l+, t+)
    C– := Spanning (S–, l–, t–)
    r+ := Distillation (C+)
    r– := Distillation (C–)
    For each object Oi in C+
        If Oi is in C–, then
            r(Oi) := r+(Oi) – r–(Oi)
        Else
            r(Oi) := r+(Oi)
        End If
    Next
    Return r

Algorithm 5. Results refinement based on user evaluations

3.6 An Algorithmic Overview

So far we have completed the whole loop of the retrieval process shown in Figure 4. We integrate all the aforementioned algorithms into the following "main routine" to present an algorithmic overview of the main flow of the Octopus mechanism.

Octopus (S):
    S: the set of seed objects
    sort(C, r): a subroutine that sorts the elements in set C according to vector r, which gives the relevance score of each element in C
    return: R, the set of ranked results

    C := Spanning (S, l, t)
    r := Distillation (C)
    R := sort(C, r)
    While the user is not satisfied with R
        Let F+ and F– be relevant and irrelevant examples of the current session
        Update (S, F+, F–)
        r := Refinement (S, F+, F–)
        S := (S ∪ F+) – F–
        Let C+ be the set of objects corresponding to r
        R := sort(C+, r)

Algorithm 6. The main flow of Octopus mechanism

4. PROTOTYPING AND ILLUSTRATIVE EXAMPLES

A preliminary prototype system has been implemented based on the Octopus mechanism. The modalities currently supported are text, image, and video; audio is left out simply because we do not have any audio processing algorithms at hand. The primitive features and similarity functions used for these media are shown in Table 2. To guarantee high efficiency, the maximum path length permitted for candidate spanning (see Algorithm 3) is set to 2, and the total number of candidates is restricted to 100.

Table 2: Primitive features and similarity metric used in the prototype system

Media   Primitive features                      Similarity metric
Text    keywords (TF*IDF weighting)             cosine distance
Image   256-d HSV color histogram,              Euclidean distance
        64-d LAB color coherence,
        32-d Tamura directionality
Video   shot boundary detection, using the      key-frame similarity as shot similarity;
        first frame of each shot as key-frame   average pair-wise shot similarity as video similarity

We do not conduct any quantitative evaluation of the retrieval performance, mainly due to the lack of a benchmark for such multi-modality search. There does not exist, for example, a criterion to evaluate the quality of text and images retrieved using a video clip as the seed. Moreover, too many human factors are involved in this cooperative mechanism, such as the selection of seeds and the evaluation of results, which further complicates the task of performance evaluation. In fact, providing performance

[Figure 7: Illustrative examples. (a) Movie site: pages for movie star Tom Hanks, movie star Meg Ryan, movie "Cast Away", and movie "You've Got Mail", plus a video segment of "Cast Away"; the media objects are labeled A to I, and the legend distinguishes user links, structure links, content links, seed objects, and relevant examples. (b) Query by keyword "Tom Hanks". (c) Query by Meg Ryan's photo. (d) Query by video segment of "Cast Away".]

improvement over CBR approaches is not the main objective of Octopus; instead, its emphasis is on a novel scenario for multi-modality retrieval, which is not possible with previous approaches. Some characteristic queries and results are shown below to demonstrate the variety and flexibility of retrieval in this new scenario.

Figure 7(a) shows some pages of a website about movies, whose content is rich in multimedia objects. There are two major types of pages in this site: pages of movie stars such as Tom Hanks and Meg Ryan, and pages for movies such as "You've Got Mail" and "Cast Away". A star's page contains his/her photo and biography (text), while a movie's page has an introduction to the movie, along with a picture showing one of the movie scenes. The page of each star points to the pages of the movies in which he/she has played a role, e.g., the pages of Tom Hanks and Meg Ryan both point to the movie "You've Got Mail". Meanwhile, the page for a movie points to the pages of the stars who are in the cast. There is also a video clip of the movie "Cast Away" (object E) available on a separate page (it is not shown explicitly, but is pointed to by a hyperlink on the page). All the hyperlinks are shown in Figure 7(a), based on which we can construct all the structure links using the strategy introduced in Section 2.2.

Figures 7(b)-(d) illustrate how Octopus works for three different types of queries. In the first case, the user inputs Tom Hanks' name as the query, intending to find some materials about him. Since the query is an isolated text object that does not previously exist in the LGM, it has neither user links nor structure links. Therefore, in the candidate spanning process, we first rely on the content links to find three text objects (the introductions to "Cast Away" and "You've Got Mail", and his biography) in which Tom Hanks' name recurs several times. All the other objects are reached from these three text objects through structure links. So, although this query starts with a traditional "search-by-keyword" mode, it results in a rich collection of multimedia objects,

including his photo, the introductions to his movies, the movie scenes and video clip, and even the materials of his partner Meg Ryan.

In the second query (see Figure 7(c)), the user chooses Meg Ryan's photo as the seed object. Following the structure links from it, we reach her biography and the materials about her movie "You've Got Mail", through which Tom Hanks' page is also retrieved. Note that this search is the opposite of what CBR systems usually do, i.e., it uses images to search for text rather than searching for images by text. We suppose that in feedback, the user labels Tom Hanks' biography as a relevant example, so that a user link is created between it and Meg Ryan's photo.

The last query (see Figure 7(d)) is even more ambitious. Starting with a video clip, the user wants to find some related materials about the movie "Cast Away". As the result of traversing along structure links, the content on the pages of "Cast Away" and Tom Hanks is returned. In addition, the user link created in the previous session leads us to Meg Ryan's photo via Tom Hanks' biography. (This makes sense since Meg Ryan and Tom Hanks have cooperated in many famous movies.)

5. RELATED WORK

In this section, we discuss the connection of our model with previous works on multimedia retrieval and link analysis, and demonstrate how, in some cases, our model can be reduced or transformed to other approaches.

5.1 Multimedia Retrieval

Previous works addressing multimedia retrieval can be classified into two groups: approaches on single-modality retrieval and approaches on multi-modality integration.

• Single-modality retrieval. The retrieval approaches in this group deal with only a single type of media, so most content-based retrieval approaches (e.g., [4],[7],[12],[20],[21]) fall into this group. Among them, the QBIC system [7], the MARS project [12], and the VisualSEEk system [20] focus on image retrieval, the VideoQ system [4] is for video retrieval, and the WebSEEK system [21] is a Web-oriented search engine that can retrieve both images and video clips. These approaches differ from each other in the low-level features extracted from the data and in the distance functions used for similarity calculation. Despite the differences, all of them are similar in two fundamental aspects: (1) they all rely on low-level features; (2) they all use the query-by-example paradigm. Since the content layer of our LGM is built on the similarity among objects on low-level features, our approach can be reduced to other CBR approaches if we consider only the content layer during the retrieval process and rank the candidates according to the weight of their content links to the seed.

• Multi-modality integration. In the past few years, some works have investigated the integration of multi-modality data, usually between text and image, for better retrieval performance. For example, the iFind system [17] proposes a unified framework under which the semantic feature (text) and low-level features are combined for image retrieval, and the 2M2Net system [23] extends this framework to the retrieval of video and audio. The WebSEEK system [21] extracts keywords from the surrounding text of images and videos, which are used as their indexes in the retrieval process. Although these systems involve more than one medium, the different media are not actually integrated but sit on different levels. Usually, text is only used as the annotation (index) of other media. In this regard, our mechanism enables an extremely high degree of multi-modality integration, since it allows interaction among objects of any modality in any possible way (via different types of links).

More recently, MediaNet [1] and the multimedia thesaurus (MMT) [22] have been proposed, both of which seek to provide a multimedia representation of semantic concepts (a concept being described by various media objects including text, image, video, etc.) and to establish the relationships among these concepts. MediaNet extends the notion of relationships to include even perceptual relationships among media objects. Both can be regarded as "concept-centric" approaches since they organize multi-modality objects around semantic concepts. From this view, our mechanism is "concept-less" since we make no attempt to identify explicitly the semantics of each object.

5.2 Link Analysis

There have been many successful previous works on link analysis, among which the most notable are the PageRank model and the notion of hubs and authorities. PageRank [3] is based on the random-walk model and is used to compute the probability that a Web surfer visits a certain page. The effectiveness of this model has been demonstrated by its successful application in the search engine Google [9]. In contrast, Kleinberg [13] suggested that each page has two scores: an authority score, which describes how authoritative a page is on a certain topic, and a hub score, which reflects how many authoritative pages it points to.

The link analysis technique has been successfully applied to a broad range of applications. The approach of Bharat et al. [2], the PageRank model [3], and HITS [13] are used to search for the most authoritative pages on a certain topic. The approach proposed by Rafiei et al. [19] identifies the topics of a designated page. Dean et al. [6] discuss how to find pages related to a certain page. There is also a group of works (e.g., Kumar et al. [14], Gibson et al. [8], Pirolli et al. [18]) that aim at inferring and analyzing web communities or other web structures from the hyperlinks. Henzinger et al. [10] suggested measuring link quality of a web page using the random-walk model. Very recently, Lempel et al. [15] proposed the PicASHOW system, which applies link analysis to web-based image retrieval.

Since the link analysis approach in Octopus is geared towards the goal of multimedia retrieval, it differs from conventional link analysis approaches in the following aspects:

• Application: To our knowledge, Octopus is the first application of link analysis to the search of multi-modality data. (PicASHOW deals only with images.)

• Link types: Our multifaceted knowledge base accommodates three types of links, while most previous approaches focus only on hyperlinks, which are actually a special form of our structure link. This implies that our approach can be reduced to other approaches if only the structure layer is addressed in the retrieval process. Some link analysis approaches (e.g., [2],[19]) also take into account content (text) similarity. However, they usually combine the content similarity with the analysis of hyperlinks, rather than building a separate layer for it as is the case in the LGM.

• Link analysis algorithm: In terms of algorithm, our link analysis algorithm is much closer to the PageRank model, since for each object we calculate only one score. However, we do not

use the random-walk model, since our LGM is fundamentally different from the world of hyperlinks in which the random-walk model makes sense. We do not adopt the hubs and authorities model because it is based on the observation that on the Web relevance may propagate from one page to another via a totally irrelevant page through hyperlinks, which does not agree with our basic premise that relevance spreads between directly linked objects.

• Link update: Most previous works on link analysis suggest static approaches in that they only analyze the link structure. In contrast, our mechanism is incremental as it permits user links to be enriched and updated by learning from user behaviors. This makes our approach preferable, since it allows self-improvement of the retrieval performance.

6. CONCLUSION AND FUTURE WORK

In this paper, we have described the Octopus mechanism for aggressive search of multi-modality data based on a multifaceted knowledge base. Specifically, this mechanism applies link analysis techniques to search for multi-modality objects, the relevance between which is described by a layered graph model (LGM) as the core of the knowledge base. A unique relevance feedback technique is developed that can enhance the retrieval performance progressively by learning from user behaviors. The highlights of our mechanism are summarized as follows:

• At the interface level, Octopus provides users with great convenience and flexibility. For example, the seed objects can be of any modality and are not necessarily representative samples. The retrieval results are also of multiple modalities, which can meet a variety of user requirements.

• The LGM draws on a broad coverage of knowledge to evaluate the similarity between media objects. Therefore, the results retrieved based on it can be more relevant (to the query) than those retrieved by CBR systems, which rely on low-level features only.

• The knowledge base is enriched by learning from user behaviors, so that the retrieval performance can be enhanced in a hill-climbing manner.

• The LGM provides a solid and generic foundation for multimedia retrieval, which can be extended in a number of directions. For example, a new type of media can be easily integrated into the model as long as its primitive features are specified. Moreover, a new class of knowledge (on the relevance between media objects) that is orthogonal to the existing knowledge can be introduced into the LGM as a new layer with only minor adjustment of the link analysis algorithms.

Due to the generality and extensibility of the LGM, many potential applications can be implemented based on it. We identify some of them as our future work:

• Navigation. The LGM provides abundant links through which the user can traverse from one object to its related objects. Therefore, it supports a natural navigation scenario: when a user is visiting (viewing) a media object, the system recommends the objects linked with it in the LGM, ranked according to the weights and types of links, from which the user can select the next object to navigate to.

• Clustering. Clustering multi-modality objects into semantically meaningful groups is also an important and challenging task, which requires a similarity function (between media objects) as well as a clustering method. Our LGM provides knowledgeable links, based on which different similarity functions can be easily formulated. Meanwhile, many clustering methods have been proposed, such as simulated and deterministic annealing algorithms [11]. Moreover, our model inherently allows the clustering of multi-modality objects, rather than the single-modality objects that most existing classification approaches deal with.

• Personalized retrieval. The user layer of the LGM characterizes the knowledge obtained from the behaviors of the whole population of users, and allows a query from a single user to benefit from such common knowledge. However, users also have personal interests and preferences that vary from one user to another. To provide a personalized retrieval service, a mechanism needs to be developed to model the user preferences and adapt the retrieval results towards such preferences. The two-level "user profiling" mechanism proposed by us in [16] provides a viable solution in this regard.

7. ACKNOWLEDGMENTS

The work described in this paper was supported, substantially, by a grant from CityU (Project No. 7100196), partially by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 1119/99E], and partially by a grant from the Doctorate Research Foundation of the State Education Commission of China.

8. REFERENCES

[1] Benitez, A. B., Smith, J. R. and Chang, S. F., "MediaNet: A Multimedia Information Network for Knowledge Representation". In Proc. of the SPIE 2000 Conference on Internet Multimedia Management Systems, vol. 4210, 2000.

[2] Bharat, K. and Henzinger, M. R., "Improved Algorithm for Topic Distilling in Hyperlinked Environments". In Proc. of the 21st Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 104-111, 1998.

[3] Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine". In Proc. of the 7th Int. World Wide Web Conf., pp. 107-117, 1998.

[4] Chang, S. F., Chen, W., Meng, H. J., Sundaram, H. and Zhong, D., "VideoQ: An Automated Content Based Video Search System Using Visual Cues". In Proc. of ACM Multimedia, pp. 313-324, 1997.

[5] Chen, W. and Chang, S. F., "VISMAP: An Interactive Image/Video Retrieval System Using Visualization and Concept Maps". In Proc. of Int. Conf. on Image Processing (ICIP), Greece, October 2001.

[6] Dean, J. and Henzinger, M. R., "Finding Related Pages on the Web". In Proc. of the 8th Int. World Wide Web Conf., pp. 389-401, 1999.

[7] Flickner, M., Sawhney, H., Niblack, W. and Ashley, J., "Query by image and video content: The QBIC system". IEEE Computer, pp. 23-32, 1995.

[8] Gibson, D., Kleinberg, J. M., and Raghavan, P., "Inferring Web Communities from Link Topology". In Proc. of the 9th Conf. on Hypertext and Hypermedia, pp. 225-234, 1998.

[9] Google Search Engine. http://www.google.com.

[10] Henzinger, M. R., Heydon, A., Mitzenmacher, M. and Najork, M., "Measuring Index Quality using Random Walks on the Web". In Proc. of the 8th Int. World Wide Web Conf., pp. 213-225, 1999.

[11] Hofmann, T. and Buhmann, J. M., "Pairwise Data Clustering by Deterministic Annealing". IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(1): 1-14, 1997.

[12] Huang, T. S., Mehrotra, S., and Ramchandran, K., "Multimedia analysis and retrieval system (MARS) project". In Proc. of 33rd Annual Clinic on Library Application of Data Processing-Digital Image Access and Retrieval, 1996.

[13] Kleinberg, J. M., "Authoritative Sources in a Hyperlinked Environment". In Proc. of ACM-SIAM Symposium on Discrete Algorithms, pp. 668-677, 1998.

[14] Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A., "Trawling the Web for Emerging Cyber-communities". In Proc. of the 8th Int. World Wide Web Conf., pp. 403-415, 1999.

[15] Lempel, R. and Soffer, A., "PicASHOW: Pictorial Authority Search by Hyperlinks on the Web". In Proc. 10th Int. World Wide Web Conf., pp. 438-448, 2001.

[16] Li, Q., Yang, J., and Zhuang, Y. T., "Web-based Multimedia Retrieval: Balancing out between Common Knowledge and Personalized Views". In Proc. of 2nd Int. Conf. on Web Information System and Engineering, pp. 100-109, 2001.

[17] Lu, Y., Hu, C. H., Zhu, X. Q., Zhang, H. J. and Yang, Q., "A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems". In Proc. of ACM Multimedia, pp. 31-38, 2000.

[18] Pirolli, P., Pitkow, J., and Rao, R., "Silk from a Sow's Ear: Extracting Usable Structure from the Web". In Proc. ACM SIGCHI Conf. on Human Factors in Computing Systems, pp. 383-390, 1997.

[19] Rafiei, D. and Mendelzon, A. O., "What is this Page Known for? Computing Web Page Reputations". In Proc. of Int. World Wide Web Conf., pp. 823-835, 2000.

[20] Smith, J. R. and Chang, S. F., "VisualSEEk: a fully automated content-based image query system". In Proc. of ACM Multimedia 96, pp. 87-98, 1996.

[21] Smith, J. R. and Chang, S. F., "Visually Searching the Web for Content". IEEE Multimedia Magazine, 4(3): 12-20, 1997.

[22] Tansley, R., "The Multimedia Thesaurus: An Aid for Multimedia Information Retrieval and Navigation". Master Thesis, Computer Science, University of Southampton, UK, 1998.

[23] Yang, J., Zhuang, Y. T., Li, Q., "Search for Multi-Modality Data in Digital Libraries". In Proc. of 2nd IEEE Pacific-Rim Conference on Multimedia, pp. 482-489, China, 2001.
