0% found this document useful (0 votes)
264 views14 pages

Image Classification For Content-Based Indexing

Using binary Bayesian classifiers, we attempt to capture high-level concepts from low-level image features. Our system achieved a classification accuracy of 90.5% for indoor / outdoor, 95.3% for city / landscape, 96.6% for sunset / forest and mountain. With the development of digital photography, more and more people are able to store vacation and personal photographs on their computers.

Uploaded by

nobeen666
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
264 views14 pages

Image Classification For Content-Based Indexing

Using binary Bayesian classifiers, we attempt to capture high-level concepts from low-level image features. Our system achieved a classification accuracy of 90.5% for indoor / outdoor, 95.3% for city / landscape, 96.6% for sunset / forest and mountain. With the development of digital photography, more and more people are able to store vacation and personal photographs on their computers.

Uploaded by

nobeen666
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO.

1, JANUARY 2001 117

Image Classification for Content-Based Indexing


Aditya Vailaya, Associate Member, IEEE, Mário A. T. Figueiredo, Member, IEEE, Anil K. Jain, Fellow, IEEE, and
Hong-Jiang Zhang, Senior Member, IEEE

Abstract—Grouping images into (semantically) meaningful these databases more useful, we need to develop schemes for
categories using low-level visual features is a challenging and indexing and categorizing the humungous data.
important problem in content-based image retrieval. Using binary Several content-based image retrieval systems have been
Bayesian classifiers, we attempt to capture high-level concepts
from low-level image features under the constraint that the test recently proposed: QBIC [5], Photobook [26], SWIM [44],
image does belong to one of the classes. Specifically, we consider Virage [10], Visualseek [36], Netra [17], and MARS [20].
the hierarchical classification of vacation images; at the highest These systems follow the paradigm of representing images
level, images are classified as indoor or outdoor; outdoor images using a set of attributes, such as color, texture, shape, and
are further classified as city or landscape; finally, a subset of land- layout, which are archived along with the images. Retrieval
scape images is classified into sunset, forest, and mountain classes.
We demonstrate that a small vector quantizer (whose optimal is performed by matching the features of a query image with
size is selected using a modified MDL criterion) can be used to those in the database. Users typically do not think in terms
model the class-conditional densities of the features, required by of low-level features, i.e., user queries are typically semantic
the Bayesian methodology. The classifiers have been designed and (e.g., “show me a sunset image”) and not low-level (e.g., “show
evaluated on a database of 6931 vacation photographs. Our system me a predominantly red and orange image”). As a result, most
achieved a classification accuracy of 90.5% for indoor/outdoor,
95.3% for city/landscape, 96.6% for sunset/forest & mountain, of these image retrieval systems have poor performance for
and 96% for forest/mountain classification problems. We further (semantically) specific queries. For example, Fig. 1(b) shows
develop a learning method to incrementally train the classifiers the top-ten retrieved images (based on color histogram features)
as additional data become available. We also show preliminary from a database of 2145 images of city and landscape scenes,
results for feature reduction using clustering techniques. Our for the query in Fig. 1(a). While the query image has a monu-
goal is to combine multiple two-class classifiers into a single
hierarchical classifier. ment, some of the retrieved images have mountain and coast
scenes. Recent research in human perception of image content
Index Terms—Bayesian methods, content-based retrieval, digital [21], [24], [27], [31] suggests the importance of semantic cues
libraries, image content analysis, minimum description length, se-
mantic indexing, vector quantization. for efficient retrieval. One method to decode human perception
is through the use of relevance feedback mechanisms [33]. A
second method relies on grouping the images into semantically
I. INTRODUCTION meaningful classes [42]. Fig. 1(c) shows the top-ten results
(again based on color histograms) on a database of 760 city
C ONTENT-BASED image retrieval has emerged as an
important area in computer vision and multimedia
computing. Many organizations have large image and video
images for the same query; clearly, filtering out landscape
images improves the retrieval result.
collections (programs, news segments, games, art) in digital As shown in Fig. 1(a)–(c), a successful indexing/categoriza-
format, available for on-line access. Organizing these libraries tion of images greatly enhances the performance of content-
into categories and providing effective indexing is imperative based retrieval systems by filtering out irrelevant classes. This
for “real-time” browsing and retrieval. With the development rather difficult problem has not been adequately addressed in
of digital photography, more and more people are able to store current image database systems. The main problem is that only
vacation and personal photographs on their computers. As low-level features (as opposed to higher level features such as
an example, travel agencies are interested in digital archives objects and their inter-relationships) can be reliably extracted
of photographs of holiday resorts; a user could query these from images. For example, color histograms are easily extracted
databases to plan a vacation. However, in order to make from color images, but the presence of sky, trees, buildings,
people, etc., cannot be reliably detected. The main challenge,
thereby, lies in grouping images into semantically meaningful
Manuscript received February 8, 1999; revised August 11, 2000. The asso- categories based on low-level visual features. One attempt to
ciate editor coordinating the review of this manuscript and approving it for pub- solve this problem is the hierarchical indexing scheme proposed
lication was Prof. Tsuhan Chen.
A. Vailaya is with Agilent Technologies, Palo Alto, CA 94303-0867 USA in [45], [46], which performs clustering based on color and tex-
(e-mail: [email protected]). ture, using a self-organizing map. This indexing scheme was
M. Figueiredo is with the Instituto de Telecomunicações and Instituto Supe- further applied in [16] to create a texture thesaurus for indexing
rior Técnico, 1049-001 Lisboa, Portugal (e-mail: [email protected]).
A. K. Jain is with the Department of Computer Science and Engi- a database of aerial photographs. However, the success of such
neering, Michigan State University, East Lansing, MI 48824 USA (e-mail: clustering-based schemes is often limited, largely due to the
[email protected]). low-level feature-based representation of image content. For ex-
H.-J. Zhang is with Microsoft Research China, Beijing 100 080, China
(e-mail: [email protected]). ample, Fig. 2(a)–(d) shows two images and their corresponding
Publisher Item Identifier S 1057-7149(01)00098-7. edge direction coherence feature vectors (see [42]). Although,
1057–7149/01$10.00 © 2001 IEEE
118 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 1, JANUARY 2001

Fig. 1. Color-based retrieval. (a) Query image, (b) top-ten retrieved images
from 2145 city and landscape images, and (c) top-ten retrieved images from 760
city images; filtering out landscape images prior to querying clearly improves
the retrieval results.

Fig. 3. (a) Hierarchy of the 11 categories obtained from human provided


grouping [42] and (b) simplified semantic classification of images; solid lines
show the classification problems addressed in this paper.

features can be used in constrained environments to discrimi-


nate between certain conceptual image classes. To achieve au-
tomatic categorization/indexing in a large database, we need to
develop robust schemes to identify salient image features cap-
turing a certain aspect of the semantic content. This necessitates
an initial specification of meaningful classes, so that the data-
base images can be organized in a supervised fashion.
In this paper, we address the problem of image classification
from low-level features. Specifically, we classify vacation
photographs into a hierarchy of high-level classes. Photographs
are first classified as indoor or outdoor. Outdoor images are
then classified as city or landscape. A subset of landscape
Fig. 2. Edge direction coherence vector features for (a) fingerprint and (c) images is further classified into sunset, forest, and mountain
landscape image.
classes. The above hierarchy was identified based on experi-
ments with human subjects on a small database of 171 images
these are semantically very different concepts, their edge direc- [42] (as briefly described in Section II). These classification
tion histograms are highly similar, illustrating the limitations problems are addressed using Bayesian theory. The required
of this low-level feature in capturing semantic content. Yet, we class-conditional probability density functions are estimated,
shall show that these same features are sufficiently discrimina- during a training phase, using vector quantization (VQ) [9].
tive for city/landscape classification. That is, specific low-level An MDL-type principle [30] is used to determine the optimal
VAILAYA et al.: IMAGE CLASSIFICATION FOR CONTENT-BASED INDEXING 119

codebook size from the training samples. Advantages of the are grouped into the class natural scenes. Natural scenes and
Bayesian approach include sunset images were further grouped into the landscape class.
1) small number of codebook vectors represent each class, City shots, monuments, and shots of Washington DC were
thus greatly reducing the number of comparisons neces- grouped into the city class. Finally, the miscellaneous, face,
sary for each classification; landscape, and city classes were grouped into the top-level
2) it naturally allows for the integration of multiple features class of vacation scenes. We conducted additional experiments
through the class-conditional densities; to verify that the above hierarchy is reasonable: we used a
3) in addition to a classification rule, we have degrees of con- multidimensional scaling algorithm to generate a three-dimen-
fidence which may be used to incorporate a reject option sional (3-D) feature space to embed the 171 images from the
into the classifiers. dissimilarity matrix used above (generated from user
groupings). We then applied a -means clustering algorithm
The paper is organized as follows. Section II briefly mentions
to partition the (3-D) data. Our goal was to verify if the main
psychophysical studies which are the basis of our work in iden-
clusters in this representation space agreed with the hierarchy
tifying the global scene represented in an image. We also de-
shown in Fig. 3(a). For , we obtained two clusters of
scribe our experiments with human subjects to identify concep-
62 and 109 images, respectively. The first cluster consisted of
tual classes in a database of vacation images. After reviewing the
predominantly city images, while the second cluster contained
Bayesian framework for image classification in Section III, Sec-
landscape images. The following clusters were obtained with
tion IV addresses VQ-based density estimation and the MDL
principle for selecting codebook sizes. Section V discusses im-
plementation issues. We report the classification accuracies in 1) city scenes (70 images);
Section VI. Sections VII and VIII discuss approaches for using 2) sunrise/sunset images (21 images);
incremental learning and automatic feature selection. Finally, 3) forest and farmland scenes and pathways (49 images);
Section IX concludes the paper and presents directions for fu- 4) mountain and coast scenes (31 images).
ture research. These groupings motivated us to study a hierarchical classifica-
tion of vacation images.
In order to make the problem more tractable, we simplified
II. HIGH-LEVEL CLASSES IDENTIFIED BY HUMANS
the classification hierarchy as shown in Fig. 3(b). The solid lines
Psychophysical and psychological studies have shown show the classification problems addressed in this paper. This
that scene identification by humans can proceed, in certain hierarchy is not complete, e.g., a user may be interested in im-
cases, without any kind of object identification [1], [2], [34]. ages captured in the evening or images containing faces. How-
Biederman [1], [2] suggested that an arrangement of volumetric ever, it is a reasonable approach to simplify the image retrieval
primitives (geons), each representing a prominent object in the problem.
scene, may allow rapid scene identification independently of Another limitation of the proposed hierarchy is that the
local object identification. Schyns and Oliva [34] demonstrated leaf nodes are not mutually exclusive. For example, an image
that scenes can be identified from low spatial-frequency images can belong to both the city and sunset categories. One way
that preserve the spatial relations between large-scale structures to address this issue is to develop individual classifiers such
in the scene, but which lack the visual detail to identify local as city/non-city or sunset/non-sunset, instead of a hierarchy.
objects. These results suggest the possibility of coarse scene However, this would drastically increase the complexity of the
identification from global low-level features before the identity classification task (now we will have to identify city scenes
of objects is established. Based on these observations, we from all possible scenes, rather than differentiate between city
address the problem of scene identification as the first step and landscape scenes).
toward building semantic indices into image databases. Most images can be classified as representing indoor or
The first step toward building a classifier is to identify mean- outdoor scenes. Exceptions include close-ups and pictures of
ingful image categories which can be automatically identified a window or door. Outdoor images can be further divided into
by simple and efficient pattern recognition techniques. For city or landscape [40], [42]. City scenes can be characterized
this purpose, we conducted a simple small-scale experiment by the presence of man-made objects and structures such as
in which eight human subjects classified 171 vacation images buildings, cars, roads. Natural scenes, on the other hand, lack
[42]. Our goal was to identify a hierarchy of classes into which these structures. A subset of landscape images can be further
the vacation images can be organized. Since these classes classified into one of the sunset, forest, and mountain classes.
match human perception, they allow organizing the database Sunset scenes are characterized by saturated colors (red,
for effective browsing and retrieval. orange, or yellow), forest scenes have predominantly green
Our experiments revealed a total of 11 semantic cate- color distribution, and mountain scenes can be characterized
gories: forests and farmlands, mountains, beach scenes, by long distance shots of mountains (either snow covered, or
pathways, sunset/sunrise images, long distance city shots, barren plateaus).
streets/buildings, monuments/towers, shots of Washington, We assume that the input images do belong to one of the
DC, miscellaneous images, and faces. We organized these classes under consideration. This restriction is imposed because
11 categories into the hierarchy shown in Fig. 3(a). The first automatically rejecting images that do not belong to any of the
four classes (forests, mountains, beach scenes, and pathways) classes, based on low-level image features alone, is in itself a
120 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 1, JANUARY 2001

very difficult problem (see Fig. 2). However, for images be- an encoder : , mapping the input alphabet to
longing to the classes of interest, the Bayesian methodology the channel symbol set , and a decoder : which
can be used to reject ambiguous images based on the confi- maps to the output alphabet (or codebook). A distortion
dence values associated with the images (images that belong to measure specifies the cost associated with quantiza-
both the classes of interest, such as an image of a city scene tion, where . An optimal quantizer minimizes the
at sunset). We briefly discuss incorporating the reject option in average distortion under a size constraint on [8]. The gen-
Section VI-F. eralized Lloyd algorithm (GLA) is an iterative algorithm for
obtaining a (locally) optimal VQ. Under a mean square error
III. BAYESIAN FRAMEWORK (MSE) distortion criterion, GLA is equivalent to the -means
( ) clustering algorithm [11]. Any given input vector
Bayesian methods have been successfully adopted in many
is quantized into the closest (in ) of the codebook
image analysis and computer vision problems. However, its use
vectors. This defines a partition of the space into the so-called
in content-based retrieval from image databases is just being
Voronoi cells [8]. A comprehensive
realized [43].
study of VQ can be found in [3], [8].
We now review the Bayesian framework for image classifi-
cation. The set of possible images is partitioned into classes
B. Vector Quantization for Density Estimation
; any image belongs to one and only one
class. The images from class are modeled as samples of a Vector quantization provides an efficient tool for density esti-
random variable, , whose class-conditional probability den- mation [9]. Consider training samples from a class . In order
sity function is . Each class has an a priori probability, to estimate the class-conditional density of the th feature vector,
, with . A loss , VQ is used to obtain (with , usually )
function, : , specifies the loss incurred codebook vectors, ( ), from the training data.1
when class is chosen and the true class is . As is common In the so-called high-resolution approximation (i.e., for small
in classification problems, we adopt the “0/1” loss function: Voronoi cells), this density can be approximated by a piece-
, and , if . wise-constant function over each cell , with value
In most image classification problems, the decision is based
on, say , feature sets, , rather
than directly on the raw pixel values. Of course, is a function for (3)
of the image . We will then have class-conditional densities for
the features, rather than for the raw images. It is often assumed
that the feature sets are class-conditionally independent, that is where and are the ratio of training samples
falling into cell and the volume of cell , respectively,
for (1) (see [9]). This approximation fails if the cells are not suffi-
ciently small, for example, when the dimensionality of
is large. In that case, the class-conditional densities can be
The classification problem can be stated as: “given the feature
approximated using a mixture of Gaussians [9], [43], each
sets , classify the image into one of the classes in .”
centered at a codebook vector. The MSE criterion is the sum
The decision rule resulting from the “0/1” loss function is the
of the Euclidean distances of each training sample from its
maximum a posteriori (MAP) criterion [4], [29],
closest codebook vector. From a mixture point of view, this is
(2) equivalent to assuming covariance matrices of the form
(where is the identity) [43], leading to
In addition to the MAP classification, we also have a degree of
confidence which is proportional to .
(4)
IV. DENSITY ESTIMATION BY VECTOR QUANTIZATION
The performance of a Bayes classifier depends critically on where , (note that
the ability of the features to discriminate among the various ). The value of is not estimated by the VQ
classes. Moreover, since the class-conditional densities have to algorithm, and so we empirically choose it for each feature.
be estimated from data, the accuracy of these estimates is also Alternatively, we could use the EM algorithm to directly find
critical. Choosing the right set of features is a difficult problem maximum likelihood (ML) estimates of the mixture parameters,
to which we return in Section V-A. In this section, we focus under a diagonal covariance constraint [19]. This choice is
on estimating the class-conditional densities, adopting a vector computationally demanding, and we have found that the value
quantization approach [9]. of is not crucial; it simply affects the number of codebook
vectors that influence classification. Unless is exceptionally
A. Introduction to Vector Quantization
1Actually, learning vector quantization (LVQ) is used to select the codebook
For compression and communication applications, a vector
vectors. LVQ does not run the GLA separately for each class; in this algorithm,
quantizer (VQ) is described as a combination of an encoder and the codebook vectors are also “pushed away” from incorrectly classified sam-
a decoder [8]. A -dimensional VQ consists of two mappings: ples (see [14], [29]).
VAILAYA et al.: IMAGE CLASSIFICATION FOR CONTENT-BASED INDEXING 121

large, only a few codebook vectors close to the input pattern A. Image Features
influence the class-conditional probabilities.
Outdoor images tend to have uniform spatial color distribu-
tions, such as the sky is on top and is typically blue. Indoor
C. Selecting Codebook Size images tend to have more varied color distributions and have
Selecting is a key issue in using a VQ, or a mixture, for den- more uniform lighting (most are close up shots). Thus, it seems
sity representation. We start by noting that GLA approximately logical that spatial color distribution can discriminate between
looks for the maximum likelihood (ML) estimates of the param- indoor and outdoor images. On the other hand, shape features
eters of the mixture in (4). In fact, the EM algorithm becomes may not be useful because objects with similar shapes can be
exactly equivalent to the GLA when the variance goes to zero present in both indoor and outdoor scenes. Therefore, we use
[29]. We will therefore apply an MDL criterion to select , since spatial color information features to represent these qualitative
MDL allows extending maximum likelihood (ML) estimation to attributes. Specifically, first- and second-order moments in the
situations where the dimension of the model is unknown [30]. color space were used as color features (it was pointed
Consider a training set of independent samples out in [7] that moments yield better results in image re-
, from the class . These are, of course, trieval than other spaces). The image was divided into
samples of one of the features, although here we omit this subblocks and six features (three means and three standard de-
from the notation to keep it simpler. A direct application of the viations) were extracted [37], [41]. As another set of features for
standard MDL criterion would lead to the following criterion indoor/outdoor classification, we extract subblock MSAR tex-
to select [the size of the mixture in (4)] ture features as described in [18], [39].
We looked for similar qualitative attributes for city/land-
scape classification, and further classification of landscape
images. City images usually have strong vertical and horizontal
edges due to the presence of man-made objects. Non-city
images tend to have randomly distributed edge directions. The
where is the ML estimate assuming size , and edge direction distribution seems then as a natural feature to
is the number of real-valued discriminate between these two categories [42]. On the other
parameters needed to specify a -component mixture (with hand, color features would not have sufficient discriminatory
denoting “dimension of”) [30]. Notice that the additional power as man-made objects have arbitrary colors. In the case
term proportional to grows with , thus counter- of further classification of landscape images as sunset, forest,
balancing the unbounded increase, with , of the likelihood. or mountain, global color distributions seem to adequately
The penalty paid by each additional real param- describe these classes. Sunset pictures typically have saturated
eter has an asymptotical justification (see [30]). For a mixture, colors (mostly yellow and red); mountain images tend to have
however, it can be argued that each center does not “see” data the sky in the background (typically blue); and forest scenes
points, but only (on average) (for the th center) (see tend to have more greenish distributions. Based on the above
[15] and [6], for details). This leads to the following modified observations, we use edge direction features (histograms and
MDL (MMDL) criterion coherence vectors) for city/landscape classification and color
features (histograms, coherence vectors, and spatial moments)
in and color space for further classification of
landscape images [25], [38], [42]. Table I summarizes the
qualitative attributes of the various classes and the features
used to represent them.
(5)
B. Vector Quantization
We used the LVQ_PAK package [14] for vector quantization.
Half of the database was used to train the LVQ for each of the
V. IMPLEMENTATION ISSUES
image features. The MMDL criterion (Section IV-C) was used
Experiments were conducted on two databases (both inde- to determine the codebook sizes. For the indoor and outdoor
pendently and combined) of 5081 (indoor/outdoor classifica- classes, with the spatial color moment features, Fig. 4(a)–(c)
tion) and 2716 (city/landscape classification and further classifi- plots the MMDL cost function [(5)] versus the codebook size
cation of landscape images) images. The two databases, hence- . These plots show that and are the MMDL
forth referred to as D1 and D2, have 866 images in common, choices for the indoor and outdoor classes, respectively. For the
leading to a total of 6931 distinct images, collected from var- combination of the two classes, minimizes the MMDL
ious sources (Corel library, scanned personal photographs, key criterion. To confirm this choice from a classification point of
frames from TV serials, and images downloaded from the Web) view, Fig. 5 plots the accuracy of the indoor/outdoor classifier
and are of varying sizes (from to ). The (on an independent test set of size 2540) as a function of the
color images are stored with 24-bits per pixel in JPEG format. total codebook size . As is initially increased, the classifier
The ground truth for all the images was assigned by a single accuracy improves. However, it soon stabilizes and further in-
subject. creasing beyond 30 does not improve the accuracy. This con-
122 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 1, JANUARY 2001

TABLE I
QUALITATIVE ATTRIBUTES OF THE SEVERAL CLASSIFACTION PROBLEMS AND ASSOCIATED LOW-LEVEL FEATURES

clusion (and similar ones for city/landscape classification) sup-


ports the use of MMDL for codebook size selection.
Based on similar analysis (see [40]), 20 codebook vectors
were extracted for each of the city and landscape classes. For
further classification of landscape images, a codebook of five
vectors was selected for each class. These vectors were then
stored as representatives of each class. Table II shows the
number and dimensionality of the codebook vectors for the
various classification problems. Fig. 4. Determining codebook size for spatial color moment features for the
indoor/outdoor classification problem. (a) Indoor class, (b) outdoor class, and
VI. EXPERIMENTAL RESULTS (c) indoor and outdoor classes combined.

Given an input image, the classifier computes the class-con-


ditional probabilities for each of the features using (4). These
probabilities are then used to obtain the MAP classification
[(2)]. We present classification accuracies on a set of indepen-
dent test patterns as well as on the training patterns. We have
done classifications based on individual features and also based
on combinations of features [assumed independent, (1)]. As
we show later, each of the individual features chosen for the
Fig. 5. Accuracy of the indoor/outdoor classifier with increasing codebook
classification problems has sufficient discrimination power for size (trained on 2541 images and tested on an independent test set of 2540
that particular classification problem, and introducing other images).
features does not significantly improve the results.
TABLE II
DIMENSIONALITIES AND CODEBOOK SIZES FOR EACH CLASSIFIER
A. Indoor/Outdoor Classification
Database D1 (2470 indoor and 2611 outdoor images) was
used to train the indoor/outdoor classifier. Apart from the color
moment features, we also considered the subblock MSAR tex-
ture features [39], edge direction features, and color histograms.
MSAR features yielded an accuracy of around 75% on the test
set. A higher classification accuracy (using a -NN classifier
and leave-one-out testing) of 84% on a database of 1324 im-
ages was reported in [39]. We attribute this discrepancy to dif-
ferences in the database (our database of 5081 images is larger)
and mode of testing (we report results on an independent test an independent test set (Test Set 1 in Table III), respectively. On
set). Edge direction and coherence vector features yielded an a different test set (Test Set 2 in Table III) of 1850 images from
accuracy of around 60%, while the color moment features lead database D2, the classifier accuracy was 88.7%. An accuracy
to a much higher accuracy of around 90%. These results show of 90.5% was obtained on the entire database of 6931 images.
that the spatial color distribution (probably capturing illumina- Szummer et al. [39] use a -NN classifier and report 90% ac-
tion changes) is suited for indoor/outdoor classification. A com- curacy using leave-one-out testing, for the indoor/outdoor clas-
bination of color and texture features did not yield a better ac- sification on a database of 1324 images. Thus, our classifier’s
curacy than color moment features alone. performance is comparable to those reported in the literature. A
Table III shows the classification results with the color mo- major advantage of the Bayesian classifier over -NN classi-
ment features for indoor/outdoor classification. The classifier fier is its efficiency due to the small number of codebook vec-
showed an accuracy of 94.2% and 88.2% on the training set and tors needed to represent the training data.
VAILAYA et al.: IMAGE CLASSIFICATION FOR CONTENT-BASED INDEXING 123

TABLE III
ACCURACIES (IN PERCENT) FOR INDOOR/OUTDOOR CLASSIFICATION USING
COLOR MOMENTS; TEST SET 1 AND TEST SET 2 ARE INDEPENDENT TEST SETS

Fig. 6 shows a representative subset of the misclassified in-


door/outdoor scenes. Presence of bright spots either from some
light source or from sunshine through windows and doors seems
to be a main cause of misclassification of indoor images. The
main reasons for the misclassification of outdoor images are 1)
uniform lighting on the image mostly as a result of a close-up Fig. 6. Some misclassified (a) indoor and (b) outdoor images using color
shot and 2) low-contrast images (several of the indoor images moment features; the corresponding confidence values (in percent) associated
used in the training set were low contrast digital images and with the true class are presented.
hence most low contrast outdoor images were classified as in-
door scenes). The results show that spatial color distribution in the above experiment to 56, confirming that edge directions
captured in the subblock color moment features has sufficient are sufficient to discriminate between city and landscape.
discrimination power for indoor/outdoor classification.
C. Further Classification of Landscape Images
B. City/Landscape Classification
While our limited experiments on human subjects [42] re-
The city versus landscape classification problem and further vealed classes such as sunset and sunrise, forest and farmland,
classification of landscape images as sunset, forest, or moun- mountain, pathway, water scene, etc., these groups were not
tain using the Bayesian framework has been addressed in de- consistent among the subjects in terms of the actual labeling
tail in [40]. We summarize the results here. Table IV shows the of the images. We found it extremely difficult to generate a
results for the city/landscape classification problem using data- semantic partitioning of landscape images. We thus restricted
base D2. Edge direction coherence vector provides the best indi- classification of landscape images into three classes that could
vidual accuracy of 97.0% for the training set and 92.9% for the be more unambiguously distinguished: sunset, forest, and
test set. A total of 126 images were misclassified (95.3% accu- mountain. Of these 528 images, a human subject labeled 177,
racy) when the edge direction coherence vector was combined 196, and 155 images as belonging to the forest, mountain,
with the color histogram. Fig. 7 shows a representative subset and sunset classes, respectively. A two-stage classifier was
of misclassified images. Most of the misclassifications for city constructed. First, we classify an image into either sunset or
images could be attributed to the following reasons: the forest and mountain class. The above hierarchy was based
1) long distance city shots at night (difficulty in extracting on the human study, as shown in Fig. 3(a), where the sunset
edges); cluster seemed to be more compact and well separated from
2) top view of city scenes (lack of vertical edges); the other categories in the landscape class.
3) highly textured buildings; Table V shows the results for the classification of landscape
4) trees obstructing the buildings. images into sunset vs. forest and mountain classes. The color
Most of the misclassified landscape images had strong vertical coherence vector provides the best accuracy of 99.2% for the
edges from tree trunks, close-ups of stems, fences, etc., that led training set and 93.9% for the test set. Color features do much
to their assignment to the city class. better than the edge direction features here, since color distribu-
We also computed the classification accuracy using the edge tions remain more or less constant for natural images (blue sky,
direction coherence vector on an independent test set of 568 green grass, trees, plants, etc). A total of 18 images were mis-
outdoor images from database D1. A total of 1177 images of classified (a classification accuracy of 96.6%) when the color
the 4181 outdoor images in database D1 contained close ups of coherence vector feature was used. We find that combining fea-
human faces. We removed these images for the city/landscape tures does not improve the classification accuracy. This shows
test. Recent advances show that faces can be detected rather that color coherence vector has sufficient discrimination ability
reliably [32]. Of the remaining test images, we extracted 568 for the problem at hand.
that were not part of database D2. The edge direction features Table VI shows the classification results for the individual
yielded an accuracy of 90.0% (57 misclassifications out of the features for forest and mountain classes (373 images). Spatial
568 images). Combining color histogram features with edge di- color moment features provide the best accuracy of 98.4% for
rection coherence vector features reduced the misclassification the training set and 93.6% on the test set. A total of 15 images
124 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 1, JANUARY 2001

TABLE IV
CLASSIFICATION ACCURACIES (IN PERCENT) FOR CITY/LANDSCAPE CLASSIFICATION; THE FEATURES ARE ABBREVIATED AS FOLLOWS: EDGE DIRECTION
HISTOGRAM (EDH), EDGE DIRECTION COHERENCE VECTOR (EDCV), COLOR HISTOGRAM (CH), AND COLOR COHERENCE VECTOR (CCV)

indoor images that were misclassified as landscapes. If a face


detector is not available and we submit all the 269 images to
the city/landscape classifier, it classifies 199 images as city
images (most indoor images have man-made structures with
strong vertical and horizontal edges) and 70 as landscape.
Since we have not yet developed a classifier to identify sunset,
forest, and mountain images from other landscape images,
in the worst case, all 70 of these images will be fed to the
sunset/forest/mountain classifier and hence, degrade the overall
classification accuracy. Fig. 8(a) and (b) was classified as
sunrise/sunset images and Fig. 8(c) was classified as a forest
image.

E. Feature Saliency
The accuracy of the individual classifiers depends on the un-
derlying low-level representation of the images. For example,
the edge direction and coherence vector features yield accura-
Fig. 7. Subset of the misclassified (a) city images and (b) landscape images cies of about 60% for the indoor/outdoor problem, yet they yield
using a combination of edge direction coherence vector and color histogram approximately 95% accuracy for the city/landscape problem.
features. The corresponding confidence values (in percent) associated with the This shows the importance of feature definition and selection.
true class are indicated.
We have empirically determined that
1) spatial color moment features are better for indoor/out-
were misclassified (a classification accuracy of 96%) when the door classification;
spatial color moment features were used. Again, the combina- 2) edge direction histograms and coherence vector features
tions of features did not perform better than the color features, have sufficient discrimination power for city/landscape
showing that these features are adequate for this problem. Note classification;
that the spatial color moment features and the color coherence 3) color moments, histograms, and coherence vectors are
vector features yield similar accuracies for the classification of more suited for the classification of landscape images.
landscape images. However, the database of 528 images is very
small to identify the best color feature for the classification of
landscape images. Using color coherence vector features in- F. Reject Option
creases the complexity of the classifiers. Introducing a reject option is useful, yet a difficult problem
in image classification. For Bayesian classifiers, the simplest
D. Error Propagation in Hierarchical Classification strategy is to reject images whose maximum a posteriori prob-
ability is below a threshold . Table VII shows the accuracies
The goal of hierarchical classification is to break a complex for the indoor/outdoor and city/landscape image classifiers with
problem into simpler problems. However, since each classifier reject option, for . The indoor/outdoor classifier used
is not perfect, the errors from a classifier located higher up in spatial color moment features and was trained on 2541 images
the tree are propagated to the lower levels. from database and tested on the entire set (6931 images).
The indoor/outdoor image classifier yielded an accuracy of The classification accuracy improved from 90.5% (no rejec-
90.5% on the entire database of 6931 images (658 images were tion) to 92.1% at 5.4% reject rate. The city/landscape classifier
misclassified). Of these, 269 images were indoor images out used edge direction coherence vector features; it was trained on
of which 229 were close-ups of people and pets. Out of the 1358 images from database and tested on the complete data-
remaining 40 images, three were classified as landscape images base (2716 images). The classification accuracy improved
and 37 were classified as city images. Fig. 8 shows these three from 95.0% (no rejection) to 95.7% at 2.1% reject rate. There
VAILAYA et al.: IMAGE CLASSIFICATION FOR CONTENT-BASED INDEXING 125

TABLE V
CLASSIFICATION ACCURACIES (IN PERCENT) FOR SUNSET/FOREST/MOUNTAIN CLASSIFICATION; SPM STANDS FOR “SPATIAL COLOR MOMENTS”

TABLE VI
CLASSIFICATION ACCURACIES (IN PERCENT) FOR FOREST/MOUNTAIN CLASSIFICATION

TABLE VII
CLASSIFIER PERFORMANCE UNDER A REJECT OPTION

Fig. 8. Indoor images misclassified as landscape.

is a clear accuracy/reject tradeoff; too much rejection may be TABLE VIII


CLASSIFICATION ACCURACIES AS A FUNCTION OF TRAINING SET SIZE ON THE
needed to further reduce the error rate. INDOOR/OUTDOOR CLASSIFIER

VII. INCREMENTAL LEARNING


It is well-known that the classification performance depends
on the training set size: the more comprehensive a training set,
the better the classification performance. Table VIII compares
the classification accuracies of the indoor/outdoor image clas-
sifier (based on spatial color moment features) as the training
set size is increased. As expected, increasing the training set
size improves the classification accuracy. When we trained the
LVQ with all the available 5081 images using the color moment
features, a classification accuracy of 95.7% (resubstitution ac-
curacy) was obtained. This shows that the classifier still has the
capacity to learn, provided additional training samples are avail-
able. The above observations illustrate the need for an incre- incrementally train the classifier on the new samples. For the
mental learning method for Bayesian classifiers. Bayesian classifier proposed above, the initial training set is
Collecting a large and representative training set is expensive, represented in terms of the codebook vectors ( ). Learning
time consuming, and sometimes not feasible. Therefore, it is not involves incrementally updating these codebook vectors as new
realistic to assume that a comprehensive training set is initially training data become available.
available. Rather, it is desirable to incorporate learning tech- One simple method to retrain the classifier is to train it with
niques in a classifier [22], [29]. As additional data become avail- the new data, i.e., start with the previously learnt codebook vec-
able, the classifier should be able to adapt, while retaining what tors and run the LVQ with the new data. This straightforward
it has already learnt. Since the training set can become extremely method, however, does not assign an appropriate weight to the
large, it may not be feasible to store all the previous data. There- previously learnt codebook. In other words, if a classifier was
fore, instead of retraining the classifier on the entire training set trained on a large number of samples and then a small number
every time new samples are collected, it is more desirable to of new samples are used to further train the classifier using the
126 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 1, JANUARY 2001

above learning paradigm, the new data will unduly influence the TABLE IX
current value of the codebook vectors. Learning with this small NAIVE APPROACH TO INCREMENTALLY TRAINING A CLASSIFIER. ACCURACIES
ARE REPORTED ON AN INDEPENDENT TEST SET OF SIZE 2540
amount of new data will in fact lead to unlearning of the distri-
bution based on previous samples. Table IX demonstrates the re-
sults of training the indoor/outdoor classifier using only the new
data. The indoor/outdoor classifier was initially trained on 1418
images and yielded an accuracy of 79.8% on an independent test
set of 2540 images. When the classifier is further trained with
350 new images, the performance on the independent test set de-
teriorates to 63.7%. When the classifier is further trained on an
additional 773 samples using the naive approach, the accuracy
on the test set slightly recovers to 72.5%. Note that when all the
available data were used ( images),
the accuracy on the independent test set was 88.2% (Table VIII).
These results show that any robust incremental learning scheme
must assign an appropriate weight to the already learnt distribu-
tion.

A. Proposed Incremental Learning Scheme

The idea behind the proposed scheme is to try to generate the


original samples from the codebook vectors and then augment
these estimated samples to the new training set. The combined
training set is then used (starting at the current codebook vec-
tors) to determine the new set of codebook vectors. This method Fig. 9. Incremental learning with synthetic Gaussian data; (3) represents the
true means; ( ) represents the initial codebook vectors learnt from 100 samples
differs from traditional bootstrapping [11] which assumes that per class; (}) represents the codebook vectors after an additional 400 samples
the original training samples are available for sampling with re- per class; and () represents the codebook vectors after 500 more samples per
placement. In our case, the new samples representing the orig- class.
inal training set are generated based on the number of training
samples, the proportion of these samples assigned to each code- the individual variances of features of the training samples
book vector ( ), and the codebook vectors themselves. Fig. 9 assigned to the respective codebook vector.
illustrates this learning paradigm for synthetic data where two- The last four methods do not enforce the condition that the gen-
dimensional samples are generated from two i.i.d. Gaussian dis- erated samples be closest to the codebook vector they are esti-
tributions with mean vectors and , respectively, mated from. The above criterion is satisfied in Case 1 since the
and identity covariance matrices. We see that as the classifier generated samples are all identical to the codebook vector. The
is incrementally trained with additional data, the new codebook number of samples generated from each codebook vector are
vectors approach the true mean vectors. the same as the number of original training samples assigned
We have used the following methods to generate (indepen- to that codebook vector. If we had chosen to use the EM al-
dent) samples from a codebook vector. gorithm to estimate mixture representations of the class-con-
• Case 1: Using duplicates of the codebook vectors as ditional densities, instead of LVQ, then, incremental learning
the samples (this is, by far, the least computationally could be achieved by using an on-line version of EM, such as
demanding case, since no samples have to be actually the one in [35].
generated).
• Case 2: Sampling from a multivariate Gaussian, with co- B. Experimental Results
variance , centered at the codebook vectors. We have tested the proposed incremental learning method
• Case 3: Same as Case 2, except that we use a diagonal with the Bayesian indoor/outdoor and city/landscape classifiers.
covariance matrix. The diagonal elements correspond to Initially, half the images from the database were used to train
the individual variances of features of the training samples a classifier. The classifier was then incrementally trained (all
assigned to the respective codebook vector. the five methods described above were tested) using the re-
• Case 4: Sampling from a multivariate Gaussian with co- maining images. The performance of a classifier trained on the
variance , centered at the mean of the training patterns entire set of database images (nonincremental learning) was also
assigned to the codebook vector. Note that each codebook measured. Table X shows the classification accuracies for the
vector need not be the mean of the samples assigned to it, various methods. The best classification accuracies achieved
as both positive and negative examples influence the code- for each of the classifiers were 95.9% for the city/landscape
book vectors (see footnote 1, Section V-B). classifier (on 2716 images) and 94.6% for the indoor/outdoor
• Case 5: Same as Case 4, except that we use a diagonal classifier (on 5081 images), versus 97.0% and 95.7%, respec-
covariance matrix. The diagonal elements correspond to tively, for the classifiers trained on the entire database. These
VAILAYA et al.: IMAGE CLASSIFICATION FOR CONTENT-BASED INDEXING 127

results show that a classifier trained incrementally achieves al- TABLE X


most similar accuracies as one trained with all the data. The five CLASSIFICATION ACCURACIES (PERCENT) WITH AND WITHOUT INCREMENTAL
LEARNING; CASE i REPRESENTS ONE OF THE INCREMENTAL METHODS; IN
methods used to regenerate “training” samples perform equally NON-INCREMENTAL, THE WHOLE DATABASE WAS USED IN TRAINING
well. Since the first method (Case 1) requires, by far, the least
additional storage (only one number denoting the total number
of training samples used to train the classifier so far) and com-
putation (no random number generation), it clearly has the best
cost/performance tradeoff.

VIII. FEATURE SUBSET SELECTION


Automatic feature subset selection is an important issue in
designing classifiers. In fact, one usually finds that the per-
formance of a classifier trained on a finite number of samples
starts deteriorating as more features are added beyond a cer-
tain number (the curse of dimensionality [4], [12], [29]). Can
the classification be improved using feature subset selection
methods? Selecting the optimal features is a problem of expo-
nential time complexity and various suboptimal heuristics have
been proposed [12], [13].
Jain and Zongker [13] studied the merits of various feature
subset selection methods. While the branch-and-bound algo-
rithm proposed in [23] is “optimal,” it requires the feature selec-
tion criterion function to be monotone (i.e., it cannot decrease
when new features are added). The above requirement may not
be true for small samples. It is thus desirable to use approximate
methods that are fast and also guarantee near optimal solutions.
Therefore, we tested the sequential floating forward selection
(SFFS) method, which was shown to be a promising alternative
where the branch-and-bound method cannot be used [28]. Fig. 10. Accuracies for the indoor/outdoor classifier trained on varying
We have also applied a simple heuristic procedure based on sized feature vectors generated by FC (from the 600-dimensional spatial color
clustering the features (using -means [11]), trying to remove moment features); the dashed, dotted, and solid lines represent, respectively,
the accuracies of the training set (2541 images), test set (2540 images), and the
redundancy. The feature components assigned to each cluster entire database (5081 images).
are then averaged to form the new feature. Thus, the number of
clusters determines the final number of features. Although this base yielded 82.2% accuracy on the test set (2540 sam-
method does not guarantee an optimal solution, it does attempt ples). The lower accuracy on larger sets agrees with the
to eliminate highly correlated features in high-dimensional fea- observations in [13] on the pitfalls of using feature subset
ture spaces. We refer to this method as the feature cluster (FC) selection on sparse data in a high-dimensional space.
method.
B. Experiments Using FC
A. Experiments Using SFFS
The spatial color moment features used for indoor/outdoor
We have experimented with feature subset selection on the in- classification (feature dimensionality of 600) were clustered to
door/outdoor classifier using the implementation of SFFS pro- generate new feature vectors of sizes 50, 75, 100, 125, 150, 175,
vided in [13]. We found the algorithm to be very slow over the and 200. The components assigned to each cluster were aver-
entire training set of 2541 training samples from database . aged to define a new feature. This approach is incomparably
We hence took 700 samples each from the training and test sets faster than SFFS, taking only a few seconds on a training set
for the feature subset selection process. Our results using SFFS size of 2541, from database . The classification accuracies
are summarized as follows. for the various feature set sizes are plotted in Fig. 10. A code-
• It took the program 12 days on a Sun Ultra 2 Model book size of 30 (optimal for the spatial color moments features)
2300 (dual 300-MHz processors) processor with 512 MB was used for all the features. The best classification accuracy
memory to select up to 67 features from the 600-dimen- of 91.8% on the entire database of 5081 images (95.2% on the
sional feature vector for the small training set of 700 training set and 88.3% on an independent test set of 2540 im-
samples. ages) was obtained with feature vectors of 75 components. Note
• The best accuracy of 87% on the independent test set of that these accuracies are marginally better than those obtained
700 samples was provided by a subset of 52 features, com- from training the classifier on the 600-dimensional spatial color
pared to the 88.2% accuracy using all the 600 features. moment features (accuracy of 88.2% on an independent test set
• Training a new classifier, with the 52 features selected by of 2540 images and an accuracy of 94.2% on the training set).
SFFS, on the 2541 samples from the training set of data- On examining the feature components that were clustered to-
128 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 1, JANUARY 2001

TABLE XI ages) and finally, a subset of landscape images are classified as


ACCURACIES FOR INDOOR/OUTDOOR CLASSIFICATION WITH THE FEATURES sunset, forest, or mountain. We have adopted a Bayesian classi-
OBTAINED BY FEATURE CLUSTERING
fication approach, using vector quantization (LVQ) to learn the
class-conditional probability densities of the features. This ap-
proach has the following advantages:
1) small number of codebook vectors represent a particular
class of images, regardless of the size of the training set;
2) it naturally allows for the integration of multiple features
through the class-conditional densities;
3) it not only provides a classification rule, but also assigns
a degree of confidence in the classification, which may be
used to build a reject option.
Classifications based on local color moments, color histograms,
gether, we found that all groupings were formed within features color coherence vectors, edge direction histograms, and edge
of neighboring image regions. These preliminary results show direction coherence vectors have shown promising results.
that clustering the features (linear combination of features) is The accuracy of the above classifiers depends on the features
more efficient and accurate than the SFFS feature subset selec- used, the number of training samples, and the classifier’s ability
tion method for very high-dimensional feature vectors. to learn the true decision boundary from the training data. We
We used MMDL to select the optimal codebook size for have developed methods for incremental learning and feature
the new feature set. The criterion selected , for the subset selection. Another challenging issue is to introduce a re-
indoor/outdoor classifier based on the 2541 training samples. ject option. In the simplest form, the a posteriori class probabili-
Therefore, we extracted 25 codebook vectors each for the ties can be used for rejection (rejecting images whose maximum
indoor and outdoor image classes under the new feature set a posteriori probability is less than a threshold, —say 0.6). We
of 75 components. This illustrates how a reduction in feature are looking at other means of adding the reject option into the
size (from 600 spatial color moment features to the new set of system. Finally, we will introduce other binary classifiers into
75 features) leads to the generation of a larger codebook (50 the system for categories such as day/night, people/nonpeople,
vectors represent the underlying density as opposed to 30 for text/nontext, etc. These classifiers can be added to the present
the full spatial color moment features). hierarchy to generate semantic indices into the database.
TABLE XI
ACCURACIES FOR INDOOR/OUTDOOR CLASSIFICATION WITH THE FEATURES OBTAINED BY FEATURE CLUSTERING

Table XI shows the accuracies for the classifier trained on these new features compared against those of the classifier trained on the full spatial color moment features. The FC method for feature selection improved the classifier performance from 91.2% to 92.4% for the indoor/outdoor problem (on a database of 5081 images), while reducing the feature vector dimensionality from 600 components to 75 components. Recall that the low-level features used for the indoor/outdoor image classification problem are extracted over a grid of subblocks in the image. Usually, neighboring subblocks in an image have similar features, as various objects span multiple subblocks (e.g., sky, forest, etc., may span a number of subblocks in many images). Other linear and nonlinear techniques for feature extraction (PCA, Discriminant Analysis, Sammon's nonlinear projection) may be as effective as FC in reducing feature dimensionality.
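As a concrete illustration of these subblock features, the sketch below computes the first two color moments (mean and standard deviation per channel) on a regular grid; the 10 x 10 grid and the choice of moments are assumptions made here so that the dimensionality comes out to 600, and they may differ from the exact feature definition.

    import numpy as np

    def spatial_color_moments(image, grid=10):
        # image: (H, W, 3) array, e.g., in the LUV or RGB color space.
        # Returns a feature vector of length grid * grid * 6 (600 for grid=10).
        H, W, _ = image.shape
        rows = np.array_split(np.arange(H), grid)
        cols = np.array_split(np.arange(W), grid)
        feats = []
        for r in rows:
            for c in cols:
                block = image[np.ix_(r, c)].reshape(-1, 3).astype(float)
                feats.extend(block.mean(axis=0))   # 3 channel means
                feats.extend(block.std(axis=0))    # 3 channel standard deviations
        return np.asarray(feats)

Because an object such as sky or forest typically covers several adjacent grid cells, the corresponding moment features are strongly correlated, which is precisely the redundancy that FC exploits.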
IX. CONCLUSION AND FUTURE WORK

User queries in content-based retrieval are typically based on semantics and not on low-level image features. Providing high-level semantic indices for large databases is a challenging problem. We have shown that certain high-level semantic categories can be learnt using specific low-level image features under the constraint that the images do belong to one of the classes under consideration. Specifically, we have developed a hierarchical classifier for vacation images. At the top level, vacation images are classified as indoor or outdoor. The outdoor images are then classified as city or landscape (we assume a face detector that separates close-up images of people in outdoor images) and, finally, a subset of landscape images is classified as sunset, forest, or mountain. We have adopted a Bayesian classification approach, using learning vector quantization (LVQ) to learn the class-conditional probability densities of the features. This approach has the following advantages (a sketch of the resulting decision rule is given after the list):
1) a small number of codebook vectors represents a particular class of images, regardless of the size of the training set;
2) it naturally allows for the integration of multiple features through the class-conditional densities;
3) it not only provides a classification rule, but also assigns a degree of confidence in the classification, which may be used to build a reject option.
Classifications based on local color moments, color histograms, color coherence vectors, edge direction histograms, and edge direction coherence vectors have shown promising results.
under the constraint that the images do belong to one of the [11] A. K. Jain and R. C. Dubes, Algorithms for Clustering
Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
classes under consideration. Specifically, we have developed a [12] A. K. Jain, R. Duin, and J. Mao, “Statistical pattern recognition: A re-
hierarchical classifier for vacation images. At the top level, va- view,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 4–38, Jan.
cation images are classified as indoor or outdoor. The outdoor 2000.
[13] A. K. Jain and D. Zongker, “Feature selection: Evaluation, application,
images are then classified as city or landscape (we assume a face and small sample performance,” IEEE Trans. Pattern Anal. Machine In-
detector that separates close-up images of people in outdoor im- tell., vol. 19, pp. 153–158, Feb. 1997.
REFERENCES

[1] I. Biederman, "On the semantics of a glance at a scene," in Perceptual Organizations, M. Kubovy and J. R. Pomerantz, Eds. Hillsdale, NJ: Lawrence Erlbaum, 1981, pp. 213–253.
[2] I. Biederman, "Aspects and extensions of a theory of human image understanding," in Computational Processes in Human Vision: An Interdisciplinary Perspective, Z. W. Pylyshyn, Ed. Norwood, NJ: Ablex, 1988, pp. 370–428.
[3] P. C. Cosman, K. L. Oehler, E. A. Riskin, and R. M. Gray, "Using vector quantization for image processing," Proc. IEEE, vol. 81, pp. 1326–1341, Sept. 1993.
[4] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[5] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz, "Efficient and effective querying by image content," J. Intell. Inform. Syst., vol. 3, pp. 231–262, 1994.
[6] M. Figueiredo and A. K. Jain, "Unsupervised selection and estimation of finite mixture models," in Proc. Int. Conf. Pattern Recognition, Barcelona, Spain, 2000.
[7] B. Furht, Ed., "Content-based image indexing and retrieval," in The Handbook of Multimedia Computing. Boca Raton, FL: CRC, 1998, ch. 13.
[8] R. M. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, pp. 4–29, Apr. 1984.
[9] R. M. Gray and R. A. Olshen, "Vector quantization and density estimation," in Proc. SEQUENCES97, 1997.
[10] A. Hampapur, A. Gupta, B. Horowitz, C. F. Shu, C. Fuller, J. Bach, M. Gorkani, and R. Jain, "Virage video engine," in Proc. SPIE Storage Retrieval Image Video Databases V, San Jose, CA, Feb. 1997, pp. 188–197.
[11] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[12] A. K. Jain, R. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 4–38, Jan. 2000.
[13] A. K. Jain and D. Zongker, "Feature selection: Evaluation, application, and small sample performance," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 153–158, Feb. 1997.
[14] T. Kohonen, J. Kangas, J. Laaksonen, and K. Torkkola, "LVQ PAK: A program package for the correct application of learning vector quantization algorithms," in Proc. Int. Joint Conf. Neural Networks, Baltimore, MD, June 1992, pp. 725–730.
[15] J. L. M. Figueiredo and A. K. Jain, "On fitting mixture models," in Energy Minimization Methods in Computer Vision and Pattern Recognition, E. Hancock and M. Pellilo, Eds. Berlin, Germany: Springer-Verlag, 1999.
[16] W. Y. Ma and B. S. Manjunath, "Image indexing using a texture dictionary," in Proc. SPIE Conf. Image Storage Archiving Systems, vol. 2606, Philadelphia, PA, Oct. 1995, pp. 288–298.
[17] W. Y. Ma and B. S. Manjunath, "Netra: A toolbox for navigating large image databases," in Proc. IEEE Int. Conf. Image Processing, vol. 1, Santa Barbara, CA, Oct. 1997, pp. 568–571.
[18] J. Mao and A. K. Jain, "Texture classification and segmentation using multiresolution simultaneous autoregressive models," Pattern Recognit., vol. 25, no. 2, pp. 173–188, 1992.
[19] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: Wiley, 1997.
[20] S. Mehrotra, Y. Rui, M. Ortega, and T. S. Huang, "Supporting content-based queries over images in MARS," in Proc. IEEE Int. Conf. Multimedia Computing Systems, ON, Canada, June 3–6, 1997, pp. 632–633.
[21] T. P. Minka and R. W. Picard, "Interactive learning using a society of models," Pattern Recognit., vol. 30, no. 4, p. 565, 1997.
[22] T. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[23] P. M. Narendra and K. Fukunaga, "A branch and bound algorithm for feature subset selection," IEEE Trans. Comput., vol. 26, pp. 917–922, Sept. 1977.
[24] T. V. Papathomas, T. E. Conway, I. J. Cox, J. Ghosn, M. L. Miller, T. P. Minka, and P. N. Yianilos, "Psychophysical studies of the performance of an image database retrieval system," in Proc. IS&T/SPIE Conf. Human Vision Electronic Imaging III, San Jose, CA, July 1998, pp. 591–602.
[25] G. Pass, R. Zabih, and J. Miller, "Comparing images using color coherence vectors," in Proc. 4th ACM Conf. Multimedia, Boston, MA, Nov. 1996, http://simon.cs.cornell.edu/Info/People/rdz/rdz.html.
[26] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," Proc. SPIE Storage Retrieval Image Video Databases II, pp. 34–47, Feb. 1994.
[27] R. W. Picard and T. P. Minka, "Vision texture for annotation," Multimedia Syst., vol. 3, pp. 3–14, 1995.
[28] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognit. Lett., vol. 15, pp. 1119–1125, Nov. 1994.
[29] B. Ripley, Pattern Recognition and Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[30] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1989.
[31] B. E. Rogowitz, T. Frese, J. Smith, C. A. Bouman, and E. Kalin, "Perceptual image similarity experiments," in Proc. IS&T/SPIE Conf. Human Vision Electronic Imaging III, San Jose, CA, July 1998, pp. 576–590.
[32] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 23–38, Jan. 1998.
[33] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool for interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 644–655, Sept. 1998.
[34] P. G. Schyns and A. Oliva, "From blobs to boundary edges: Evidence for time and spatial scale dependent scene recognition," Psychol. Sci., vol. 5, pp. 195–200, 1994.
[35] Y. Singer and M. Warmuth, "A new parameter estimation method for Gaussian mixtures," in Advances in Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla, and D. A. Cohn, Eds. Cambridge, MA: MIT Press, 1999.
[36] J. R. Smith and S. F. Chang, "Visualseek: A fully automated content-based image query system," in Proc. ACM Multimedia, Boston, MA, Nov. 1996, pp. 87–98.
[37] M. Stricker and A. Dimai, "Color indexing with weak spatial constraints," in Proc. SPIE Storage Retrieval Image Video Databases IV, San Jose, CA, Feb. 1996, pp. 29–41.
[38] M. J. Swain and D. H. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.
[39] M. Szummer and R. W. Picard, "Indoor-outdoor image classification," in IEEE Int. Workshop Content-Based Access Image Video Databases (in conjunction with ICCV'98), Bombay, India, Jan. 1998.
[40] A. Vailaya, M. Figueiredo, A. Jain, and H.-J. Zhang, "A Bayesian framework for semantic classification of outdoor vacation images," in Proc. SPIE Storage Retrieval Image Video Databases VII, vol. 3656, San Jose, CA, Jan. 1999, pp. 415–426.
[41] A. Vailaya, M. Figueiredo, A. Jain, and H.-J. Zhang, "Content-based hierarchical classification of vacation images," in Proc. IEEE Multimedia Systems'99, vol. 1, Florence, Italy, June 7–11, 1999, pp. 518–523.
[42] A. Vailaya, A. K. Jain, and H. J. Zhang, "On image classification: City images vs. landscapes," Pattern Recognit., vol. 31, no. 12, pp. 1921–1936, 1998.
[43] N. Vasconcelos and A. Lippman, "Library-based coding: A representation for efficient video compression and retrieval," in Data Compression Conf. '97, Snowbird, UT, 1997.
[44] H. J. Zhang, C. Y. Low, S. W. Smoliar, and J. H. Wu, "Video parsing retrieval and browsing: An integrated and content-based solution," in Proc. ACM Multimedia '95, San Francisco, CA, Nov. 5–9, 1995, pp. 15–24.
[45] H. J. Zhang and D. Zhong, "A scheme for visual feature based image indexing," in Proc. SPIE Conf. Storage Retrieval Image Video Databases, San Jose, CA, Feb. 1995, pp. 36–46.
[46] D. Zhong, H. J. Zhang, and S.-F. Chang, "Clustering methods for video browsing and annotation," in Proc. SPIE Storage Retrieval Image Video Databases IV, San Jose, CA, Feb. 1996, pp. 239–246.

Aditya Vailaya (A'00) received the B.Tech degree from the Indian Institute of Technology, Delhi, in 1994 and the M.S. and Ph.D. degrees from Michigan State University, East Lansing, in 1996 and 2000, respectively.
He joined Agilent Laboratories, Palo Alto, CA, in May 2000, where he is currently applying pattern recognition techniques for decision support in bioscience research. His research interests include pattern recognition and classification, machine learning, image and video databases, and image understanding.
Dr. Vailaya received the Best Student Paper Award from the IEEE International Conference on Image Processing in 1999.

Mário A. T. Figueiredo (S'87–M'95) received the E.E., M.S. and Ph.D. degrees in electrical and computer engineering, all from the Higher Institute of Technology [Instituto Superior Tecnico (IST)], Technical University of Lisbon, Lisbon, Portugal, in 1985, 1990, and 1994, respectively.
Since 1994, he has been an Assistant Professor with the Department of Electrical and Computer Engineering, IST. He is also a Researcher with the Communication Theory and Pattern Recognition Group, Institute of Telecommunications, Lisbon. In 1998, he held a visiting position with the Department of Computer Science and Engineering, Michigan State University, East Lansing. His scientific interests are in the fields of image analysis, computer vision, statistical pattern recognition, and information theory.
Dr. Figueiredo received the Portuguese IBM Scientific Prize in 1995.
Anil K. Jain (S'70–M'72–SM'86–F'91) is a University Distinguished Professor with the Department of Computer Science and Engineering, Michigan State University, East Lansing. He served as the Department Chair from 1995 to 1999. His research interests include statistical pattern recognition, Markov random fields, texture analysis, neural networks, document image analysis, fingerprint matching, and 3-D object recognition. He is the co-author of Algorithms for Clustering Data (Englewood Cliffs, NJ: Prentice-Hall, 1988), edited the book Real-Time Object Measurement and Classification (Berlin, Germany: Springer-Verlag, 1988), and co-edited the books Analysis and Interpretation of Range Images (Berlin, Germany: Springer-Verlag, 1989), Markov Random Fields (New York: Academic, 1992), Artificial Neural Networks and Pattern Recognition (Amsterdam, The Netherlands: Elsevier, 1993), 3D Object Recognition (Amsterdam, The Netherlands: Elsevier, 1993), and BIOMETRICS: Personal Identification in Networked Society (Norwell, MA: Kluwer, 1999).
Dr. Jain received the Best Paper Awards in 1987 and 1991 and certificates for outstanding contributions in 1976, 1979, 1992, and 1997 from the Pattern Recognition Society. He also received the 1996 IEEE TRANSACTIONS ON NEURAL NETWORKS Outstanding Paper Award. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (1990–1994). He is a Fellow of the IAPR. He received a Fulbright Research Award in 1998.

Hong-Jiang Zhang (S'90–M'91–SM'97) received the B.S. degree from Zhengzhou University, China, in 1982 and the Ph.D. degree from the Technical University of Denmark, Lyngby, in 1991, both in electrical engineering.
In 1999, he joined Microsoft Research China, Beijing, as a Senior Researcher/Research Manager. He was previously with Hewlett-Packard Labs, Palo Alto, CA, where he was a Research Manager, performing research and development in the areas of multimedia content retrieval and management technologies, intelligent image processing and video coding, and Internet media. Before joining Hewlett-Packard Labs, he was with the Institute of Systems Science, National University of Singapore, where he led several projects in video and image content analysis and retrieval, computer vision, and multimedia information systems. He was with the Massachusetts Institute of Technology Media Lab in 1994 as a Visiting Researcher. He has authored two books, about 100 papers and book chapters, and numerous special issues of professional journals in multimedia processing, content-based retrieval, and Internet media. He has served on committees of more than 40 international conferences. He was the Program Committee Co-Chair of the ACM Multimedia Conference in 1999. His interests are in the areas of video and image analysis, processing and retrieval, media compression and streaming, Internet multimedia, computer vision, and their applications.
Dr. Zhang currently serves on the editorial boards of five international journals, including IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE TRANSACTIONS ON MULTIMEDIA, and IEEE MULTIMEDIA MAGAZINE.
