2011-Structural Image Classification With Graph Neural Networks
Abstract—Many approaches to image classification tend to transform an image into an unstructured set of numeric feature vectors obtained globally and/or locally, and as a result lose important relational information between regions. In order to encode the geometric relationships between image regions, we propose a variety of structural image representations that are not specialised for any particular image category. Besides the traditional grid-partitioning and global segmentation methods, we investigate the use of local scale-invariant region detectors. Regions are connected based not only upon nearest-neighbour heuristics, but also upon minimum spanning trees and Delaunay triangulation. In order to maintain the topological and spatial relationships between regions, and also to effectively process undirected connections represented as graphs, we utilise the recently-proposed graph neural network model. To the best of our knowledge, this is the first utilisation of the model to process graph structures based on local-sampling techniques for the task of image classification. Our experimental results demonstrate great potential for further work in this domain.

Keywords-Image classification, structural representation, graph neural networks, region adjacency graph, minimum spanning tree, Delaunay triangulation.

I. INTRODUCTION

Current popular image representations have shifted to the use of local invariant features, which are more robust to noise and are able to handle various common photometric and geometric image transformations (for example, lighting changes and viewpoint differences), in comparison to their global counterparts. One such representation, which has demonstrated great success in image classification and retrieval, is known as the bag-of-features or bag-of-visual-words model [1][2][3]. However, a key disadvantage of such an orderless and unstructured model is the loss of structural information, namely, the relationships between regions. Another popular representation is the constellation model, which involves connecting a fixed and limited number of parts (typically six or seven) into a pre-defined structure such as the fully-connected or star model [4]. Unfortunately, the formation of these models involves applying very precise geometric constraints on the feature locations, and the limited number of parts means ignoring a good deal of the information available in images. Meanwhile, structural representations allow an arbitrary number of features to be situated at varying locations and tend to deal with a larger number of regions. These regions are often identified based on global segmentation or global partitioning techniques [5] and are usually connected as a region adjacency graph (RAG) [6][7] or simple tree structures [8]. While the use of local features is generally favoured for its segmentation-free ability to locate distinctive regions, global features are able to capture the "gist" of an image and supply a rich set of cues to its image category [9].

Thus, we explore variants of structural approaches that can handle models with hundreds of regions. In order to be able to select distinctive regions in the context of common photometric and geometric image transformations, we consider region detection with local scale-invariant regions and compare this approach with global image segmentation and partitioning methods.

Graphs are natural data structures for modelling relationships, with nodes representing regions and edges encoding the relationships between them. Images within the same category often possess a similar structure (for example, hair-eyes-nose-mouth in the category face). In addition, the spatially-proximate regions in an image can be connected in a variety of loose geometric assemblies. However, finding an optimal geometric model or graph suitable for use across all image categories is combinatorially expensive.

Traditional machine learning models used for classification problems cope with graph-structured data by performing a preprocessing stage that maps the graph information into a simpler representation, such as a numeric vector of floating-point values [10]. The "flattened" list-based data loses some important topological relationships between the structural components (e.g. nodes), and the final result is ultimately dependent on the details of the preprocessing algorithm used. Recently, the graph neural network (GNN) model [11] has been proposed to perform supervised learning on graph data structures. It has been successfully applied to a number of applications, such as web search [12], text mining [13][14], object localization [15], and image classification [6][7].

In this paper, we investigate various graph formations with different node selections and edge connections, which will enable an integration of both visual features and structural context. We begin by constructing undirected graphs out of some commonly-used structures in image analysis: a 4-connected uniformly-sampled grid and the RAG. These are compared to two well-known graph structures, the minimum spanning tree (MST) [16] and the Delaunay triangulation [17], constructed with local regions. In order to preserve the graph structure of an image and incorporate the topological relationships between
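As a concrete sketch of two of the graph formations named above, the MST and Delaunay edge sets over a set of detected region centres can be computed with standard computational-geometry routines. This is our own illustration using SciPy, not the paper's implementation; the function names and the toy point set are our choices:

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def delaunay_edges(points):
    """Undirected edges of the Delaunay triangulation of 2-D points."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:          # each simplex is a triangle (i, j, k)
        for a in range(3):
            for b in range(a + 1, 3):
                i, j = sorted((simplex[a], simplex[b]))
                edges.add((i, j))
    return sorted(edges)

def mst_edges(points):
    """Undirected edges of the Euclidean minimum spanning tree."""
    dist = squareform(pdist(points))       # dense pairwise-distance matrix
    mst = minimum_spanning_tree(dist).tocoo()
    return sorted((min(i, j), max(i, j)) for i, j in zip(mst.row, mst.col))

# Toy region centres standing in for detector output.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.2, 1.1]])
print(len(mst_edges(points)))       # an MST over n points always has n - 1 edges
print(len(delaunay_edges(points)))  # the triangulation adds further edges
```

As the paper notes, the Delaunay triangulation yields the densest edge set of the structures considered, while the MST yields the sparsest connected one.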
centers of the detected regions make up the set of points on which to create the Delaunay triangulation. Among the four structural representations mentioned, the Delaunay triangulation creates the most edges per node in an image.

B. Node and Edge Labels

Given a set of regions extracted from an image, constructing a fully-labelled graph also involves the selection of suitable node and edge attributes for the respective node and edge labels. Node labels incorporate features extracted from interest regions. Edge labels incorporate inter-node spatial information that is discarded in traditional set-based representations.

Fig. 2. Node and edge labelling method. Each node n has co-ordinates (x, y) and scale s; l_n is a node label, and l_(n_i, n_j) is an edge label.

Our labelling method is illustrated schematically in Fig. 2. Each region yields a single node n, which is defined by the center co-ordinates (x, y) and the scale s of the region. This scale is automatically determined by the scale-invariant Hessian-Laplace detector. The label l_n of a node n is a 128-dimensional SIFT (Scale-Invariant Feature Transform) descriptor [19], which is based on grey-level gradient intensities. Each edge is labelled with three attributes derived from the positions and labels of the two nodes (n_i, n_j) it connects. We employ edge labels similar to those used by Revaud et al. [22] for object recognition:

• dist(n_i, n_j) / (s_i + s_j), the Euclidean distance between the nodes n_i and n_j, normalised with respect to their scales. The denominator is always greater than 1.
• |s_i − s_j| / max(s_i, s_j), the normalised scale difference.
• dist(l_{n_i}, l_{n_j}), the normalised Euclidean distance between the descriptors of the node labels.

C. Node Filtering

With each interest region being represented by a node, this approach may become computationally infeasible when the number rises to the hundreds-to-thousands per image. This is especially the case when using the local scale-invariant detector, since multiple regions can be detected at different scales around the same locations. These identified regions of interest can contain recurring patterns in different attribute spaces, such as recurring colours, textures, or SIFT descriptors. In order to reduce the quantity of nodes in each image graph, we employ a clustering method, which heuristically identifies nodes with similar patterns and selects a set of representative nodes as part of our node filtering method.

Firstly, for each image category in the dataset, a set of k representative nodes is pre-selected based upon clustering. In our experiments, we set k to 150 and perform clustering on 100 training images. With k clusters computed per-category across the whole dataset, each cluster yields one representative node, the centroid of that cluster. Each node in an image-graph is associated, based on a criterion, with a node from the representative set. We use the criterion of minimum Euclidean distance, though this can be substituted with other metrics such as correlation or Mahalanobis distance. At this point, each node in a graph is associated with one centroid. In each cluster, only the node closest to the centroid is retained. Finally, the number of nodes per graph is reduced to at most the number of clusters, k. The key advantage of employing this filtering technique is that clustering need be performed only once per-category, thus allowing new categories to be added without re-computing the clusters. We apply further heuristics to remove non-meaningful clusters (such as clusters with few node members and outliers based on a threshold), which leaves behind 85–100 final clusters per category.

III. CLASSIFICATION WITH GRAPH NEURAL NETWORKS

The graph neural network (GNN) model [23][11] extends existing neural network methods for processing graph-structured data. Unlike traditional recursive neural network (RNN) models [24][25], whose input domain consists of directed acyclic graphs, GNNs are able to process a wider, general class of graphs, including acyclic, cyclic, directed and undirected. Additionally, GNNs are able to process both positional and non-positional graphs.

The underlying idea of the GNN model is that nodes in a graph can represent objects or concepts, and edges represent their relationships. Information attached to a node n, called a node label (l_n), usually includes features of the object (for example, area and colour intensity). Similarly, an edge label (l_(n_1, n_2)) includes features of the object relationships (such as distance and angle).

For each node n, the GNN defines a state x_n which is attached to each node, based on the information contained in the neighbourhood of n (see Fig. 3). The state x_n contains a representation of the concept being modelled and can be used to produce an output decision value o_n. For the task of image classification, this output value can be interpreted as a confidence value that a particular node belongs to a category or class. In the RNN, an output state is only attached to a single supersource node, which has to be explicitly selected in the input domain. In contrast, the GNN allows supervision to be placed on every node, or a subset of them, to each produce an output decision value. For each iteration and for each node in the graph, the connection weights are used to adapt the overall network to fit the desired targets. The weights are updated by the resilient back-propagation strategy [26], which is one of the most efficient strategies for feedforward neural architectures.

IV. EXPERIMENTS

A. Experimental Settings

Evaluation of the structural representations is performed upon four widely-varying unnormalized image categories from
the dataset collected by Fergus et al. [27]. A small selection of images (resized to similar heights for the purpose of presentation) from each of the four classes is shown in Fig. 4. Some images were extracted from the benchmark Caltech database [28] and others were collected from Google's image search, and are highly variable in nature.

Fig. 3. Each unique node is represented by a number. The bold node, node 1, has a state (x_1), which depends on its label (l_1), the labels of its connecting edges (l_co[1]), the states of neighbouring nodes (x_ne[1]) and their labels (l_ne[1]) [11].

Fig. 4. Sample images from the dataset.

We followed a similar experimental setup to that used by Di Massa [6] with the same image dataset. For each category, a subset of 350 images was randomly selected from the original dataset. The selected images were split into two equally-sized positive and negative sets. For the negative half, we performed stratification, which ensured that each category was represented in approximately equal proportion, in order to avoid learning bias towards any particular category.

To evaluate the performance of the classification task, we performed a holdout method on the dataset. This involved splitting the per-category image subset into a training set, validation set and test set containing 150, 50 and 150 images respectively. Due to the random initialisation of weights in the GNN model, we report the average performance over five experimental runs. Each run repeated the training and classification process with different image subsets from the entire dataset. In each run, the one-against-all strategy was used to build a separate model for each category, in which each model was trained to distinguish the images of that category (the positive class) from the images of the other categories (the negative class).

Instead of selecting one representative node to be supervised per graph, we set all nodes per graph to be supervised. While this increased the learning time significantly, we chose not to constrain the model with one supersource node, which is required in common RNNs [24]. This decision was further supported by Di Massa's [7] experimental results, in which it was noted that more stable results were achieved by supervising more nodes. By supervising all nodes, each node in a graph produced its own classified output-value between −1 and 1. In order to report the graph-focused or image-level result, for each input graph we simply averaged all its individual node output-values. We note that other heuristics may be utilised to obtain a graph-focused result (such as removing outlier node output-values or introducing thresholds), but we have left this experimentation for future work.

We configured the GNN parameters to the commonly-used two hidden layers, with ten hidden neurons and a linear activation function, matching the setup used in [6]. The maximum number of iterations was set to 2500, and to reduce over-fitting, evaluation of the GNN model parameters was performed every 20 epochs with the validation set [11]. We defined the cost to be the mean squared error between the GNN's output values and the target values (1 or −1) from the validation set. The model that achieved the lowest cost on the validation set was considered the optimal model, and was then applied to the test set.

Rather than using a fixed classification threshold on the output values, we calculated the Receiver Operating Characteristics (ROC) and reported the Area Under the ROC Curve (AUC) [29]. Experiments were conducted with Matlab 7.9 on a Linux system running on a quad-core 2.4GHz Intel CPU with 16GB of memory. Due to the current implementation of the GNN emulator, dataset sizes and the number of classes had to be limited. This is due to the oversized matrices created internally with larger input datasets, which hinder Matlab from performing the necessary complex mathematical operations on them. This issue is exacerbated by our chosen approach of supervising all nodes per graph, and also by the large quantities of nodes and edges in some structural image representations.

B. Experimental Results and Discussions

Fig. 5 presents the average AUC performances of the four structural representations explained above.
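The threshold-free evaluation described above can be sketched in a few lines: per-node output values are averaged into a graph-level score, which is then ranked by the AUC. This is our own illustration using scikit-learn rather than the authors' Matlab setup, and the toy node outputs are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def graph_score(node_outputs):
    """Average the per-node output values (each in [-1, 1]) into a single
    graph-level confidence score, as described in the text."""
    return float(np.mean(node_outputs))

# Hypothetical per-graph node outputs from a trained one-against-all model.
graphs = [
    ([0.8, 0.6, 0.9, 0.4], 1),    # positive-class image
    ([0.2, -0.5, 0.1, -0.3], 0),  # negative-class image
    ([0.7, 0.9, 0.5], 1),
    ([-0.8, -0.6, -0.9], 0),
]
scores = [graph_score(outputs) for outputs, _ in graphs]
labels = [label for _, label in graphs]

# The AUC ranks the graph-level scores instead of applying a fixed cut-off.
print(roc_auc_score(labels, scores))  # 1.0 for this perfectly-separated toy set
```

Outlier removal or thresholding, mentioned in the text as future work, would replace only the `graph_score` aggregation step.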
For the MST and Delaunay structural representations, we present the clustering-based node-filtering results, denoted with Clust, alongside the non-filtered results, denoted with All.

objects and backgrounds are very diverse and do not share a strong correlation.
VI. ACKNOWLEDGEMENTS

The work presented in this paper was partially supported by ARC (Australian Research Council) grants. A/Prof Zhang was affiliated with NICTA (National ICT Australia) during the course of these experiments. The authors would like to thank S. Zhang and M. Hagenbuchner for their assistance with the GNN.

REFERENCES

[1] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in European Conference on Computer Vision, Workshop on Statistical Learning in Computer Vision, Prague, Czech Republic, May 2004.
[2] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in IEEE International Conference on Computer Vision, Washington, DC, USA, 2003, pp. 1470–1477.
[3] P. Tirilly, V. Claveau, and P. Gros, "Language modeling for bag-of-visual-words image categorization," in ACM International Conference on Image and Video Retrieval, Niagara Falls, Ontario, Canada, July 2008, pp. 249–258.
[4] X. Cheng, Y. Hu, and L.-T. Chia, "Hierarchical word image representation for parts-based object recognition," in IEEE International Conference on Image Processing, November 2009, pp. 301–304.
[5] C. Jiang and F. Coenen, "Graph-based image classification by weighting scheme," in Proceedings of Artificial Intelligence. Springer, 2008, pp. 63–76.
[6] V. Di Massa, G. Monfardini, L. Sarti, F. Scarselli, M. Maggini, and M. Gori, "A comparison between recursive neural networks and graph neural networks," in International Joint Conference on Neural Networks, 2006, pp. 778–785.
[7] V. Di Massa, "Graph neural networks, image classification and object recognition," Ph.D. dissertation, Università degli Studi di Siena (Dipartimento di Ingegneria dell'Informazione), Siena, Italy, 2008.
[8] Z. Wang, D. Feng, and Z. Chi, "Comparison of image partition methods for adaptive image categorization based on structural image representation," in 8th International Conference on Control, Automation, Robotics and Vision, vol. 1, Kunming, China, December 2004, pp. 676–680.
[9] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, pp. 145–175, 2001.
[10] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ, USA: Prentice Hall, 1998.
[11] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[12] F. Scarselli, S. L. Yong, M. Gori, M. Hagenbuchner, A. C. Tsoi, and M. Maggini, "Graph neural networks for ranking web pages," in International Conference on Web Intelligence, Washington, DC, USA, 2005, pp. 666–672.
[13] S. L. Yong, M. Hagenbuchner, A. C. Tsoi, F. Scarselli, and M. Gori, "Document mining using graph neural network," in Proceedings of the 5th Initiative for the Evaluation of XML Retrieval Workshop, N. Fuhr, M. Lalmas, and A. Trotman, Eds., 2007, pp. 458–472.
[14] R. Chau, A. C. Tsoi, M. Hagenbuchner, and V. C. S. Lee, "A conceptlink graph for text structure mining," in Proceedings of the Thirty-Second Australasian Conference on Computer Science, Wellington, New Zealand, January 2009, pp. 129–137.
[15] G. Monfardini, V. Di Massa, F. Scarselli, and M. Gori, "Graph neural networks for object localization," in European Conference on Artificial Intelligence, Riva del Garda, Italy, August 2006, pp. 665–669.
[16] R. C. Prim, "Shortest connection networks and some generalizations," Bell System Technical Journal, vol. 36, pp. 1389–1401, 1957.
[17] B. Delaunay, "Sur la sphère vide," Izv. Akad. Nauk SSSR, Otdelenie Matematicheskikh i Estestvennykh Nauk, vol. 7, pp. 793–800, 1934.
[18] W.-Y. Ma and B. Manjunath, "EdgeFlow: A technique for boundary detection and image segmentation," IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1375–1388, August 2000.
[19] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[20] A. W. Fitzgibbon, M. Pilu, and R. B. Fisher, "Direct least-squares fitting of ellipses," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 476–480, May 1999.
[21] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool, "A comparison of affine region detectors," International Journal of Computer Vision, vol. 65, no. 1–2, pp. 43–72, 2005.
[22] J. Revaud, G. Lavoué, Y. Ariki, and A. Baskurt, "Scale-invariant proximity graph for fast probabilistic object recognition," in Proceedings of the ACM International Conference on Image and Video Retrieval, 2010, pp. 414–421.
[23] G. Monfardini, "A recursive model for neural processing in graphical domains," Ph.D. dissertation, Università degli Studi di Siena (Dipartimento di Ingegneria dell'Informazione), Siena, Italy, 2007.
[24] P. Frasconi, M. Gori, and A. Sperduti, "A general framework for adaptive processing of data structures," IEEE Transactions on Neural Networks, vol. 9, no. 5, pp. 768–786, September 1998.
[25] M. Bianchini, M. Gori, and F. Scarselli, "Processing directed acyclic graphs with recursive neural networks," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1464–1470, November 2001.
[26] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in IEEE International Conference on Neural Networks, 1993, pp. 586–591.
[27] R. Fergus, P. Perona, and A. Zisserman, "A sparse object category model for efficient learning and exhaustive recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, Washington, DC, USA, 2005, pp. 380–387.
[28] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Generative-Model Based Vision, 2004.
[29] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, April 1982.
[30] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, September 2010.
[31] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek, "Evaluating color descriptors for object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1582–1596, August 2010.