Histopathology Image Classification Usin
1 Introduction
Medical imaging applications are challenging because they require effective and
efficient content representations to manage large image collections. The first
stage for medical image analysis is modeling image contents by defining an ap-
propriate representation. This is a fundamental problem for all image analysis
tasks such as image classification, automatic image annotation, object recogni-
tion and image retrieval, which require discriminative representations according
to the application domain. During the last few years, the bag of features ap-
proach has attracted great attention from the computer vision community. This
approach is an evolution of texton-based representations and is also influenced
by the bag of words assumption in text classification. In text documents, a word
dictionary is defined and all documents are processed so that the frequency of
each word is quantified. This representation ignores word relationships in the
document, i.e., it does not take into account the document structure. An anal-
ogy is defined for images in which a feature dictionary is built to identify visual
patterns in the collection. This representation has been shown to be effective in
different image classification, categorization and retrieval tasks [1,2,3].
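The word-frequency representation described above can be illustrated with a minimal sketch. The dictionary and the two toy documents are hypothetical and chosen only to show that word order is ignored; the function name `bag_of_words` is not taken from any cited work.

```python
from collections import Counter

def bag_of_words(document, dictionary):
    """Count how often each dictionary word occurs, ignoring word order."""
    counts = Counter(document.lower().split())
    # A Counter returns 0 for words that never appear in the document.
    return [counts[word] for word in dictionary]

# Hypothetical dictionary and documents, for illustration only.
dictionary = ["gland", "tissue", "cell"]
doc_a = "cell gland cell tissue"
doc_b = "tissue cell cell gland"

print(bag_of_words(doc_a, dictionary))  # [1, 1, 2]
print(bag_of_words(doc_b, dictionary))  # [1, 1, 2]
```

Both documents map to the same vector even though their word order differs, which is exactly the structural information that the representation discards.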
The bag of features representation is an adaptive approach to model image
structure in a robust way. In contrast to image segmentation, this approach does
not attempt to identify complete objects inside images, which may be a harder
task than the image classification itself. Instead, the bag of features approach
looks for small characteristic image regions allowing the representation of com-
plex image contents without explicitly modeling objects and their relationships,
a task that is tackled in another stage of image content analysis. In addition,
an important advantage of the bag of features approach is its adaptiveness to
the particular image collection to be processed. As with text documents, for
which the appropriate word list to be included in the dictionary may be
identified early in the process, the bag of features approach makes it possible
to identify visual patterns that are relevant to the whole image collection. That is, the
patterns that are used in a single image representation come from the analysis
of patterns in the complete collection. Other important characteristics of this
approach are the robustness to occlusion and affine transformations as well as
its computational efficiency [2].
Some of these properties are particularly useful for medical image analysis
and, in fact, the bag of features representation has been successfully applied to
some problems in medical imaging [4,5]. Histopathology images have a particular
structure, with a few colors, particular edges and a wide variety of textures. Also,
objects such as glands or tissues may appear anywhere in the image, in different
proportions and at different zoom levels. All those properties make the bag of
features a potentially appropriate representation for that kind of visual content.
To the best of our knowledge, the bag of features representation has not yet been
evaluated on histopathology images, and that is the main goal of this paper.
This paper presents a systematic evaluation of different representations ob-
tained from the bag of features approach to classify histopathology images. There
are different possibilities to design an image descriptor using the bag of features
framework, and each one leads to different image representations that may be
more or less discriminative. In addition, the obtained image descriptors are pro-
cessed using two kernel functions for Support Vector Machine classifiers. The
performed experiments allow us to analyze the impact of different strategies on the
final classification result. The paper is organized as follows: Section 2 presents
the previous work on histopathology image classification. Section 3 describes
the bag of features methodology and all applied techniques. Section 4 presents
experimental results, and finally the concluding remarks are in Section 5.
codebook, that is, a visual vocabulary, in which the most representative patterns
are codified as codewords or visual words. Then, the image representation is
generated through a simple frequency analysis of each codeword inside the image.
Csurka et al. [2] describe four steps to classify images using a bag of features
representation: (1) feature detection and description, (2) codebook generation,
(3) bag of features construction and, finally, (4) training of learning algorithms.
Figure 1 shows an overview of those steps. The bag of features approach is a
flexible and adaptable framework, since each step may be determined by different
techniques according to the application domain needs. The following subsections
present the particular methods and techniques that have been evaluated in this
work.
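The first three steps can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation evaluated in this work: features are raw non-overlapping blocks of a grayscale image given as a list of rows, the codebook is built with a basic k-means, and all function names and the toy images are hypothetical. Step (4), training a classifier on the resulting histograms, is omitted.

```python
import random

def extract_blocks(image, size=2):
    """Step 1: describe an image by its non-overlapping size x size raw blocks."""
    rows, cols = len(image), len(image[0])
    blocks = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            blocks.append([image[r + i][c + j]
                           for i in range(size) for j in range(size)])
    return blocks

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(block, codebook):
    """Index of the codeword closest to the given block."""
    return min(range(len(codebook)), key=lambda k: sq_dist(block, codebook[k]))

def build_codebook(blocks, k, iterations=10, seed=0):
    """Step 2: basic k-means over blocks pooled from the whole collection."""
    rng = random.Random(seed)
    # Start from k distinct blocks so that no two initial codewords coincide
    # (this raises ValueError if fewer than k distinct blocks exist).
    distinct = sorted(set(tuple(b) for b in blocks))
    codebook = [list(b) for b in rng.sample(distinct, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for b in blocks:
            clusters[nearest(b, codebook)].append(b)
        for i, members in enumerate(clusters):
            if members:  # keep the previous codeword if a cluster is empty
                codebook[i] = [sum(v) / len(members) for v in zip(*members)]
    return codebook

def bag_of_features(image, codebook):
    """Step 3: normalized histogram of codeword occurrences in one image."""
    blocks = extract_blocks(image)
    hist = [0] * len(codebook)
    for b in blocks:
        hist[nearest(b, codebook)] += 1
    return [h / len(blocks) for h in hist]

# Hypothetical 4x4 grayscale images, for illustration only.
dark = [[0] * 4 for _ in range(4)]
bright = [[255] * 4 for _ in range(4)]
mixed = [[0] * 4, [0] * 4, [255] * 4, [255] * 4]

collection = [dark, bright, mixed]
pooled = [b for img in collection for b in extract_blocks(img)]
codebook = build_codebook(pooled, k=2)

for name, img in [("dark", dark), ("bright", bright), ("mixed", mixed)]:
    print(name, bag_of_features(img, codebook))
```

Note that the codebook is learned from blocks pooled over the whole collection, while each histogram describes a single image, which mirrors the adaptiveness to the collection discussed in the introduction.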
4 Experimental Evaluation
4.1 Experimental setup
The collection has 1,502 histopathology images with examples of 18 different
concepts. The collection is split into 2 datasets, the first one for training and
validation, and the second one for testing. The dataset partition is done us-
ing stratified sampling in order to preserve the original distribution of examples
in both datasets. This is particularly important due to the high imbalance of
classes. For the same reason, the performance measures reported in this work are
precision, recall and F-measure, which evaluate the detection rate of positive
examples, since the class imbalance may produce trivial classifiers with high accuracy
that do not recognize any positive sample. In addition, since one image can be
classified in many classes simultaneously, the classification strategy is based on
binary classifiers following the one-against-all rule. We ran experiments to
evaluate different configurations of the bag of features approach. For each
experiment, the regularization parameter of the SVM is selected using 10-fold
cross validation on the training dataset, to promote good generalization on the
test dataset. Reported results are calculated on the test dataset and averaged
over all 18 classes.
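The effect of class imbalance on accuracy can be made concrete with a small sketch. The label vectors below are hypothetical (5 positives among 100 images for one binary one-against-all problem), and returning 0 when precision or recall is undefined is an assumed convention, not one stated in this work.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall and F-measure for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: undefined ratios are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced binary problem: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # trivial classifier that never detects the class

print(accuracy(y_true, always_negative))            # 0.95
print(precision_recall_f(y_true, always_negative))  # (0.0, 0.0, 0.0)
```

The trivial classifier reaches 95% accuracy without recognizing a single positive example, whereas its precision, recall and F-measure are all zero.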
4.2 Results
The first evaluation focuses on the codebook size. We have tested six different
codebook sizes, starting with 50 codeblocks and following with 150, 250, 500,
750 and 1000. Figure 3 shows a plot for codebook size vs. F-measure using two
different bag of features configurations. The first strategy is based on SIFT points
and the second is based on raw blocks. Perhaps surprisingly, the plot shows that
the classification performance decreases as the codebook size increases. This
behaviour is explained by the intrinsic structure of histopathology images: they
are usually composed of a few kinds of textures, that is, the number of distinctive
patterns in the collection is limited. This fact can also be seen in the codebook
illustrated in Figure 2, which shows several repeated patterns, even with just
150 codeblocks. In the limit, it is reasonable to expect that a very large codebook
has no discriminative power, because each different pattern in an image has its
own codeblock.
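The limiting case can be illustrated with a toy sketch. Several simplifications here are assumptions made only for illustration: patterns are one-dimensional integer values in the range 0 to 999, a uniform quantizer stands in for a learned codebook, and histogram overlap is used as a simple similarity measure. The two value lists are hypothetical and represent two images of the same texture with slightly different measurements.

```python
def histogram(values, n_codeblocks, value_range=1000):
    """Normalized histogram using n_codeblocks uniform bins as a stand-in codebook."""
    width = value_range // n_codeblocks
    hist = [0.0] * n_codeblocks
    for v in values:
        hist[v // width] += 1 / len(values)
    return hist

def overlap(h1, h2):
    """Histogram overlap: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical images of the same texture, differing by small variations.
image_a = [100, 120, 500, 520]
image_b = [110, 130, 510, 530]

for size in (2, 10, 100):
    sim = overlap(histogram(image_a, size), histogram(image_b, size))
    print(size, sim)
# 2 1.0
# 10 1.0
# 100 0.0
```

With 2 or 10 codeblocks the two images receive identical histograms, but with 100 codeblocks every value falls into its own bin and the two images no longer share any codeblock, so their similarity drops to zero.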
The nature of the descriptor is also a determining factor in this behaviour
since the performance of the SIFT points decreases faster than the performance
of raw blocks. This suggests that a SIFT-based codebook requires fewer codeblocks
to express all different patterns in the image collection, which is consistent with
the rotation and scale invariance properties of that descriptor. On the other
[Figure 3: line plot with two series, SIFT and Blocks; x-axis: Codebook Size (0 to 1000); y-axis: F-Measure (0 to 0.25)]
Fig. 3. Codebook size vs. F-Measure for two bag of features representations using SIFT
points and Blocks.
Acknowledgments
This work has been partially supported by Ministerio de Educación Nacional de
Colombia grant 1101393199, according to the COLCIENCIAS call 393-2006 to
support research projects using the RENATA network.
References
1. Bosch, A., Muñoz, X., Martí, R.: Which is the best way to organize/classify images
by content? Image and Vision Computing 25(6) (June 2007) 778–791
2. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization
with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision.
(2004)
3. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching
in videos. (2003) 1470–1477 vol.2
4. Tommasi, T., Orabona, F., Caputo, B.: CLEF2007 Image annotation task: An SVM-based cue
integration approach. In: Working Notes of the 2007 CLEF Workshop, Budapest,
Hungary (2007)
5. Iakovidis, D.K., Pelekis, N., Kotsifakos, E.E., Kopanakis, I., Karanikas, H.,
Theodoridis, Y.: A pattern similarity scheme for medical image retrieval. In-
formation Technology in Biomedicine, IEEE Transactions on (2008)
6. Long, L.R., Antani, S.K., Thoma, G.R.: Image informatics at a national research
center. Computerized Medical Imaging and Graphics 29 (2005) 171–193
7. Guld, M.O., Keysers, D., Deselaers, T., Leisten, M., Schubert, H., Ney, H.,
Lehmann, T.M.: Comparison of global features for categorization of medical im-
ages. Medical Imaging 5371 (2004) 211–222
8. Deselaers, T., Keysers, D., Ney, H.: FIRE - Flexible Image Retrieval Engine:
imageCLEF 2004 evaluation. In: CLEF 2004, LNCS 3491. (2004) 688–698
9. Datar, M., Padfield, D., Cline, H.: Color and texture based segmentation of molec-
ular pathology images using hsoms. In: Biomedical Imaging: From Nano to Macro,
2008. ISBI 2008. 5th IEEE International Symposium on. (2008) 292–295
10. Comaniciu, D., Meer, P., Foran, D.: Shape-based image indexing and retrieval
for diagnostic pathology. In: Pattern Recognition, 1998. Proceedings. Fourteenth
International Conference on. Volume 1. (1998) 902–904 vol.1
11. Caicedo, J.C., Gonzalez, F.A., Romero, E.: A semantic content-based retrieval
method for histopathology images. Information Retrieval Technology LNCS 4993
(2008) 51–60
12. Zheng, L., Wetzel, A.W., Gilbertson, J., Becich, M.J.: Design and analysis of
a content-based pathology image retrieval system. Information Technology in
Biomedicine, IEEE Transactions on 7(4) (2003) 249–255
13. Lam, R.W.K., Ip, H.H.S., Cheung, K.K.T., Tang, L.H.Y., Hanka, R.: A multi-
window approach to classify histological features. In: Pattern Recognition, 2000.
Proceedings. 15th International Conference on. Volume 2. (2000) 259–262 vol.2
14. Tang, H.L., Hanka, R., Ip, H.H.S.: Histological image retrieval based on semantic
content analysis. Information Technology in Biomedicine, IEEE Transactions on
7(1) (2003) 26–36
15. Fletcher, C.D.M.: Diagnostic Histopathology of tumors. Elsevier Science (2003)
16. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image
classification. (2006) 490–503
17. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna-
tional Journal of Computer Vision 60(2) (November 2004) 91–110
18. Li, F.F., Perona, P.: A bayesian hierarchical model for learning natural scene cate-
gories. In: CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2, Washington,
DC, USA, IEEE Computer Society (2005) 524–531
19. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press (2004)