Using Grayscale Images For Object Recognition With Convolutional-Recursive Neural Network
Hieu Minh Bui 1,2, Margaret Lech 1, Eva Cheng 1, Katrina Neville 1, Ian S. Burnett 3

1 School of Engineering, RMIT University, Melbourne, Australia
2 Centre of Technology, RMIT University Vietnam, Ho Chi Minh City, Vietnam
3 Faculty of Engineering & Information Technology, University of Technology Sydney, Sydney, Australia
Abstract—There is a common tendency in object recognition research to accumulate large volumes of image features to improve performance. However, whether using more information contributes to higher accuracy is still controversial given the increased computational cost. This work investigates the performance of grayscale images compared to their RGB counterparts for visual object classification. A comparison between object recognition based on RGB images and on RGB images converted to grayscale was conducted using a cascaded CNN-RNN neural network structure, and compared with other commonly used classifiers such as Random Forest, SVM and SP+HMP. Experimental results showed that classification with grayscale images resulted in higher accuracy than with RGB images across the different types of classifiers. Results also demonstrated that utilizing a small receptive field CNN and edgy feature selection on grayscale images can result in higher classification accuracy with the advantage of reduced computational cost.

Keywords—object recognition; convolutional neural network; image classification; machine learning

I. INTRODUCTION

Object recognition is one of the most challenging areas in computer vision research. Alongside developments in image capture devices, more high-quality data has become available to aid and improve recognition systems. In the early days of object recognition, grayscale images were used quite successfully with manually selected features such as HoG [2], SIFT [4] and SURF [6].

The introduction of color images led to a larger range of enhanced features [8],[9] taking advantage of the new color-coded information. However, color-coded information has been shown to be susceptible to noise, lighting conditions, and the quality of the capturing devices [10].

Similarly, with the recent wide availability of low-cost depth sensors, researchers have gained yet another modality that can be incorporated into an object recognition system to improve recognition accuracy [1]. While depth images may provide extra cues to improve recognition performance, most depth sensors use infra-red projections, which are strongly affected by lighting conditions and the physical properties of the object.

With dataset sizes becoming larger whilst labelled data is scarce, unsupervised feature learning is commonly applied. A popular approach is to slice the image into patches, and learn the sparse representation of those patches using various methods such as sparse auto-encoders [11], restricted Boltzmann machines [12], and k-means clustering [13]. Jarrett et al. proposed a high-level learning hierarchy to provide better features and improve recognition [14]; however, high computational complexity is required due to the many parameters that must be tuned. Coates et al. [15] recently proposed a workaround by using a simple clustering algorithm and a single-layer convolutional neural network to obtain comparable results in both performance and speed.

The past few decades have also witnessed the development of machine learning techniques, bringing significant advancements to all pattern recognition applications. In particular, convolutional neural networks are able to reduce the computational processing power required to run object recognition systems. These techniques have been shown to be capable of solving various computer vision problems, where machine learning successfully replaces manual feature selection to fully automate complex data classification problems. One successful machine learning example is object classification based on the ImageNet dataset [16],[17].

As indicated by recent studies, accumulation of high volumes of information is not always beneficial and in many cases can actually lead to reduced classification performance [1],[3],[5]. Task-dependent selection of relevant features and removal of redundant information can expedite the classification process and improve accuracy.

One example is the use of color information in image-based object recognition. When the recognition task is required to distinguish between instances of the same object class (intra-class classification) that normally differ in color, the use of color images is essential [5]. In contrast, when the objects to be classified are from different classes that vary in shape and texture (extra-class classification), especially when objects from the same class may come in different colors (Fig. 5), the use of color images may be redundant. In many cases, grayscale images that preserve the essential shape and texture information from their original RGB representation may be sufficient to describe classes of objects. In such cases, the use of grayscale images for classification may be more efficient. Current approaches tend to incorporate color information into the descriptor for both intra- and extra-class object recognition [1],[3],[5],[7],[15]; however, there is still a need to examine the role of color in object classification between different classes.
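As a concrete reference point for the grayscale case discussed above, converting an RGB image to grayscale is typically a fixed weighted sum of the three color channels. The sketch below uses the ITU-R BT.601 luminance weights (the convention used by Matlab's rgb2gray); the exact conversion used in this paper is not stated, so these weights are an assumption:

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an (H, W, 3) RGB image to an (H, W) grayscale image using
    the ITU-R BT.601 luminance weights (an assumed convention)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights  # weighted sum over the colour channels

# A one-channel image needs a third of the storage of its RGB source.
rgb = np.zeros((90, 90, 3))
gray = rgb_to_gray(rgb)
assert gray.shape == (90, 90)
assert gray.size * 3 == rgb.size
```

Because the weights sum to one, a uniform RGB image maps to the same uniform intensity, while shape and texture (spatial gradients) are preserved.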
Authorized licensed use limited to: University of Leeds. Downloaded on November 06,2022 at 19:53:24 UTC from IEEE Xplore. Restrictions apply.
Fig. 2 - Block diagram of the proposed system
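As an orientation aid for the block diagram, the cascaded CNN-RNN encoder of [7] can be sketched roughly as follows. This is a minimal NumPy sketch with small illustrative dimensions; the filter bank, the pooled grid size, the tanh nonlinearity and the random weight scaling are assumptions modelled on [7], not the authors' exact implementation:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolve_bank(img, filters):
    """Valid convolution (stride 1) of a grayscale image with a bank of K
    square filters. img: (H, W); filters: (K, f, f) -> (K, H-f+1, W-f+1)."""
    f = filters.shape[1]
    windows = sliding_window_view(img, (f, f))            # (H', W', f, f)
    return np.tensordot(filters, windows, axes=([1, 2], [2, 3]))

def avg_pool(maps, out):
    """Average-pool (K, d, d) feature maps down to (K, out, out); this
    simplified sketch assumes d is a multiple of out."""
    K, d, _ = maps.shape
    s = d // out
    return maps.reshape(K, out, s, out, s).mean(axis=(2, 4))

def random_rnn(grid, rng):
    """One random RNN: repeatedly merge non-overlapping 3x3 blocks of
    K-dimensional vectors into a parent vector tanh(W @ children) until a
    single vector remains. grid: (K, d, d) with d a power of 3."""
    K, d, _ = grid.shape
    W = rng.standard_normal((K, 9 * K)) / np.sqrt(9 * K)  # fixed random weights
    while d > 1:
        d //= 3
        parents = np.empty((K, d, d))
        for i in range(d):
            for j in range(d):
                children = grid[:, 3*i:3*i+3, 3*j:3*j+3]  # (K, 3, 3) block
                parents[:, i, j] = np.tanh(
                    W @ children.transpose(1, 2, 0).reshape(-1))
        grid = parents
    return grid[:, 0, 0]                                  # (K,)

def cnn_rnn_features(img, filters, n_rnns, pooled=27, seed=0):
    """Cascade: convolution -> average pooling -> N random RNNs -> (K, N)."""
    rng = np.random.default_rng(seed)
    grid = avg_pool(convolve_bank(img, filters), pooled)
    return np.stack([random_rnn(grid, rng) for _ in range(n_rnns)], axis=1)

# Illustrative dimensions: a 30x30 image with 4 filters of size 4x4 gives
# 27x27 maps, which each RNN collapses 27 -> 9 -> 3 -> 1 (four tree levels).
rng = np.random.default_rng(0)
feats = cnn_rnn_features(rng.standard_normal((30, 30)),
                         rng.standard_normal((4, 4, 4)), n_rnns=8)
print(feats.shape)  # (4, 8)
```

Each of the N random RNNs collapses the pooled K-channel grid to a single K-dimensional vector; stacking them yields the K × N matrix used as the image descriptor in the sections below.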
Fig. 3 - Performance of RGB and grayscale images versus number of RNNs used
Fig. 4 - Performance of RGB and grayscale images versus number of hidden nodes used (axes: accuracy (%), 50-90; number of hidden nodes, 10-640; curves: GRAY, RGB)
There are N independent RNNs similarly derived; thus the complete final representation for each input image is a K × N matrix. In this work, we tested several values of N from 1 to 128.

C. Classification

In this work, two types of classifiers were investigated: an L-BFGS based [23] softmax classifier, and a simple single-layer feed-forward neural network trained with scaled conjugate gradient [24]. L-BFGS was used to compare with the work in [7]. However, L-BFGS is known to have scalability difficulties with large datasets due to the calculation of the Hessian matrix. Therefore, a simple neural network is also evaluated in this paper to test the extracted features.

IV. EXPERIMENTS AND RESULTS

A. Dataset

To evaluate the proposed object classification approach, the RGB-D dataset from [1] was used for all experiments. The dataset contains 300 object instances, divided into 51 classes. Each object instance was represented by approximately 600 image combinations of RGB, depth and mask data.

As the proposed approach addressed extra-class object recognition, the same data sampling settings as in previous work [7],[25] were used for comparison. For each object instance, one out of five consecutive frames was periodically sampled to be used in experiments. Separation into training and testing was also replicated from previous work with 10 split profiles. Each split profile selected one instance in each class to be used in testing, to ensure that the recognition system is tested with instances not seen in training.

B. Object recognition using L-BFGS optimization

To compare with the related work described in [7], a CNN with receptive field size 9 × 9, stride 1, and RNNs with receptive field size 3 × 3, stride 3, 4 levels was implemented. The CNN input images were scaled to size 148 × 148, and the number of filters used for the CNN was K = 128.

The output of the RNNs was used to train a softmax classifier based on L-BFGS [23]. Fig. 3 shows the recognition accuracy versus the number of RNNs when using the L-BFGS classifier. Fig. 3 clearly illustrates that with only 30 RNNs, the classification performance of the system using grayscale images already surpasses the performance of RGB images, and this performance improvement increases when the number of RNNs exceeds 60.

As grayscale images are represented by only one channel, less memory is required for convolution calculations compared to three-channel RGB images. With the convolution layer effectively a set of matrix multiplications, the number of multiplications required for RGB images is also three times greater. In this work, averaging over 10 empirical measurements of execution time in Matlab, each convolution of one image with 128 filters required about 0.026 seconds for the RGB images, and 0.013 seconds for the corresponding grayscale images. Whilst it is recognized that the measurement of computational time is affected by various external factors, including the background activities of the operating system and the pipeline implementation of CPU instructions, there is an indicative reduction in the computational cost.

Table 1 shows the comparison of this work with other related approaches. Interestingly, CNN-RNN with grayscale images outperformed the SP+HMP approach, which used a number of extra features and is currently a state-of-the-art performer on RGB data. This result may be jointly explained by both the CRNN structure and the color properties of the dataset (Fig. 5). Whilst some objects from different classes exhibit very similar colors, some classes can contain instances of various different colors. In such cases, adding color to the feature space may confuse the classifier.

TABLE 1 - COMPARISON TO OTHER METHODS
(Accuracy = mean ± standard deviation)

Methods                          | Feature used                                                          | Accuracy
Random Forest [1]                | efficient match kernel (EMK), SIFT, texton histogram, color histogram | 74.7±3.6
SVM [3]                          | color, gradients, local binary patterns                               | 77.7±1.9
SP+HMP [5]                       | gray intensity, RGB                                                   | 82.4±3.1
CNN-RNN RGB [7]                  | RGB                                                                   | 80.8±4.2
CNN-RNN GRAY (proposed approach) | gray intensity                                                        | 82.2±2.04
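The single-channel saving described above can be illustrated with a rough timing sketch (NumPy here rather than the Matlab used in the paper, so absolute times will differ from the reported 0.026 s and 0.013 s; `conv_time` and its parameters are illustrative, not from the paper):

```python
import time
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_time(channels, size=148, f=9, k=128, repeats=10, seed=0):
    """Mean time to convolve one size x size image having `channels`
    channels with k filters of spatial size f x f (valid, stride 1)."""
    rng = np.random.default_rng(seed)
    img = rng.standard_normal((size, size, channels))
    filters = rng.standard_normal((k, f, f, channels))
    start = time.perf_counter()
    for _ in range(repeats):
        # Gather all f x f x channels patches, then contract in one shot.
        windows = sliding_window_view(img, (f, f, channels))
        out = np.tensordot(filters, windows, axes=([1, 2, 3], [3, 4, 5]))
    elapsed = (time.perf_counter() - start) / repeats
    return elapsed, out.shape

t_gray, _ = conv_time(channels=1)
t_rgb, _ = conv_time(channels=3)
print(f"gray: {t_gray:.4f} s   rgb: {t_rgb:.4f} s   "
      f"ratio: {t_rgb / t_gray:.1f}x")
```

The RGB case performs three times the multiplications; the measured ratio will vary with the BLAS implementation and memory behaviour (the paper reports roughly a factor of two in Matlab).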
Fig. 5 - Color distributions in the dataset

C. Object recognition using a simple single-layer neural network

As L-BFGS does not scale well with large-sized datasets, the classification performance was also evaluated on a simple single-layer neural network trained with Scaled Conjugate Gradient (SCG). 90% of the training set was used to train the network, with the remaining 10% used for validation.

The network size strongly influenced the recognition performance. Peak performance was reached with only around 90 nodes and a set of randomly initialized weights and biases (Fig. 4). Accuracy provided by this simple neural network was slightly lower than the performance of the L-BFGS approach, but these results also show that using grayscale images produced a higher classification accuracy than using RGB images. It is likely that an optimized version of the network could provide even better results. In addition, SCG has complexity only twice that of traditional gradient descent, so training should be faster than L-BFGS for large datasets [26].

D. Improvement with small receptive field and more edgy features

While almost all images in the dataset are sized around 90 × 90 pixels, it is redundant to upscale the images to 148 × 148 pixels as in [7]. Such upscaling does not add extra information describing the objects. Moreover, the work in [15] showed that a small receptive field CNN provided higher classification accuracy. Thus, a CNN with a smaller receptive field and an input image size of 90 × 90 was investigated in this paper. With this change, the receptive field of the RNN was subsequently resized to 4 × 4 with 3 depth levels.

To further reduce the computational complexity, a simple feature selection scheme using edge density was examined, as indicated in [27]: Canny edge detection was performed on the filter bank resulting from k-means clustering, the output was sorted in decreasing order of the number of edgels (edge pixels) detected, and only a number of the most edgy filters were retained.

TABLE 2 - IMPROVED ACCURACY WITH FEATURE SELECTION ON GRAYSCALE IMAGES

Number of filters from k-means | Number of retained filters | Accuracy
300                            | 300                        | 84.02±2.6
300                            | 128                        | 83.98±2.9
128                            | 128                        | 83.50±2.9

Table 2 shows the results obtained using grayscale images, where it can be seen that the accuracy is significantly improved with the restructured CNN-RNN, outperforming the existing state-of-the-art performer [5] presented in Table 1. Also, edge-based feature selection improved performance in both accuracy and speed. A statistical ANOVA test shows no difference in accuracy when using the 128 most edgy filters selected from 300 k-means output filters, compared to using all of the 300 filters. However, the statistical test shows a difference, with 99% confidence, in accuracy when using the 128 most edgy filters compared to the case of using all 128 filters output from k-means. Thus, a restructured CNN-RNN with selected edgy filters achieved an accuracy comparable to a 300-filter representation at only the computational cost of 128 filters.

The same procedure was also applied to color images, but the resulting accuracy was significantly worse (only 80.78±2.08 for 128 filters).

V. CONCLUSION

The cascaded CNN-RNN neural network structure investigated in this paper was used to encode grayscale images of objects from an RGB-D database into a sparse representation. Two different types of classifiers were used to learn the object models from the training dataset and perform recognition on the testing dataset.

Experimental results showed that object recognition based on grayscale images outperformed recognition based on RGB images. The discriminative power of features extracted from grayscale images was consistently higher than that of the RGB images when using both the L-BFGS and single-layer neural network classifiers.

Moreover, the advantage of using grayscale images lay not only in higher recognition accuracy but also in more efficient processing at the convolutional layer (known to be the most critical bottleneck in a CNN-RNN system). These experimental results suggest that grayscale images may be most reliable for computing features for extra-class object classification. In addition, experiments using feature selection indicated the importance of edges in object description. Experimental results showed that using a smaller number of highly edgy features can provide comparable recognition accuracy at a reduced computational cost. Further exploration of filter selection may help to improve the performance of the system.

REFERENCES

[1] K. Lai, B. Liefeng, R. Xiaofeng, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Robotics and Automation (ICRA), 2011 IEEE International Conference on, 2011, pp. 1817-1824.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2005, pp. 886-893.
[3] L. Bo, X. Ren, and D. Fox, "Depth kernel descriptors for object recognition," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, 2011, pp. 821-826.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International journal of computer vision, vol. 60, pp. 91-110, 2004.
[5] L. Bo, X. Ren, and D. Fox, "Unsupervised feature learning for RGB-D based object recognition," in Experimental Robotics, 2013, pp. 387-402.
[6] H. Bay, T. Tuytelaars, and L. Van Gool, "Surf: Speeded up robust features," in Computer vision–ECCV 2006, ed: Springer, 2006, pp. 404-417.
[7] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, "Convolutional-recursive deep learning for 3d object classification," in Advances in Neural Information Processing Systems, 2012, pp. 665-673.
[8] J. Van de Weijer, T. Gevers, and A. D. Bagdanov, "Boosting color saliency in image feature detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, pp. 150-156, 2006.
[9] G. J. Burghouts and J.-M. Geusebroek, "Performance evaluation of local colour invariants," Computer Vision and Image Understanding, vol. 113, pp. 48-62, 2009.
[10] M. Ebner, Color Constancy. Hoboken, NJ: Wiley, 2007.
[11] C. Poultney, S. Chopra, and Y. L. Cun, "Efficient learning of sparse representations with an energy-based model," in Advances in neural information processing systems, 2006, pp. 1137-1144.
[12] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural computation, vol. 18, pp. 1527-1554, 2006.
[13] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Workshop on statistical learning in computer vision, ECCV, 2004, pp. 1-2.
[14] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?," in Computer Vision, 2009 IEEE 12th International Conference on, 2009, pp. 2146-2153.
[15] A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in International conference on artificial intelligence and statistics, 2011, pp. 215-223.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097-1105.
[17] Q. V. Le, "Building high-level features using large scale unsupervised learning," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013, pp. 8595-8598.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998.
[19] B. B. Le Cun, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in neural information processing systems, 1990.
[20] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, pp. 1798-1828, 2013.
[21] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, "Parsing natural scenes and natural language with recursive neural networks," in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 129-136.
[22] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural networks, vol. 13, pp. 411-430, 2000.
[23] M. Schmidt, "minFunc: unconstrained differentiable multivariate optimization in Matlab," 2012.
[24] M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural networks, vol. 6, pp. 525-533, 1993.
[25] A. E. Johnson, "Spin-images: a representation for 3-D surface matching," Citeseer, 1997.
[26] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010, ed: Springer, 2010, pp. 177-186.
[27] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, pp. 2189-2202, 2012.