Batik Image Retrieval Using Convolutional Neural Network
3010~3018
ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018
DOI: 10.12928/TELKOMNIKA.v17i6.12701
Abstract
This paper presents a simple technique for Batik image retrieval based on
the Convolutional Neural Network (CNN) approach. Two CNN models, one trained with supervised
learning and one with unsupervised learning, are considered for end-to-end feature extraction
that describes the content of a Batik image. A distance metric measures the similarity between
the query image and the target images in the database, based on the features generated by
the CNN architecture. As reported in the experimental section, the proposed supervised CNN
model achieves better performance than the unsupervised CNN in the Batik image retrieval
system. In addition, the image feature produced by the proposed CNN model yields better
performance than handcrafted feature descriptors. These results demonstrate the superior
performance of the deep learning-based approach in the Batik image retrieval system.
1. Introduction
The brain is an amazing organ of the human body. With our brains, we can understand
what we see, smell, taste, hear and touch. An infant's brain weighs only about half a kilogram,
yet it can solve problems that even supercomputers cannot. Within several months of birth,
a baby can recognize the faces of its parents, discern discrete objects from the background,
and begin to speak. Within one year, the baby has an intuition about natural objects, can follow
objects, and understands the meaning of sounds. As children, they can understand
grammar and have thousands of words in their vocabulary.
Building machines with intelligence like that of our brains is not easy. To make machines
with artificial intelligence, we have to solve very complex computational problems that we have
long struggled with, problems that our brains can solve in a matter of seconds. To overcome
this, we have to adopt different ways of programming computers, ways that have come into use
during this decade. This need has given rise to an active field of artificial intelligence
commonly called deep learning [1].
Artificial intelligence (AI) has recently undergone very rapid development and has been
used in many fields of research. In computer vision, Content-Based Image Retrieval
(CBIR) has been developed in multi-level schemes ranging from low-level to high-level features.
The Convolutional Neural Network (CNN) has been used successfully as an effective feature
descriptor that gives accurate results. In general, the features obtained by deep learning
methods are trained to mimic human perception through operations such as convolution and
pooling. Deep learning features have thus become better descriptors than low-level features.
Although the CNN has become the state of the art in computer vision, this does not
guarantee that the features obtained from the highest layer always give the best performance [2].
A Content-Based Image Retrieval system aims to provide an effective way of browsing,
searching and retrieving desired images stored in an image database. The image database
contains many images that have been stored and arranged on a storage device. The database
is usually very large, so searching for specific images manually takes a lot of time and is
inconvenient for the user. Batik is one example. Batik is a cultural heritage of the Indonesian
archipelago that has high value as a blend of art, laden with philosophical meanings
and meaningful symbols that show the way of thinking of the people who make it. Batik is a craft
that has long been a part of Indonesian culture, especially Javanese culture. Batik has many
motifs, patterns and colors, so retrieving a specific Batik image from a database is very
challenging [3].
Received March 18, 2019; Revised July 2, 2019; Accepted July 18, 2019
This paper proposes using convolutional neural networks to carry out
the CBIR task and solve the problems that occur in retrieving Batik images. The method is
intended to produce effective image descriptors from the CNN architecture. These feature
descriptors are very important for content-based retrieval systems. The image feature is used
to improve the performance of, and solve problems in, existing Batik retrieval systems.
where 𝑘 and 𝑠 denote kernel size and stride, respectively. The function 𝑓𝑘𝑠 is determined by
the layer type: a matrix multiplication for convolutional layers, a spatial max for max-pooling
layers, an elementwise nonlinearity for activation functions, and so on for other types of layers.
This functional form is maintained under composition, with the kernel size and stride following
a transformation rule.
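The transformation rule for kernel size and stride is not spelled out in the text above. The sketch below assumes the standard receptive-field rule from the fully convolutional network literature: a layer with kernel 𝑘2 and stride 𝑠2 applied after a layer with kernel 𝑘1 and stride 𝑠1 behaves like a single layer with kernel 𝑘1 + (𝑘2 − 1)𝑠1 and stride 𝑠1𝑠2. The function names and the example layer stack (3 × 3 convolution followed by 2 × 2 max pooling) are hypothetical and are not taken from the architecture table of this paper.

```python
from functools import reduce


def compose(first, second):
    """Combine two stacked layers into one equivalent (kernel, stride) pair.

    `first` acts on the input, `second` acts on the output of `first`.
    Assumed rule: kernel = k1 + (k2 - 1) * s1, stride = s1 * s2.
    """
    k1, s1 = first
    k2, s2 = second
    return (k1 + (k2 - 1) * s1, s1 * s2)


def compose_all(layers):
    """Fold a list of (kernel, stride) pairs, listed from the input side."""
    return reduce(compose, layers, (1, 1))  # (1, 1) acts as the identity layer


# Hypothetical block: 3x3 convolution (stride 1), then 2x2 max pooling (stride 2).
block = [(3, 1), (2, 2)]
kernel, stride = compose_all(block * 6)  # six such blocks, as an assumed example
print("equivalent kernel:", kernel, "equivalent stride:", stride)
```

Under these assumptions the six-block stack has an overall stride of 64, which agrees with the reduction from 128 × 128 to 2 × 2 reported for the supervised model in Section 3.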
While a general network computes a general nonlinear function, a network with only
layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional
network (FCN). An FCN naturally operates on an input of any size and produces an output of
the corresponding spatial dimensions. A real-valued loss function composed with an FCN defines
a task. If the loss function is a sum over the spatial dimensions of the final layer,
𝑙(𝑥; 𝜃) = ∑𝑖𝑗 𝑙′(𝑥𝑖𝑗; 𝜃), its parameter gradient is the sum over the parameter gradients of each
of its spatial components. Thus stochastic gradient descent on 𝑙 computed on whole images is
the same as stochastic gradient descent on 𝑙′, taking all of the final-layer receptive fields as
a minibatch. When the computation over these receptive fields is repeated, both forward and
backward propagation are more efficient if computed layer by layer over the whole image than
patch by patch over parts of the image. An illustration of a CNN operation can be
seen in Figure 1.
The proposed CNN model constructs the feature descriptor from a Batik image. This
feature descriptor is used to measure the similarity between the query image and the target
images in the database under the K-Nearest Neighbors (KNN) [11] strategy. The KNN technique
performs similarity matching using a distance score as the criterion. This paper investigates
two CNN models in the training stage, i.e. with supervised and unsupervised learning
approaches. Figure 1 illustrates an example of the proposed supervised CNN architecture for
Batik image retrieval. The term supervised refers to the use of class labels, whereas
the unsupervised approach ignores the image labels in the training process. The autoencoder is
a simple example of an unsupervised CNN method: it compresses the data features into a smaller
size and then recovers the original data [12].
3. Method
This section presents two methods for generating the feature descriptor in the Batik
image retrieval system. We first explain the supervised CNN model. The unsupervised
Convolutional Auto-Encoder (CAE) model [13] is then described.
After six convolution and max-pooling operations, an input image of size
128 × 128 × 3 is converted into a new representation of dimensionality 2 × 2 × 256. This new
representation is then flattened into one-dimensional data of size 1 × 1 × 1024. The
flattened data is subsequently processed and trained with the Multi-Layer Perceptron (MLP).
Herein, the MLP receives the 1024 input features and feeds them into 1024 input neurons.
The hidden and output layers have 256 and 97 neurons, respectively. The value of 97 in
the output layer equals the number of target classes, i.e. the number of Batik image classes
used in the proposed image retrieval system.
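The size bookkeeping in this paragraph can be checked with a few lines of arithmetic. The sketch below is only illustrative. It assumes that each of the six blocks halves the spatial size through 2 × 2 pooling while the convolutions preserve it, and that the MLP is a plain fully connected stack with biases. Neither detail is stated explicitly here, and the helper names are hypothetical.

```python
def pooled_side(side, num_blocks, pool=2):
    """Spatial side length after `num_blocks` pooling steps (assumed factor `pool`)."""
    for _ in range(num_blocks):
        side //= pool
    return side


def dense_parameters(layer_sizes):
    """Weights plus biases of a fully connected stack (assumed layout)."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))


side = pooled_side(128, num_blocks=6)  # 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2
flat = side * side * 256               # length of the flattened representation
mlp = [flat, 256, 97]                  # input, hidden and output layer sizes
print("side:", side, "flattened:", flat, "MLP parameters:", dense_parameters(mlp))
```

The flattened length of 1024 and the 97 output neurons match the values given in the text. The parameter count depends on the assumed MLP layout.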
where 𝑠𝑓 denotes the nonlinear activation function on the encoder side. The CAE simply
performs a linear operation if the identity function is used for 𝑠𝑓. 𝑊 and 𝑏𝑋 ∈ 𝑅𝑛 are
the encoder parameters, referred to as the weight matrix and bias vector, respectively. In
contrast, the decoder reconstructs 𝑋′ from the representation 𝑌 by means of a function 𝑔.
This process can be written as:
$$X' = g(Y) = s_g(W'Y + b_Y) \tag{4}$$
where 𝑠𝑔 represents the activation function on the decoder side. 𝑏𝑌 and 𝑊′ are the bias vector
and weight matrix, respectively, which form the decoder parameters.
Strictly speaking, the CAE model searches for the globally optimal or near-optimal
parameters 𝜃 = (𝑊, 𝑏𝑋, 𝑏𝑌) in the training process. This task is equivalent to minimizing
the loss function over the whole dataset 𝑋 under the following objective function:
where 𝐿(∙,∙) denotes the auto-encoder loss function. In this paper, we simply use
the 𝐿2 reconstruction loss, commonly referred to as the Mean Squared Error (MSE) [18].
This loss function is formally defined as:
$$L_2(\theta) = \sum_{i=1}^{n} \| x_i - x_i' \|^2 = \sum_{i=1}^{n} \| x_i - g(f(x_i)) \|^2 \tag{6}$$
where 𝑥𝑖 ∈ 𝑋, 𝑥𝑖′ ∈ 𝑋′ and 𝑦𝑖 ∈ 𝑌 denote the original input data, the reconstructed data,
and the new compact representation of the input data, respectively.
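To make the encoder, decoder and loss definitions concrete, a minimal sketch on plain vectors is given below. It assumes that the encoder mirrors the decoder of (4), i.e. 𝑓(𝑥) = 𝑠𝑓(𝑊𝑥 + 𝑏𝑋), and that the decoder weight is tied to the encoder weight as 𝑊′ = 𝑊ᵀ, which is one common choice and is consistent with 𝜃 containing a single weight matrix. Both points are assumptions rather than statements from this paper. The code works on fully connected vectors instead of convolutional feature maps, and all function names are hypothetical.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def identity(z):
    return z


def encode(x, W, b_x, s_f=identity):
    """y = s_f(W x + b_X), the assumed encoder form."""
    return [s_f(z + b) for z, b in zip(matvec(W, x), b_x)]


def decode(y, W_dec, b_y, s_g=identity):
    """x' = s_g(W' y + b_Y), as in (4)."""
    return [s_g(z + b) for z, b in zip(matvec(W_dec, y), b_y)]


def l2_loss(X, W, b_x, b_y, s_f=identity, s_g=identity):
    """Sum of squared reconstruction errors over the dataset, as in (6)."""
    W_dec = transpose(W)  # assumption: tied weights, W' = W^T
    total = 0.0
    for x in X:
        x_rec = decode(encode(x, W, b_x, s_f), W_dec, b_y, s_g)
        total += sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return total


# Toy data: two-dimensional inputs compressed to a one-dimensional code.
X = [[3.0, 4.0], [1.0, 0.0]]
W = [[1.0, 0.0]]  # keeps the first coordinate only
print(l2_loss(X, W, b_x=[0.0], b_y=[0.0, 0.0]))
```

With identity activations the model is linear, as noted above, and the toy code discards the second coordinate, so the loss equals the squared magnitude of what was discarded.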
In this paper, the CAE architecture is built with four encoding blocks and four
decoding stages, forming a stacked Convolutional Auto-Encoder.
A summary of the CAE architecture used in this paper is given in Table 2. Suppose that
the input image is of size 128 × 128 × 3. As can be inferred from Table 2, this image is convolved
four times to obtain a new, simpler and more compact representation. This process can also be
regarded as repeated encoding. Herein, the new representation is regarded as a neural code
with dimensionality 4 × 4 × 128. By applying the decoding process in the reverse direction, this
neural code can be recovered to yield a reconstructed image of the original size
128 × 128 × 3. This reverse process performs the deconvolution and unpooling operations.
The CAE neural code can be further used as the feature descriptor in the proposed Batik
image retrieval system.
4. Experimental Study
Extensive experiments were carried out to examine the performance of the proposed
method in the Batik image retrieval system. First, we give a brief description
of the image dataset used in the experiments. The effectiveness of the proposed method is
then observed by visual investigation. Finally, objective performance
comparisons are made to examine the effect of different distance metrics and
the superiority of the proposed method over former competing schemes.
4.1. Dataset
This experiment uses a set of Batik images, referred to as the Batik image dataset,
covering various patterns, colors, and motifs. The database consists of 1552 images
divided into 97 image classes. Each class contains 16 images that are similar in motif and
content appearance, and all images belonging to the same class are regarded as similar
images. Figure 2 gives several examples of Batik images from the dataset.
$$p_i(n) = \frac{RV}{n} \tag{7}$$
$$r_i(n) = \frac{RV}{M} \tag{8}$$
where 𝑝𝑖(𝑛) and 𝑟𝑖(𝑛) denote the precision and recall rates, respectively, when image 𝑖 is used
as the query image. The symbols 𝑛 and 𝑀 represent the number of retrieved images and the total
number of images in the database that are relevant to image 𝑖, respectively. 𝑅𝑉 is the number
of images relevant to query image 𝑖 among the 𝑛 retrieved images.
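As an illustration of (7) and (8), the following sketch computes both rates for a single query from the class labels of the retrieved images. The function name and the toy labels are hypothetical. Relevance is taken to mean that an image has the same class label as the query, in line with the dataset description.

```python
def precision_recall(query_label, retrieved_labels, num_relevant):
    """Return (p_i(n), r_i(n)) for one query.

    retrieved_labels: class labels of the n retrieved images.
    num_relevant: M, the number of database images relevant to the query.
    """
    n = len(retrieved_labels)
    rv = sum(1 for label in retrieved_labels if label == query_label)  # RV
    return rv / n, rv / num_relevant


# Toy query of class 5: four images retrieved, three of them from class 5,
# with M = 16 relevant images in the database.
p, r = precision_recall(5, [5, 5, 3, 5], num_relevant=16)
print(p, r)  # 0.75 0.1875
```

When 𝑛 = 𝑀, as in the setting 𝑛 = 16 with 16 images per class, the two rates coincide.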
Figure 4 shows the performance comparison over various distance metrics in terms of
Precision and Recall scores. All images in the database are used in turn as the query image.
The number of retrieved images is set as 𝑛 = {1, 2, …, 16}. In most cases, the Bray-Curtis
distance yields the best retrieval performance among the distance metrics for both the CNN
and CAE image features. In the Batik image retrieval system, the Bray-Curtis distance is
therefore a good candidate for measuring the similarity between the query and target images
in the database.
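The three distance metrics compared here can be sketched as follows, together with a simple nearest-neighbor ranking. The formulas are the usual textbook definitions; in particular the Bray-Curtis distance is taken as ∑|𝑢 − 𝑣| / ∑|𝑢 + 𝑣|, which is assumed to match the definition used in the experiments. The `retrieve` helper and the toy feature vectors are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def bray_curtis(u, v):
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(abs(a + b) for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def retrieve(query, database, n, distance=bray_curtis):
    """Indices of the n database features closest to the query feature."""
    order = sorted(range(len(database)), key=lambda i: distance(query, database[i]))
    return order[:n]


# Toy three-dimensional features; real features would have 97 dimensions.
database = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.8, 0.2, 0.0]]
print(retrieve([1.0, 0.0, 0.0], database, n=2))  # [0, 2]
```

A brute-force sort is used for clarity. For a database of 1552 images this is inexpensive, although a dedicated KNN index may be preferable for larger collections.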
Table 3 tabulates a more complete comparison for the proposed image retrieval system
using CNN and CAE features over various distance metrics. The comparison is evaluated in
terms of average recall rate with the number of retrieved images set to 𝑛 = 16. Herein, all
images in the database are used in turn as the query image. As reported in this table,
the proposed method with the supervised CNN delivers better performance than the CAE
technique. The image feature obtained from the proposed supervised CNN method is more
suitable for the Batik image retrieval task.
$$\mathrm{APR} = \frac{1}{N} \sum_{i=1}^{N} r_i(n) \tag{9}$$
where 𝑟𝑖(𝑛) and 𝑁 are the recall rate for query image 𝑖 and the total number of images in
the database, respectively. Herein, all images in the database are used in turn as the query
image, so that 𝑁 = 1552. Thus, the APR value is averaged over all query images. The number of
retrieved images is set to 𝑛 = 16. To make a fair comparison, this experiment also
examines the dimensionality of the image feature.
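The averaging in (9) can be sketched as below. The function assumes that every image is used once as the query, that the query itself remains in the database, and that every class has the same number of relevant images, as in the 16-images-per-class dataset. The function name, the toy labels and the toy rankings are hypothetical.

```python
def average_recall(labels, rankings, n, per_class):
    """APR as in (9): the mean of r_i(n) over all query images.

    labels[i]: class label of database image i.
    rankings[i]: database indices sorted by increasing distance to query i.
    per_class: M, the number of relevant images per query (assumed constant).
    """
    total = 0.0
    for i, ranked in enumerate(rankings):
        rv = sum(1 for j in ranked[:n] if labels[j] == labels[i])
        total += rv / per_class
    return total / len(rankings)


# Toy database with two classes of two images each.
labels = [0, 0, 1, 1]
rankings = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 0, 3, 1], [3, 2, 1, 0]]
print(average_recall(labels, rankings, n=2, per_class=2))  # 0.875
```

In the toy example three queries retrieve both relevant images and one query retrieves only one, giving (1 + 1 + 0.5 + 1) / 4 = 0.875.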
Table 4 reports the performance comparison in terms of feature dimensionality and APR
value. As shown in this table, the proposed supervised CNN yields the best performance in
comparison with the other competing schemes. It is noteworthy that the proposed method
requires the lowest feature dimensionality, with the exception of the LBP [24] scheme.
A lower dimensionality means a faster KNN search, which makes the Batik image retrieval
system more efficient. Thus, the proposed method can be considered for implementing
a Batik image retrieval and classification system.
Table 3. APR of CNN and CAE
Method    Euclidean    Manhattan    Bray-Curtis
CNN       0.9938       0.9931       0.9947
CAE       0.6737       0.6387       0.7654

Table 4. APR Comparison with Former Methods
Method                      Feature Size    APR (%)
LBP [24]                    59              92.57
LTP [25]                    118             95.65
CLBP [26]                   118             95.17
LDP [27]                    236             93.52
Gabor Filter [28]           144             96.55
ODBTC+PSO [3]               384             97.68
Proposed Supervised CNN     97              99.47
5. Conclusions
A new content-based image retrieval system has been presented in this paper. This
system achieves retrieval accuracies of 99.47% and 76.54% on the Batik image database when
the image feature is constructed from the CNN and CAE deep learning-based architectures,
respectively. The CNN outperforms the former existing schemes in terms of retrieval accuracy.
In addition, it requires a feature dimensionality of only 97, the lowest among the compared
methods except LBP [24]. For future work, the CAE model can be modified slightly by
adding fully-connected layers before and after the neural code section. This modification may
reduce the dimensionality of the image feature while also improving the performance of
Batik image retrieval.
References
[1] Johnson MH. The neural basis of cognitive development. In: Damon W. Editor. Handbook of
child psychology: Cognition, perception, and language. Hoboken: John Wiley & Sons Inc. 1998: 1-49.
[2] Liu P, et al. Fusion of deep learning and compressed domain features for content-based image
retrieval. IEEE Transactions on Image Processing. 2017; 26(12): 5706-5717.
[3] Prasetyo H, et al. Batik Image Retrieval Using ODBTC Feature and Particle Swarm Optimization.
Journal of Telecommunication, Electronic Computer Engineering. 2018; 10(2-4): 71-74.
[4] Datta R, Li J, Wang JZ. Content-based image retrieval: approaches and trends of the new age.
Proceedings of the 7th ACM SIGMM international workshop on Multimedia information retrieval. 2005.
[5] Eakins JP, Graham ME. Content based image retrieval: A report to the JISC technology applications
programme. 1999.
[6] Russakovsky O, et al. Imagenet large scale visual recognition challenge. International Journal of
Computer Vision. 2015; 115(3): 211-252.
[7] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural
networks. Advances in neural information processing systems. 2012: 1097-1105.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv. 2014.
[9] Szegedy C, et al. Going deeper with convolutions. Proceedings of the IEEE conference on computer
vision and pattern recognition. 2015: 1-9.
[10] He K, et al. Deep residual learning for image recognition. Proceedings of the IEEE conference on
computer vision and pattern recognition. 2016: 770-778.
[11] Cover T, Hart P. Nearest neighbor pattern classification. IEEE transactions on information theory.
1967; 13(1): 21-27.
[12] Petscharnig S, Lux M, Chatzichristofis S. Dimensionality reduction for image features using deep
learning and autoencoders. Proceedings of the 15th International Workshop on Content-Based
Multimedia Indexing, ACM. 2017.
[13] Masci J, et al. Stacked convolutional auto-encoders for hierarchical feature extraction. International
Conference on Artificial Neural Networks. 2011: 52-59.
[14] Wang R, et al. A Crop Pests Image Classification Algorithm Based on Deep Convolutional
Neural Network. TELKOMNIKA Telecommunication Computing Electronics and Control. 2017;
15(3): 1239-1246.
[15] Baharin A, Abdullah A, Yousoff SNM. Prediction of Bioprocess Production Using Deep Neural
Network Method. TELKOMNIKA Telecommunication Computing Electronics and Control. 2017;
15(2): 805-813.
[16] Sudiatmika IBK, Rahman F, Trisno T, Suyoto S. Image forgery detection using error level analysis
and deep learning. TELKOMNIKA Telecommunication Computing Electronics and Control. 2019;
17(2): 653-659.
[17] Setiawan W, Utoyo MI, Rulaningtyas R. Classification of neovascularization using convolutional
neural network model. TELKOMNIKA Telecommunication Computing Electronics and Control. 2019;
17(1): 463-472.
[18] Meng Q, et al. Relational autoencoder for feature extraction. 2017 International Joint Conference on
Neural Networks (IJCNN). 2017: 364-371.
[19] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv. 2014.
[20] Hagan MT, Menhaj MB. Training feedforward networks with the Marquardt algorithm. IEEE
transactions on Neural Networks. 1994; 5(6): 989-993.
[21] Danielsson PE. Euclidean distance mapping. Computer Graphics image processing. 1980;
14(3): 227-248.
[22] Craw S. Manhattan distance. In: Sammut C, Webb GI. Encyclopedia of Machine Learning and Data
Mining. Springer. 2017: 790-791.
[23] Kokare M, Chatterji B, Biswas P. Comparison of similarity metrics for texture image retrieval.
TENCON 2003. IEEE, Conference on Convergent Technologies for the Asia-Pacific Region. 2003; 2:
571-575.
[24] Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture
classification with local binary patterns. IEEE Transactions on pattern analysis machine intelligence.
2002; 24(7): 971-987.
[25] Tan X, Triggs B. Enhanced local texture feature sets for face recognition under difficult lighting
conditions. IEEE transactions on image processing. 2010; 19(6): 1635-1650.
[26] Guo Z, Zhang L, Zhang D. A completed modeling of local binary pattern operator for texture
classification. IEEE Transactions on Image Processing. 2010; 19(6): 1657-1663.
[27] Zhang B, et al. Local derivative pattern versus local binary pattern: face recognition with high-order
local pattern descriptor. IEEE transactions on image processing. 2010; 19(2): 533-544.
[28] Prasetyo H, Wiranto W, Winarno W. Statistical Modeling of Gabor Filtered Magnitude for Batik Image
Retrieval. Journal of Telecommunication, Electronic Computer Engineering. 2018; 10(2-4): 85-89.