Sparse coded spatial pyramid matching and multi-kernel
integrated SVM for non-linear scene classification
Support vector machine (SVM) techniques and deep learning have been prevalent in object classification for many years. However, deep learning is computation-intensive and can require a long training time. SVM is significantly faster than a Convolutional Neural Network (CNN), but its application to mid-size datasets is limited because it requires proper tuning. Recently, the parameterization of multiple kernels has shown greater flexibility in the characterization of the dataset. Therefore, this paper proposes a sparse coded multi-scale approach that reduces the training complexity and the tuning of the SVM by using a non-linear fusion of kernels for large-class natural scene classification. The optimum features are obtained by parameterizing the dictionary, the Scale Invariant Feature Transform (SIFT) parameters, and the fusion of multiple kernels. Experiments were conducted on large datasets to examine the ability of the multi-kernel space to find distinct features for better classification. The proposed approach is found to be more promising than linear multi-kernel SVM approaches, achieving a maximum accuracy of 91.12 %.
Keywords: multiple kernel learning, support vector machine, classification, SIFT, spatial pyramid matching (SPM)
1 Department of Electronics and Communication Engineering, Indus University, Rancharda, Ahmedabad, 382115, India, 2 Department
of Electrical Engineering, Prince Mohammad Bin Fahd University, PO Box 1664, Al Khobar 31952, Saudi Arabia, ∗ corresponding author
[email protected]
Fig. 1. Block diagram of the proposed approach: training and testing image samples undergo tunable SIFT feature extraction; an offline codebook is learned with KSVD; the features are sparse coded and spatially pooled, and the training labels drive the classifier learning
and investigates the performance by tuning the parameters. In the proposed algorithm, the conversion of the vector-quantized SIFT features to the sparse code is

\min_{U,V} \sum_{m=1}^{M} \| x_m - u_m V \|_2^2 + \lambda |u_m| ,   (1)

subject to \| v_k \| \le 1, k = 1, 2, \dots, K, where V is an N \times K over-complete (K > N) dictionary and u_m is the sparse coefficient vector for the signal x_m. Here, an L2-norm on V and an L1-norm on u_m are typically applied with the regularization parameter \lambda. The problem in (1) is not convex in U and V simultaneously, but it is convex in each of them when the other is fixed, so it is solved by alternating the optimization over U and V for a fixed number of iterations. Fixing the codebook V, (1) can be solved as a linear regression problem with L1-norm regularization on the sparse coefficients,

\min_{u_m} \| x_m - u_m V \|_2^2 + \lambda |u_m| .   (2)

Fixing U, the same problem is transformed into a least-squares problem with quadratic constraints,

\min_{V} \| X - U V \|_F^2 ,   (3)

subject to \| v_k \| \le 1, \forall k = 1, 2, \dots, K, which the Lagrange dual [16] can deal with cleanly.

The ScSPM feature is computed by the histogram pooling method

z = \frac{1}{M} \sum_{m=1}^{M} u_m ,   (4)

where the max pooling function F is applied on each column of the absolute sparse code U as

z_j = \max \{ |u_{1j}|, |u_{2j}|, \dots, |u_{Mj}| \} ,   (6)

where z_j is the j-th element of z, u_{ij} is the element in the i-th row and j-th column of U, and M is the number of local descriptors in the region.

Multiclass classification is usually decomposed into a group of binary {+1, -1} problems, which readily accommodates the standard SVM approaches one-versus-rest and one-versus-one. In this experiment, we have implemented the SVM proposed in [17], which solves the following convex optimization problem using kernel weights d_m,

J(d) = \sum_{p \in P} J_p(d) ,   (7)

where P is the set of all pairs to be considered and J_p(d) is the binary SVM objective value, shown in (8), for the classification problem pertaining to pair p,

J(d) = \max_{\alpha} \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \sum_{m} d_m K_m(z_i, z_j) \quad \text{with} \quad 0 \le \alpha_i \le \frac{1}{\nu l} \;\; \forall i , \quad \sum_{i} \alpha_i = 1 ,   (8)

where \alpha_i is the Lagrange multiplier. The gradient of the objective in (8) can be found as

\frac{\partial J}{\partial d_m} = -\frac{1}{2} \sum_{p \in P} \sum_{i,j} \alpha^{*}_{i,p} \alpha^{*}_{j,p} y_i y_j K_m(z_i, z_j) \quad \forall m .   (9)
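Before turning to the experiments, two short sketches may help make the formulation concrete. The first one illustrates the feature side, namely the sparse coding step (2) and the max pooling step (6). It is an illustrative sketch rather than the authors' implementation: it assumes a codebook that has already been learned offline (eg with KSVD, as in the experiments below), stores it in scikit-learn's (K, N) row convention, and relies on sparse_encode, whose internal scaling of the L1 penalty may differ slightly from (2). The names descriptors, codebook, regions and V are placeholders.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def scspm_region_feature(descriptors, codebook, lam=0.15):
    """Sparse code the local descriptors of one pyramid region and max pool them.

    descriptors : (M, N) array, M local SIFT descriptors of dimension N
    codebook    : (K, N) over-complete dictionary (K > N), assumed learned offline (eg KSVD)
    lam         : L1 regularization weight, the lambda of (2)
    """
    # Solve (2) for every descriptor: min_u ||x - uV||_2^2 + lam * |u|_1
    U = sparse_encode(descriptors, codebook, algorithm="lasso_lars", alpha=lam)  # (M, K)
    # Max pooling (6): z_j = max_i |u_ij|
    return np.abs(U).max(axis=0)                                                 # (K,)

# Illustrative usage: pool every region of the spatial pyramid and concatenate
# pyramid_feature = np.hstack([scspm_region_feature(r, V) for r in regions])
```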
Fig. 2. Sample images of datasets used in this experiment: (a) – Caltech-101, (b) – Scene-15
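The second sketch illustrates the multi-kernel side (7)-(9) for a single binary problem p. It is not the SimpleMKL solver of [17]; it only shows how a weighted combination of base kernels is passed to a standard SVM as a precomputed Gram matrix, and how the gradient (9) with respect to the kernel weights d_m can be read off the dual coefficients of the fitted model. The base kernels and their parameters (linear, degree-2 polynomial and Gaussian) are illustrative stand-ins for the fusions listed in Tab. 1.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

def base_kernels(X1, X2):
    # Illustrative base kernels K_m; the fusions and coefficients actually used are in Tab. 1
    return [linear_kernel(X1, X2),
            polynomial_kernel(X1, X2, degree=2),
            rbf_kernel(X1, X2, gamma=0.5)]

def objective_and_gradient(X, y, d, C=1.0):
    """Fit a binary SVM on the fused kernel sum_m d_m K_m and return the gradient (9) wrt d."""
    Ks = base_kernels(X, X)
    K = sum(w * Km for w, Km in zip(d, Ks))          # fused Gram matrix used in (8)
    svm = SVC(C=C, kernel="precomputed").fit(K, y)   # y must be binary (one pair p)
    sv = svm.support_                                # indices of the support vectors
    ay = svm.dual_coef_.ravel()                      # alpha_i * y_i for the support vectors
    # (9): dJ/dd_m = -(1/2) sum_ij alpha_i alpha_j y_i y_j K_m(z_i, z_j)
    grad = np.array([-0.5 * ay @ Km[np.ix_(sv, sv)] @ ay for Km in Ks])
    return svm, grad

# A SimpleMKL-style solver [17] would repeat this step, updating d by reduced-gradient
# descent while keeping the weights on the simplex (d_m >= 0, sum_m d_m = 1).
```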
4 Experiments and results

The improvement of the classification accuracy of an SVM requires fine-tuning of the kernel within the SVM. Therefore, different fusions of kernels are used, and the classification accuracy is analysed rigorously. The benchmark datasets Caltech-101 [27] and Scene-15 [28-30] are used in the experiment. The Caltech-101 dataset contains 9145 images in 101 different classes with various object categories such as animals, instruments, vehicles, flowers, plants, etc. The dataset contains 40 to 800 images per class. The Scene-15 dataset contains the main indoor and outdoor scenes, such as the kitchen, living room, offices, etc. Although the number of classes is smaller, it has low inter-class covariance, which makes it difficult to achieve high accuracy. A total of 4000 images are available in 15 classes, with 200 to 400 images per class. Sparse coding of the features and a multi-resolution approach using spatial pyramid matching, which is robust to local spatial translation [18], are used. This reduces the training complexity from O(n^3) to O(n) and keeps the testing complexity constant.

The sparsified SIFT feature sets for Caltech-101 were divided into 30 and 15 training samples per class, with the remainder used for testing. The Scene-15 dataset was divided into 50 and 100 training images per class, and the remaining images were left for testing. In this experiment, we extracted the SIFT features as in our previous work on parametrizing SIFT and sparse dictionaries [26]. In total, six parameters are involved in the extraction of the SIFT features: the number of Gaussian functions, their variance, the amount of image scaling, the orientation of the histogram bins, their radius, and the feature-vector size. The empirical study presented in [26] suggests that the size of the SIFT feature depends on the number of bins and angles, and that large bins with few angles fail to extract local features and hence degrade the classification accuracy. Therefore, the proposed method uses 16 orientations and four bins for the SIFT features. A detailed discussion of the other types of SPM (KSPM and LSPM) and of MKL is excluded; their results are used only in the comparison. ScSPM [18] used a linear kernel on spatial-pyramid pooling of sparsified SIFT features, whereas in the proposed experiment the linear kernel is replaced by MKL as suggested in [17].

A rigorous study of the literature and of the ScSPM pipeline shows that the patch size relative to the dictionary size, the number of training and testing samples, and the SVM configuration contribute significantly to the improvement of the classification rate.

The patch size relative to the dictionary size contributes to the sparsity of the features, as expressed in (3). In the proposed experiment, a 256 × 1024 dictionary and 16 × 16 patches are used. The dictionary is trained with KSVD for 30 iterations. The average coefficient values of the learned KSVD dictionary over the 30 iterations are shown in Fig. 3. The one-versus-rest SVM approach is used in training. The fusions of the kernels with the values of their coefficients are listed in Tab. 1. The performance was tested over five independent runs, and the average accuracy over the five runs is reported. The experiments were conducted on a machine with an Intel Core i3 at 2.50 GHz, 8 GB of RAM, and 64-bit Windows 10.
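The split-and-average protocol just described (a fixed number of training images per class, the remainder for testing, averaged over five independent runs) can be outlined as follows. This is a minimal sketch of the evaluation loop, not the authors' code; model_factory is a placeholder for the multi-kernel SVM training stage described above, and Z and y in the usage comment are hypothetical arrays of pooled features and labels.

```python
import numpy as np

def evaluate(features, labels, model_factory, n_train_per_class=30, n_runs=5, seed=0):
    """Average accuracy over random per-class splits (eg 30/15 per class for Caltech-101,
    100/50 for Scene-15), repeated for n_runs independent runs."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.nonzero(labels == c)[0])
            train_idx.extend(idx[:n_train_per_class])   # fixed number of training images per class
            test_idx.extend(idx[n_train_per_class:])    # remaining images are used for testing
        model = model_factory().fit(features[train_idx], labels[train_idx])
        accs.append(np.mean(model.predict(features[test_idx]) == labels[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))

# Hypothetical usage with a placeholder classifier over precomputed ScSPM features Z:
# from sklearn.svm import SVC
# mean_acc, std_acc = evaluate(Z, y, lambda: SVC(kernel="rbf"), n_train_per_class=30)
```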
Table 2 presents a comparison of the obtained results with other state-of-the-art methods. The proposed method uses the same dictionary sizes as used in [26] and [18]. Wang et al [31] presented an SVM-based scene classification model where images are characterized using SIFT features obtained from the
Table 2. Comparison of the obtained results with other state-of-the-art methods

Dataset: Caltech-101

Algorithm | Average accuracy (%) | Training images | Method name
ScSPM [18] | 67.0 ± 0.45 | – | SPM sparse coding
ScSPM [18] | 73.02 ± 0.54 | 30 | SPM sparse coding
BOW(400) [19] | 72.02 | 30 | Bag of words
BOW(1000) [19] | 70.11 | 30 | Bag of words
BOW(4000) [19] | 71.24 | 30 | Bag of words
NBNN [20] | 70.4 | 15 | Naive-Bayes nearest-neighbor
LVFC-HSF [21] | 70.7 | – | Local visual feature coding based on heterogeneous structure fusion
LVFC-HSF [21] | 78.7 | – | Local visual feature coding based on heterogeneous structure fusion
CLGC(RGB-RGB) [22] | 72.6 | 30 | Concatenation of local and global color
CSAE [23] | 64.0 | 15 | Convolutional sparse auto-encoder
CSAE [23] | 71.4 | 15 | Convolutional sparse auto-encoder
LMMK [24] | 62.3 | – | Large margin multiple kernel
Parameterizing ScSPM [26] | 77.08 ± 0.31 | 30 | Parameterizing SPM sparse coding
Proposed method | 79.29 ± 0.43 | 15 | kernel K1
Proposed method | 85.06 ± 0.31 | 30 | kernel K1
Proposed method | 79.87 ± 0.36 | 15 | kernel K2
Proposed method | 85.72 ± 0.47 | 30 | kernel K2
Proposed method | 78.96 ± 0.24 | 15 | kernel K3
Proposed method | 84.97 ± 0.21 | 30 | kernel K3

Dataset: Scene-15

Algorithm | Average accuracy (%) | Training images | Method name
ScSPM [18] | 87.28 ± 0.93 | – | SPM sparse coding
LVFC-HSF [21] | 87.23 | 100 | Local visual feature coding based on heterogeneous structure fusion
OVH [25] | 87.07 | – | Orthogonal vector histogram
Parameterizing ScSPM [26] | 81.13 ± 0.53 | – | Parameterizing SPM sparse coding
Proposed method | 81.94 ± 0.54 | 50 | kernel K1
Proposed method | 89.12 ± 0.41 | 100 | kernel K1
Proposed method | 83.32 | 50 | kernel K2
Proposed method | 91.12 ± 0.57 | 100 | kernel K2
Proposed method | 85.55 ± 0.40 | 50 | kernel K3
Proposed method | 90.52 ± 0.21 | 100 | kernel K3
computation complexity for higher-dimensional feature vectors. The overall accuracy they reported is about 3 % higher than ours, but in their experiment the number of test images from each class is fixed to 15. For the Scene-15 dataset, OVH [25] calculates a global rotation-invariant geometric visual word to relate with BoVW as spatial information, but it cannot take advantage of distinct local information.

The proposed approach increases the accuracy to 85.72 % for the large multiclass Caltech-101 dataset and to 91.12 % for the Scene-15 dataset. The classification of the features using kernels K1, K2, and K3 provides a better classification rate than the large margin multiple kernel (LMMK). The improvement of the sparse-based learning model depends on the configuration of the parameters in the sparse dictionary and on the extraction of robust feature sets. Also, the optimum selection of the dictionary size, the patch size, and the integration of the kernels plays a vital role in classifying a large, confusing multi-label dataset. The non-linear nature of the polynomial and Gaussian kernels helped distinguish these features in the SVM, and hence the proposed model achieved a better classification rate.

5 Conclusions

CNN has obtained large popularity in classification models at the cost of long training time and increased computation cost. In comparison with CNN, SVM is found to have greater flexibility in characterization if an appropriate kernel is used for challenging datasets. A single kernel limits its application to datasets where linear classification suffices. Therefore, a multi-kernel SVM has been experimented with, with the aim of optimizing the selection of the kernels and of studying the various parameters affecting the kernel performance in classification. The investigation of SimpleMKL over ScSPM features for classification accuracy is presented first, and the role of various parameters has been explored to minimize the redundant features. Then a sparse dictionary is created to minimize the feature size.

After obtaining the maximum sparsity of the dictionary, the effect of MKL on the overall classification accuracy is presented. We note that even with a minimal combination of a single kernel type, such as the polynomial kernel shown in Tab. 2, the accuracy is higher than that of the single-kernel SVM algorithm. Multiple combinations of Gaussian kernels lead to an increase in the classification accuracy to 85.72 % for the 101-class dataset. We observe that the training time and the storage requirement also increase with a higher number of Gaussian kernels, which makes it difficult to work on large datasets such as Caltech-256 with minimal hardware requirements. Hence we conclude that, even with good features and multiple kernels, object recognition is still an open area of work. Coral reef classification using image augmentation in [32] gives promising results on a limited dataset. In that work, RGB and gray colours are used as features for predicting corals that have the most similarity. In this work, there are many classes in the datasets that have this kind of similarity, eg, sunflowers and water lily in Caltech-101, and bedroom and living room in the Scene-15 dataset. In future work, we will examine the effect of this feature on such alike classes.

References

[1] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491-502, 2005.
[2] S. S. Bucak, R. Jin, and A. K. Jain, "Multiple kernel learning for visual object recognition: A review", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1354-1369, 2013.
[3] M. Varma and D. Ray, "Learning the discriminative power-invariance trade-off", in 2007 IEEE 11th International Conference on Computer Vision, pp. 1-8, IEEE, 2007.
[4] S. Xu and X. An, "ML2S-SVM: multi-label least-squares support vector machine classifiers", The Electronic Library, 2019.
[5] D. Kancherla, J. D. Bodapati, and N. Veeranjaneyulu, "Effect of different kernels on the performance of an SVM based classification", Int. J. Recent Technol. Eng., no. 5, pp. 1-6, 2019.
[6] S. Bouteldja and A. Kourgli, "A comparative analysis of SVM, k-NN, and decision trees for high resolution satellite image scene classification", in Twelfth International Conference on Machine Vision (ICMV 2019), vol. 11433, p. 114331I, International Society for Optics and Photonics, 2020.
[7] D. Santos, E. Lopez-Lopez, X. M. Pardo, R. Iglesias, S. Barro, and X. R. Fdez-Vidal, "Robust and fast scene recognition in robotics through the automatic identification of meaningful images", Sensors, vol. 19, no. 18, p. 4024, 2019.
[8] X. Bai, J. Du, Z.-R. Wang, and C.-H. Lee, "A hybrid approach to acoustic scene classification based on universal acoustic models", in Interspeech, pp. 3619-3623, 2019.
[9] S. Nazir, Y. Qian, M. Yousaf, S. A. V. Carroza, E. Izquierdo, and E. Vazquez, "Human action recognition using multi-kernel learning for temporal residual network", 2019.
[10] Y. Wang, W. Yu, and Z. Fang, "Multiple kernel based SVM classification of hyperspectral images by combining spectral, spatial, and semantic information", Remote Sensing, vol. 12, no. 1, p. 120, 2020.
[11] C. Tong-Tong, L. Chan-Juan, Z. Hai-Lin, Z. Shu-Sen, L. Ying, and D. Xin-Miao, "A multi-instance multi-label scene classification method based on multi-kernel fusion", in 2015 SAI Intelligent Systems Conference (IntelliSys), pp. 782-787, IEEE, 2015.
[12] H. Hasan, H. Z. Shafri, and M. Habshi, "A comparison between support vector machine (SVM) and convolutional neural network (CNN) models for hyperspectral image classification", in IOP Conference Series: Earth and Environmental Science, vol. 357, p. 012035, IOP Publishing, 2019.
[13] A. Sampath and N. Gomathi, "Fuzzy-based multi-kernel spherical support vector machine for effective handwritten character recognition", Sadhana, vol. 42, no. 9, pp. 1513-1525, 2017.
[14] H. Patel and H. Mewada, "Analysis of machine learning based scene classification algorithms and quantitative evaluation", International Journal of Applied Engineering Research, vol. 13, no. 10, pp. 7811-7819, 2018.
[15] F. Zamani and M. Jamzad, "A feature fusion based localized multiple kernel learning system for real world image classification", EURASIP Journal on Image and Video Processing, vol. 2017, no. 1, pp. 1-11, 2017.
[16] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms", in Advances in Neural Information Processing Systems, pp. 801-808, 2007.
[17] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL", Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008.
[18] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification", in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1794-1801, IEEE, 2009.
[19] H. Liao, J. Xiang, W. Sun, and S. Yu, "Adaptive aggregating multi-resolution feature coding for image classification", Mathematical Problems in Engineering, vol. 2014, 2014.
[20] O. Boiman, E. Shechtman, and M. Irani, "In defense of nearest-neighbor based image classification", in 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, IEEE, 2008.
[21] G. Lin, C. Fan, H. Zhu, Y. Miu, and X. Kang, "Visual feature coding based on heterogeneous structure fusion for image classification", Information Fusion, vol. 36, pp. 275-283, 2017.
[22] L. Kabbai, M. Abdellaoui, and A. Douik, "Image classification by combining local and global features", The Visual Computer, vol. 35, no. 5, pp. 679-693, 2019.
[23] W. Luo, J. Li, J. Yang, W. Xu, and J. Zhang, "Convolutional sparse autoencoders for image classification", IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3289-3294, 2017.
[24] B. Hosseini and B. Hammer, "Large-margin multiple kernel learning for discriminative features selection and representation learning", in 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1-8, IEEE, 2019.
[25] B. Zafar, R. Ashraf, N. Ali, M. Ahmed, S. Jabbar, and S. A. Chatzichristofis, "Image classification by addition of spatial information based on histograms of orthogonal vectors", PLoS One, vol. 13, no. 6, p. e0198175, 2018.
[26] B. Gajjar, H. Mewada, and A. Patani, "Parameterizing SIFT and sparse dictionary for SVM based multi-class object classification", International Journal of Artificial Intelligence, vol. 19, pp. 95-108, 2021.
[27] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594-611, 2006.
[28] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope", International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, 2001.
[29] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories", in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2, pp. 524-531, IEEE, 2005.
[30] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories", in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2, pp. 2169-2178, IEEE, 2006.
[31] H.-H. Wang, C.-W. Tu, and C.-K. Chiang, "Sparse representation for image classification via paired dictionary learning", Multimedia Tools and Applications, vol. 78, no. 12, pp. 16945-16963, 2019.
[32] S. Sharan, S. Kininmonth, U. V. Mehta, et al, "Automated CNN based coral reef classification using image augmentation and deep learning", International Journal of Engineering Intelligent Systems, vol. 29, no. 4, pp. 253-261, 2021.

Received 3 May 2021

Bhavinkumar Gajjar holds an MTech degree in Communication System Engineering from Gujarat Technological University, India. His fields of interest are image processing, computer vision and optimization algorithms. Currently he is working on accuracy enhancement for multiclass classification techniques as a research scholar at Indus University. He is a professional software developer working at Arohi Operations Pvt Ltd. He has 7 years of academic and 3.5 years of industrial experience. He has six international and two national publications in reputed journals/conferences.

Hiren Mewada obtained his MTech and PhD degrees from Sardar Vallabhbhai National Institute of Technology, Surat, Gujarat, India. Presently he is Assistant Research Professor at Prince Mohammad Bin Fahd University, Kingdom of Saudi Arabia. Previously he was associate professor at Charotar University of Science and Technology, Gujarat, India. He has more than 17 years of teaching experience. His current areas of interest are computer vision, signal processing, machine learning and embedded system design. He has published more than 60 research papers and completed several funded research projects. He is co-author of one book and has published five book chapters. He is a member of IETE and ISTE.

Ashwin Patani obtained his MTech from Gujarat University, Gujarat, and his PhD degree from Meghalaya University, Meghalaya, India. Presently he is Senior Assistant Professor at Indus University, Ahmedabad, Gujarat. He has more than 15 years of teaching experience. His current areas of interest are sensors & networks, machine learning and embedded system design. He has published more than 20 research papers. He is the author of one book. He is a member of IETE and ISTE.