Adaptive Feature Selection and Image Classification Using Manifold Learning Techniques
Corresponding authors: Amna Ashraf ([email protected]) and Nazri Mohd Nawi ([email protected])
ABSTRACT Manifold learning techniques target the non-linear dimensionality reduction of data. Dimensionality reduction is a field of interest to many data analysts and is widely used in computer vision, image processing, pattern recognition, neural networks, and machine learning. This research is divided into two phases to demonstrate the importance of manifold learning techniques. In the first phase, manifold learning is used to improve feature selection by clustering: clustering algorithms such as K-means, spectral clustering, and the Gaussian Mixture Model are tested with manifold learning approaches for adaptive feature selection, and the results are satisfactory compared with plain clustering. In the second phase, a Triple Layered Convolutional Architecture (TLCA) is proposed for image classification, reaching accuracies of 85.34%, 59.14%, 71.43%, 90.06%, and 71.71% on the Pistachio, Animal, HAR, Mango Leaves, and Cards datasets respectively. The performance of the proposed TLCA model is compared with other deep learning models, i.e., CNN, LSTM, and GRU. To further improve accuracy, reduced-dimensional data from a manifold learning technique is used, and the resulting Hybrid Triple Layered Convolutional Architecture (HTLCA) achieves higher accuracies of 97.73%, 87.18%, 97.97%, 99.19%, and 96.91% for the same sequence of datasets. The experimental findings demonstrate the effectiveness and precision of the suggested methods.
INDEX TERMS Clustering, feature extraction, feature selection, triple layered convolutional architecture.
FIGURE 2. Adaptive feature selection: a feature extractor combined with clustering algorithms for feature selection.
manifold learning techniques are hybridized with the intention of attaining better performance. In the following sections, the manifold learning techniques and the clustering techniques are explained respectively. The second phase of the research concerns image classification, where a new model, TLCA, is proposed and three state-of-the-art algorithms are tested and evaluated for image classification.

A. ADAPTIVE FEATURE SELECTION
The objective of feature selection for clustering is to select a set of the most relevant features that facilitate the discovery of natural clusters in the data, according to the selected criterion [17]. These selected features may lead to the best version of relevant features if a suitable feature extraction technique is applied to consider the spatial features of image data X. Fig. 2 represents the complete flow of how the data spectrum is used to capture spectral features and how feature extraction is performed. As normal preprocessing steps, data normalization and data scaling of a spectral signature are used to provide the spectral features of an image. Therefore, we add feature extraction using isomap, LLE, UMAP, or PHATE. These techniques are known as manifold learning techniques and are explained in Section II.

Clustering algorithms discussed in the literature are sensitive to dataset size, dimensionality, or both. An entropy-based solution has been proposed for the ranking of features [18]. The key issue with that approach is the repeated calculation of the information-entropy-based significance of an attribute set, which slows down feature selection for large datasets. Consequently, feature extraction followed by feature selection assists in this regard. Adaptive feature selection involves different combinations of clustering techniques and manifold learning techniques; a minimal sketch of this pipeline follows, and the experimentally tested clustering methodologies are discussed below.
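To make this pipeline concrete, here is a minimal sketch of adaptive feature selection under stated assumptions: scikit-learn for scaling and clustering and the umap-learn package for the manifold step (isomap, LLE, or PHATE would be drop-in alternatives); the synthetic data and all parameter values are illustrative, not the paper's configuration.

```python
# Adaptive feature selection sketch: normalize/scale the spectral features,
# embed them with a manifold learning technique, then cluster the embedding.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))  # stand-in for spectral features of an image

X_scaled = StandardScaler().fit_transform(X)          # normalization/scaling
embedding = umap.UMAP(n_components=2).fit_transform(X_scaled)  # manifold step
labels = KMeans(n_clusters=3, n_init=10).fit_predict(embedding)  # clustering
```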
1) K-MEANS
K-means works in an iterative process [19] of assigning all the data points to groups, starting from an initial supposition of a specific centroid for each cluster. This assignment of data points is done by calculating the Euclidean distance (8) between the data points and the supposed centroids:

d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }    (8)

The centroids chosen for a fixed number of clusters in the first step keep changing so as to minimize the sum of distances between the data points and the assigned centroids:

C_i = \frac{1}{|N_i|} \sum_{x_j \in N_i} x_j    (9)

where N_i denotes the set of points currently assigned to cluster i. A sketch of this iteration is given below.
These techniques are known as manifold learning techniques 2) SPECTRAL CLUSTERING
explained in Section II. Numerous fields, such as data analysis, video indexing,
Clustering algorithms discussed in the literature are sen- character identification, image processing, speech separa-
sitive to largeness or dimensionality or both. There is an tion, etc., have effectively implemented spectral clustering.
entropy-based solution is proposed for the ranking of fea- In these applications and many more, the number of data
tures [18]. The key issue regarding this resolution is the elements to cluster can be extraordinarily large [20]. Basic
repeated calculations required for the information-entropy- concepts of spectral clustering involve algebraic graph the-
based significance of an attribute set, which slows down ory and graph cut methods. The advanced development of
feature selection for large datasets. Consequently, fea- spectral clustering comprises the aspects of similarity matrix,
ture extraction followed by feature selection assisted in Laplacian matrix, selecting eigenvectors, and the number
this regard. Adaptive feature selection involves different of clusters chosen. The main focus of spectral clustering is
Data within the same category should have a high degree of similarity and adhere to space consistency. The measurement of similarity is vital to the efficacy of spectral clustering [21]. As a rule, the Gaussian kernel function is chosen as the similarity measure.

Following the construction of a similarity matrix, the corresponding Laplacian matrix is created using various graph cut methods. The efficacy of spectral clustering algorithms is significantly influenced by the selection of graph cut methods and the construction of Laplacian matrices. Through eigen-decomposition, the eigenvalues and eigenvectors of a Laplacian matrix can be determined. An analysis of the properties of the eigenspace demonstrates that: (a) not every eigenvector of a Laplacian matrix is relevant for clustering; (b) eigenvector selection is crucial, because using uninformative eigenvectors could result in poor clustering results; and (c) the corresponding eigenvalues cannot be used to select relevant eigenvectors for a realistic dataset.
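These steps can be sketched under stated assumptions: a Gaussian-kernel similarity matrix, the symmetric normalized Laplacian, and k-means on the k smallest eigenvectors; sigma, k, and this particular Laplacian are illustrative choices rather than the paper's settings.

```python
# Spectral clustering sketch: similarity matrix -> normalized Laplacian ->
# eigen-decomposition -> k-means on the leading eigenvectors.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Gaussian kernel similarity (the usual choice, as noted in the text).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the first k columns are the eigenvectors used for the embedding.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```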
3) GAUSSIAN MIXTURE MODEL
The Gaussian Mixture Model (GMM) [22] works much as k-means does, but k-means performs well only for data distributed over circular shapes: it gathers points into roughly circular clusters whose radius is defined by the most distant point. In the case of GMM, the clusters can be oblong, depending upon the data distribution. Besides assigning a cluster to each point, GMM estimates the probability of each point belonging to each cluster.
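A short sketch of this soft assignment, assuming scikit-learn's GaussianMixture; the synthetic data and the component count are placeholders.

```python
# GMM clustering: full covariances allow oblong clusters, and predict_proba
# exposes the per-point membership probabilities mentioned above.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # stand-in data

gmm = GaussianMixture(n_components=3, covariance_type='full').fit(X)
hard_labels = gmm.predict(X)        # one cluster label per point
soft_labels = gmm.predict_proba(X)  # probability of each point per cluster
```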
B. IMAGE CLASSIFICATION
A popular technique for classifying hyperspectral images is supervised classification. The fundamental procedure is to calculate the discriminant function and then establish the discriminant criterion based on the given sample category and prior knowledge. Support vector machines, artificial neural networks (ANN) [23], convolutional neural networks (CNN) [24], long short-term memory (LSTM), decision trees, gated recurrent unit networks (GRU) [25], and maximum likelihood classification are frequently employed supervised classification methods. Some of these are described below.
1) CNN
A CNN's structure includes convolutional, pooling, non-linear activation, and fully connected layers. In general, the image is preprocessed [26] before being provided to the network via the input layer, passed through a series of alternately arranged convolutional and pooling layers, and then a fully connected layer is used for classification.

Compared with the Multilayer Perceptron (MLP), CNN [27], [28] adds the very distinctive convolutional and pooling layers. For large datasets, CNN exhibits exceptional cost performance in terms of model size, and its performance is also better. The convolutional layer has the property of a local receptive field, which retains the input shape. Another point to be noted is that the convolutional layer repeatedly applies the same convolution kernel at various input positions through a sliding window, thereby effectively preventing the number of training parameters from becoming excessively large. The pooling layer reduces the computational load by minimizing the number of connections between the convolutional layers [29] and alleviates the convolutional layer's excessive position sensitivity. CNN ensures the invariance of input image pixels with respect to displacement, scaling, and distortion.
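As a sketch of this alternating convolution/pooling structure (a generic small CNN, not the paper's TLCA), a tf.keras model might look as follows; the input shape, filter counts, and 10-way output are illustrative assumptions.

```python
# Generic CNN: stacked conv/pool blocks followed by a fully connected
# classifier, mirroring the layer ordering described above.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                 # preprocessed image
    tf.keras.layers.Conv2D(32, 3, activation='relu'),  # local receptive field
    tf.keras.layers.MaxPooling2D(),                    # fewer connections
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),   # fully connected output
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```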
2) LSTM
Long Short-Term Memory (LSTM) is a sophisticated form of Recurrent Neural Network (RNN) that captures long-term dependencies. LSTM was introduced in 1997 [30] and improved in 2013 [31], garnering a great deal of popularity in the deep learning community. LSTM models have proven more effective than standard RNNs at retaining and utilizing information over extended sequences [32].

In an LSTM network, the current input at a particular time step and the output from the previous time step are supplied to the LSTM unit, which in turn generates an output that is passed on to the subsequent time step. Commonly, the final hidden state of the last time step, and sometimes all hidden states, are used for classification purposes [33].

Three gates comprise LSTM: the input gate, the forget gate, and the output gate. Each gate serves a distinct purpose in regulating the passage of information. Based on the current input and the preceding internal state, the input gate determines how to update the internal state. The forget gate determines how much of the preceding internal state should be forgotten. Lastly, the output gate regulates the effect of the internal state on the output [34].
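The gate logic can be made concrete with a single-step NumPy sketch; the parameter names (p['Wi'], p['Ui'], ...) and the candidate-state path are the standard LSTM formulation [30], written out here only for illustration.

```python
# One LSTM time step: the input, forget, and output gates regulate how the
# cell state c and hidden state h are updated from the current input x_t.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    i = sigmoid(x_t @ p['Wi'] + h_prev @ p['Ui'] + p['bi'])  # input gate
    f = sigmoid(x_t @ p['Wf'] + h_prev @ p['Uf'] + p['bf'])  # forget gate
    o = sigmoid(x_t @ p['Wo'] + h_prev @ p['Uo'] + p['bo'])  # output gate
    g = np.tanh(x_t @ p['Wg'] + h_prev @ p['Ug'] + p['bg'])  # candidate state
    c_t = f * c_prev + i * g   # forget part of the old state, admit new input
    h_t = o * np.tanh(c_t)     # output gate modulates the exposed state
    return h_t, c_t
```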
3) GRU
A gated recurrent unit (GRU) is an improvement on the conventional RNN. In 2014, Kyunghyun Cho [35] introduced it for statistical machine translation. GRUs are broadly similar to LSTMs and likewise employ gates to control the flow of information. They are comparatively more recent than LSTM and are superior to it in terms of simplicity of architecture.

Unlike LSTM, a GRU lacks a distinct cell state (C_t) and possesses only a hidden state (H_t). Due to this simplified architecture, GRUs can be trained more quickly. Only two gates comprise GRU: the reset gate and the update gate. The equations for their functionalities are as follows:

r_t = \sigma(x_t \times U_r + H_{t-1} \times W_r)    (10)

u_t = \sigma(x_t \times U_u + H_{t-1} \times W_u)    (11)

The reset gate uses equation (10), where U_r and W_r are the weight matrices for the reset gate. Similarly, the update gate uses equation (11), where U_u and W_u are the weight matrices for the update gate.
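A NumPy sketch of one GRU step follows. The reset and update gates implement equations (10) and (11); the candidate and hidden-state updates follow the standard GRU formulation of [35], which the text above does not spell out.

```python
# One GRU time step: only a hidden state h is carried, and two gates
# (reset r_t, update u_t) control how it is refreshed.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Ur, Wr, Uu, Wu, Uh, Wh):
    r_t = sigmoid(x_t @ Ur + h_prev @ Wr)              # reset gate, eq. (10)
    u_t = sigmoid(x_t @ Uu + h_prev @ Wu)              # update gate, eq. (11)
    h_cand = np.tanh(x_t @ Uh + (r_t * h_prev) @ Wh)   # candidate state
    return (1.0 - u_t) * h_prev + u_t * h_cand         # blended hidden state
```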
C. EXPERIMENTAL SETUP
To evaluate the proposed adaptive feature selection approach and the image classification model TLCA, we used the following experimental setup and five datasets, described below. The experimental setup involves disk storage, system RAM, and GPU RAM as hardware requirements, and Python3 as the software prerequisite. Depending on dataset size and model …
TABLE 2. Dataset description.

… and 10% for validation. Their distribution can be seen in the table. The distribution of data over the different classes in each dataset is demonstrated in the histograms shown in Fig. 4. The Pistachio and Animal datasets are binary-class datasets while the others are multiclass. The 'Human Action Recognition' (HAR) dataset is balanced, which means each class has an equal number of images; the 'Mango Leaves' dataset is the most imbalanced, while the others are nearly balanced.
TABLE 4. Image classification results.

Training accuracy and validation accuracy curves are demonstrated in the upper half of Fig. 5 for each of the five datasets; similarly, training loss and validation loss curves are shown in the lower half of each figure. It can be observed from the accuracy plots of three datasets that accuracy rises over 20 epochs and the model performed well against the state-of-the-art algorithms, while convergence in the loss plots is less pronounced, leaving room for over-fitting to be handled. In HTLCA, we first extract features, then reduce dimensions using PHATE, and then train TLCA offline using the labeled image dataset.
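A hedged sketch of this route, assuming the phate package for the reduction and a stand-in dense classifier in place of TLCA (whose layer configuration is not reproduced in this excerpt); the array shapes, component count, and epoch budget are illustrative.

```python
# HTLCA-style pipeline sketch: extracted image features -> PHATE dimension
# reduction -> offline supervised training on the reduced features.
import numpy as np
import phate  # pip install phate
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 256))   # stand-in for extracted image features
y = rng.integers(0, 4, size=400)  # stand-in labels

X_reduced = phate.PHATE(n_components=10).fit_transform(X)  # manifold step

clf = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax'),
])
clf.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
clf.fit(X_reduced, y, epochs=20, verbose=0)  # offline training
```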
V. DISCUSSION
Adaptive feature selection via clustering is introduced in the first phase of this research, where K-means, GMM, and spectral clustering are paired with manifold learning techniques. It is evident that feature selection can be significantly improved using manifold learning techniques. In the second phase of experimentation, the proposed image classification model HTLCA is tested, and its accuracy is compared with state-of-the-art classification models: CNN, LSTM, and GRU. The association with manifold learning further improves classification performance. The convergence graphs for the training/validation accuracy and loss of the proposed model show how it behaves (Fig. 5): a good fit in some cases and overfitting for other datasets. The results show that there is no overfitting in the case of small datasets.
Among the model accuracies mentioned in Table 4, TLCA achieved the best classification performance for the Pistachio and Cards image datasets, with accuracies of 85.34% and 71.71%. Using PHATE as a preprocessing step (HTLCA) increases the classification accuracy up to 97.73%, 60.18%, 97.97%, and 95.65% for the Pistachio, Animal, HAR, and Cards datasets respectively. Some useful insights observed in the experimental results follow.

For the larger datasets Pistachio, HAR, Mango Leaves, and Cards, we observe a smooth TLCA accuracy curve (Fig. 5), while for the 'Animal' dataset, jerks are found. This dataset is about 100 times smaller than the others, so adequate training data is required for better model performance. For the Pistachio, Mango Leaves, and Animal datasets, the training accuracy and validation accuracy curves lie much closer to each other, which depicts very little overfitting; the model is performing well on unseen data. Contrary to this, the data is overfitted for the HAR and Cards datasets. Accuracy may be compromised for smaller datasets and for datasets where the data distribution over classes is not balanced. For instance, the Animal dataset is small (Table 2), and for Mango Leaves, the data distribution over 16 classes is unbalanced (see Fig. 4).
VI. CONCLUSION AND FUTURE WORK
Manifold learning is a machine learning and data analysis technique that extracts significant features from high-dimensional data. Different clustering algorithms perform differently on various datasets for feature selection. Their accuracies can be enhanced using manifold learning techniques, i.e., PHATE, UMAP, isomap, and LLE. The extracted features can also assist in image classification. Therefore, feature extraction by manifold learning followed by adaptive feature selection or image classification performs well, as depicted by the experimental results. The Animal, HAR, and Cards datasets perform better with PHATE followed by K-means, while for the Pistachio dataset it is Isomap followed by K-means that performs well. In the second phase of experimentation, the proposed image classification model TLCA is evaluated against modern classification models (CNN, LSTM, and GRU) and attained accuracies of 97.73%, 60.18%, 97.97%, and 95.65% for the Pistachio, Animal, HAR, and Cards datasets respectively.

In the future, this research can be extended with dimension reduction by auto-encoders. Given how drastically the results show performance accelerating with manifold learning techniques, further feature reduction could lower training times while enhancing, or at least retaining, the accuracy of feature selection and image classification. Moreover, work can be done to resolve the data overfitting issues.

DATA AVAILABLE STATEMENT
Code will be available on demand.
REFERENCES
[1] A. J. Izenman, "Introduction to manifold learning," Wiley Interdiscip. Rev. Comput. Stat., vol. 4, no. 5, pp. 439–446, 2012, doi: 10.1002/wics.1222.
[2] E. Oja, "The nonlinear PCA learning rule in independent component analysis," Neurocomputing, vol. 17, pp. 25–45, Sep. 1997, doi: 10.1016/S0925-2312(97)00045-3.
[3] G. H. L. van der Maaten, "Visualizing data using t-SNE," Ann. Oper. Res., vol. 219, no. 1, pp. 187–202, 2014, doi: 10.1007/s10479-011-0841-3.
[4] Y. Zhang, Z. Zhang, J. Qin, L. Zhang, B. Li, and F. Li, "Semi-supervised local multi-manifold isomap by linear embedding for feature extraction," Pattern Recognit., vol. 76, pp. 662–678, Apr. 2018, doi: 10.1016/j.patcog.2017.09.043.
[5] D. Lunga, S. Prasad, M. M. Crawford, and O. Ersoy, "Manifold-learning-based feature extraction for classification of hyperspectral data: A review of advances in manifold learning," IEEE Signal Process. Mag., vol. 31, no. 1, pp. 55–66, Jan. 2014, doi: 10.1109/MSP.2013.2279894.
[6] J. Zhang, S. Z. Li, and J. Wang, "Manifold learning and applications in recognition," in Intelligent Multimedia Processing With Soft Computing. Berlin, Germany: Springer, 2006, pp. 281–300, doi: 10.1007/3-540-32367-8_13.
[7] Y. Fan and Z. Zhao, "Cryo-electron microscopy image analysis using multi-frequency vector diffusion maps," 2019, arXiv:1904.07772.
[8] B. Liu, S.-X. Xia, F.-R. Meng, and Y. Zhou, "Manifold regularized extreme learning machine," Neural Comput. Appl., vol. 27, no. 2, pp. 255–269, Feb. 2016, doi: 10.1007/s00521-014-1777-8.
[9] X. Chen, R. Chen, Q. Wu, F. Nie, M. Yang, and R. Mao, "Semisupervised feature selection via structured manifold learning," IEEE Trans. Cybern., vol. 52, no. 7, pp. 5756–5766, Jul. 2022, doi: 10.1109/TCYB.2021.3052847.
[10] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, "Feature selection: A data perspective," ACM Comput. Surv., vol. 50, no. 6, pp. 1–45, Nov. 2018, doi: 10.1145/3136625.
[11] S. Lv, S. Shi, H. Wang, and F. Li, "Semi-supervised multi-label feature selection with adaptive structure learning and manifold learning," Knowl.-Based Syst., vol. 214, Feb. 2021, Art. no. 106757, doi: 10.1016/j.knosys.2021.106757.
[12] J. Alzubi, A. Nayyar, and A. Kumar, "Machine learning from theory to algorithms: An overview," J. Phys., Conf., vol. 1142, Nov. 2018, Art. no. 012012, doi: 10.1088/1742-6596/1142/1/012012.
[13] H. Bhaskar, D. C. Hoyle, and S. Singh, "Machine learning in bioinformatics: A brief survey and recommendations for practitioners," Comput. Biol. Med., vol. 36, no. 10, pp. 1104–1125, Oct. 2006, doi: 10.1016/j.compbiomed.2005.09.002.
[14] D. Liao, Y. Qian, and Y. Y. Tang, "Constrained manifold learning for hyperspectral imagery visualization," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 4, pp. 1213–1226, Apr. 2018, doi: 10.1109/JSTARS.2017.2775644.
[15] N. Hoque, D. K. Bhattacharyya, and J. K. Kalita, "MIFS-ND: A mutual information-based feature selection method," Exp. Syst. Appl., vol. 41, no. 14, pp. 6371–6385, Oct. 2014, doi: 10.1016/j.eswa.2014.04.019.
[16] K. R. Moon, D. van Dijk, Z. Wang, S. Gigante, D. B. Burkhardt, W. S. Chen, K. Yim, A. V. D. Elzen, M. J. Hirn, R. R. Coifman, N. B. Ivanova, G. Wolf, and S. Krishnaswamy, "Visualizing structure and transitions in high-dimensional biological data," Nature Biotechnol., vol. 37, no. 12, pp. 1482–1492, Dec. 2019, doi: 10.1038/s41587-019-0336-3.
[17] J. R. Adhikary and M. N. Murty, "Feature selection for unsupervised learning," in Neural Information Processing (Lecture Notes in Computer Science), vol. 7665. Berlin, Germany: Springer, 2012, pp. 382–389, doi: 10.1007/978-3-642-34487-9_47.
[18] M. Dash and H. Liu, "Feature selection for clustering," in Knowledge Discovery and Data Mining, Current Issues and New Applications (Lecture Notes in Computer Science), vol. 1805. Berlin, Germany: Springer, 2000, pp. 110–121, doi: 10.1007/3-540-45571-x_13.
[19] J. Yadav and M. Sharma, "A review of K-mean algorithm," Int. J. Eng. Trends Technol., vol. 4, no. 7, pp. 2972–2976, 2013.
[20] H. Jia, S. Ding, X. Xu, and R. Nie, "The latest research progress on spectral clustering," Neural Comput. Appl., vol. 24, nos. 7–8, pp. 1477–1486, Jun. 2014, doi: 10.1007/s00521-013-1439-2.
[21] L. Wang, L. F. Bo, and L. C. Jiao, "Density-sensitive spectral clustering," Acta Electron. Sin., vol. 35, no. 8, pp. 1577–1581, 2007.
[22] G. J. McLachlan and S. Rathnayake, "On the number of components in a Gaussian mixture model," WIREs Data Mining Knowl. Discovery, vol. 4, no. 5, pp. 341–355, Sep. 2014, doi: 10.1002/widm.1135.
[23] F. Paquin, J. Rivnay, A. Salleo, N. Stingelin, and C. Silva, "Multi-phase semicrystalline microstructures drive exciton dissociation in neat plastic semiconductors," J. Mater. Chem. C, vol. 3, pp. 10715–10722, Jan. 2015, doi: 10.1039/C5TC02043C.
[24] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, K. Modi, and H. Ghayvat, "CNN variants for computer vision: History, architecture, application, challenges and future scope," Electronics, vol. 10, no. 20, p. 2470, Oct. 2021, doi: 10.3390/electronics10202470.
[25] R. Dey and F. M. Salem, "Gate-variants of gated recurrent unit (GRU) neural networks," in Proc. IEEE 60th Int. Midwest Symp. Circuits Syst. (MWSCAS), Aug. 2017, pp. 1597–1600, doi: 10.1109/MWSCAS.2017.8053243.
[26] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3642–3649, doi: 10.1109/CVPR.2012.6248110.
[27] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1717–1724, doi: 10.1109/CVPR.2014.222.
[28] I. H. Md Yusof, M. An, and M. H. Barghi, "Integration of lean construction considerations into design process of construction projects," in Proc. 31st Annu. Assoc. Res. Constr. Manag. Conf., 2015, pp. 885–894.
[29] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, "Recent advances in convolutional neural networks," Pattern Recognit., vol. 77, pp. 354–377, May 2018, doi: 10.1016/j.patcog.2017.10.013.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[31] A. Graves, "Generating sequences with recurrent neural networks," 2013, arXiv:1308.0850.
[32] F. M. Shiri, T. Perumal, N. Mustapha, R. Mohamed, M. A. B. Ahmadon, and S. Yamaguchi, "A survey on multi-resident activity recognition in smart environments," 2023, arXiv:2304.12304.
[33] S. Minaee, E. Azimi, and A. Abdolrashidi, "Deep-sentiment: Sentiment analysis using ensemble of CNN and bi-LSTM models," 2019, arXiv:1904.04206.
[34] W. Fang, Y. Chen, and Q. Xue, "Survey on research of RNN-based spatio-temporal sequence prediction algorithms," J. Big Data, vol. 3, no. 3, pp. 97–110, 2021, doi: 10.32604/jbd.2021.016993.
[35] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1724–1734, doi: 10.3115/v1/d14-1179.

NAZRI MOHD NAWI received the bachelor's degree from Universiti Sains Malaysia (USM), the master's degree in computer science from Universiti Teknologi Malaysia (UTM), and the Ph.D. degree in data mining from Swansea University, Wales, U.K. He is currently a Professor with the Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), where he has been a faculty member since 2001. In recent years, he has focused on better techniques for classification, and on analyzing and hybridizing new improvements on ANN using meta-heuristic techniques. He has successfully supervised several Ph.D. students; currently, he is supervising eight Ph.D. students and has published more than 100 papers in journals and conference proceedings. His research interests include soft computing and data mining techniques, particularly artificial neural networks, ranging from theory to design and implementation. He has been involved with many conference and workshop program committees and serves as a reviewer for many outstanding journals and international conferences.