1. Introduction
Diabetes mellitus is a chronic disease whose global prevalence was estimated to be 10.5% (536.6 million people) in 2021, which is expected to rise to 12.2% (783.2 million) in 2045 according to the International Diabetes Federation [
1]. Diabetic foot ulcers (DFUs) constitute a long-term and common complication derived from diabetes [
2,
3] with an estimated global prevalence of roughly 6.3% [
4] and a lifetime incidence of between 19% and 34% for the diabetic population [
5]. Ulcers represent the most frequently recognized and highest risk factor for amputation, because infection of the wound often results in the loss of the foot or lower limb. Worldwide, it is estimated that a limb is amputated every 20 s due to diabetes [
6]. Furthermore, the recurrence rate of DFU is high and varies widely among different regions. The recurrence rates remain about 60% after three years [
5], although these figures have been updated and, as of 2019, the recurrence rate estimation was 22.1% per person-year (py) [
7]. The lowest recurrence rate was roughly 16.9% per py in Africa, while the highest was 24.9% per py in Europe [
7].
These complications can be avoided, reduced, or substantially delayed by early detection, assessment, diagnosis, and tailored treatment [
2,
8]. DFU detection by machine learning (ML) or deep learning (DL) approaches is mainly focused on the already formed ulcer [
9,
10]. A large public dataset, composed of 4000 images with ground truth labeling, was released for the Diabetic Foot Ulcers Grand Challenge (DFUC 2020) aiming to improve the detection accuracy in a real-world scenario and to accelerate the development of innovative approaches [
6]. In addition, extensive literature can be found for DFU localization and detection [
11], as well as wound classification [
12,
13,
14]. Furthermore, remote, noncontact, and automated DFU detection may be plausible using mobile and cloud technologies [
6].
Alternatively, identifying the underlying conditions that sustain skin and tissue damage at an early stage, previous to the onset of superficial wounds, is an emerging area of research [
15,
16,
17,
18]. Early diagnosis is extremely valuable for any pathology, particularly when it can prevent a fatal outcome, as in the present application. Infrared thermography has established itself as a complementary tool for the early identification of superficial tissue damage, providing real-time visualization of the plantar temperature distribution while the measured surface remains intact [
3]. Thus, the entire plantar aspect of both feet can be conveniently analyzed in a very short time with great sensitivity and specificity, putting forward thermography as a suitable technique for diabetic neuropathy screening programs [
19]. Nevertheless, the heat pattern of the plantar aspect of the feet and its association with diabetic foot pathologies are subtle and often nonlinear [
20]. Thus, the interpretation of plantar thermograms requires the development of computer-aided diagnosis (CAD) systems that do not rely on subjective interpretation or on the inherent limitations of human visual perception. Consequently, interobserver variability and workload may be decreased, and CAD systems may outperform clinicians in terms of cost, accuracy, and speed, leading to an enhanced level of medical care [
3].
Ideally, these CAD systems should classify subjects at risk of developing an ulcer from a single thermogram containing the plantar aspect of both feet and, if possible, quantify the severity of the lesion. Previous attempts proposed quantitative parameters for detecting thermal changes based on the varying temperature distribution exhibited by diabetic subjects in comparison with healthy ones [
3]. Recently, the importance of early detection and gaps regarding performance accuracy were brought into focus, resulting in the development of an unsupervised approach for severity stratification [
18]. Several features based on infrared thermography have been proposed in state-of-the-art methods for identifying foot disorders. Additionally, there is interest in determining which features are most relevant for the detection of DFU [
18]; different methods for feature selection are being explored for this purpose.
Feature selection is a field of statistical multivariate and ML methods that reduces the number of input variables. The main objective is to find an optimal subset of the input variable set,
S, that improves, for instance, the classifiers by reducing the amount of redundant input data. This provides classifiers with a better cost–performance ratio. At the same time, it improves the interpretability of the data, which are commonly high-dimensional [
21].
Feature selection methods can be traditionally categorized into the following classes: filter, wrapper, and embedded methods. Filter methods consist of a preprocessing step that removes irrelevant features based on a per-feature relevance score [
21,
22,
23,
24]. Wrapper methods, after defining the search space (all possible variable subsets) and using a model as a black box, carry out a search and evaluation strategy to obtain the optimal selection of variables or features [
21,
25]. These methods are computationally expensive and especially demanding in DL models [
26]. Finally, embedded methods incorporate variable selection during the training process, employing a regularization for reducing the number of variables used during classification [
21]. The least absolute shrinkage and selection operator (LASSO) regularization technique [
27] is the most popular embedded method, whose objective function is constrained by an ℓ1 norm. LASSO is widely used [
28,
29] but its main limitation is the restriction to linear functions.
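As an illustration of the embedded family, the following minimal sketch (Python with scikit-learn) ranks features by the absolute value of the LASSO coefficients; the data, the regularization strength alpha, and the feature count are hypothetical placeholders rather than the configuration used in this study.

```python
# Minimal sketch of LASSO as an embedded feature selector (scikit-learn).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 188)          # hypothetical: 188 thermogram features
y = np.random.randint(0, 2, 200)      # hypothetical: 0 = healthy, 1 = diabetic

lasso = Lasso(alpha=0.01, max_iter=10_000)      # alpha is an illustrative value
lasso.fit(StandardScaler().fit_transform(X), y)

# Features with non-zero coefficients survive; |coefficient| gives the ranking.
ranking = np.argsort(-np.abs(lasso.coef_))
n_selected = int((lasso.coef_ != 0).sum())
print(f"sparse rate: {np.mean(lasso.coef_ == 0):.2f}, top-10: {ranking[:10]}")
```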
In this study, following previous approaches to determine relevant features, variational dropout [
30,
31] was used as an embedded feature selection method to reduce the number of state-of-the-art variables used for DFU detection based on infrared thermograms. In addition, a new approach was designed based on selecting the features in coincidence among the different feature selection methods. The new set of extracted features was employed as input for a support vector machine (SVM) [
32] classifier. The SVM classifier was used as a reference, with the aim of assessing the performance of these features. Finally, for comparison purposes, features previously reported as state-of-the-art were also fed to the classifier.
3. Results
Regarding the evaluation of feature ranking based on variational DL approaches, the implemented architecture is depicted in
Figure 3. As can be observed, the variational feature selector was used only in the first layer, directly after the input. Two dropout layers were added in the following layers to mitigate overfitting, both using the same fixed dropout rate.
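A minimal sketch of how such an architecture can be assembled, assuming PyTorch and a concrete-dropout-style feature selector as the first layer; the layer sizes, the 0.5 rate of the two standard dropout layers, the initial dropout probability, and the temperature are illustrative assumptions, not the exact settings of Figure 3.

```python
import math
import torch
import torch.nn as nn

class ConcreteFeatureDropout(nn.Module):
    """Learnable per-feature dropout probability (concrete relaxation)."""
    def __init__(self, n_features, init_p=0.1, temperature=0.1):
        super().__init__()
        logit = math.log(init_p) - math.log(1.0 - init_p)
        self.p_logit = nn.Parameter(torch.full((n_features,), logit))
        self.temperature = temperature

    @property
    def p(self):
        return torch.sigmoid(self.p_logit)        # dropout probability per feature

    def forward(self, x):
        eps = 1e-7
        p = self.p
        if self.training:
            u = torch.rand_like(x)
            drop = torch.sigmoid((torch.log(p + eps) - torch.log(1 - p + eps)
                                  + torch.log(u + eps) - torch.log(1 - u + eps))
                                 / self.temperature)
            mask = 1.0 - drop                      # relaxed Bernoulli keep-mask
        else:
            mask = 1.0 - p
        return x * mask / (1.0 - p + eps)

n_features = 188                                   # hypothetical feature count
model = nn.Sequential(
    ConcreteFeatureDropout(n_features),            # variational feature selector
    nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(0.5),   # 0.5 is illustrative
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(32, 1),                              # healthy vs. diabetic logit
)
```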
The results presented in this section were extracted using a batch size of 32 samples during the training process, having a minimum batch size of 4 in the last iteration. The ADAM optimizer [
52] was used for training the DL model. The parameters controlling the exponential decay rates for the moment estimates, β1 and β2, were set to 0.9 and 0.999, respectively. The learning rate was set to a different value depending on whether the variational feature selector was the concrete dropout or the variational dropout approach. The number of training epochs was set to 500.
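Continuing the sketch above, the optimizer configuration described in the text would look roughly as follows; the learning rate shown is only a placeholder, since selector-specific values were used.

```python
import torch

# Adam with the decay rates quoted in the text; lr is a placeholder value,
# as a different learning rate was used for each variational feature selector.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
batch_size, num_epochs = 32, 500
```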
In order to avoid many features becoming pruned early during the first iterations of training, a Lagrange multiplier was employed in the regularization term of Equations (4) and (8), so that the model was able to learn a valuable representation of the data in the latent spaces before being heavily penalized. Specifically, this multiplier increases linearly from 0 to 1 with a fixed step size per epoch. This approach is based on the annealing trick for variational autoencoders [34] previously proposed in [53].
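A sketch of this linear warm-up of the Lagrange multiplier: the weight of the regularization term grows from 0 to 1 over the epochs so that features are not pruned too aggressively early on. The number of warm-up epochs (i.e., the step size) is an illustrative assumption.

```python
num_epochs = 500
warmup_epochs = 100                          # hypothetical: multiplier reaches 1 after 100 epochs

for epoch in range(num_epochs):
    lam = min(1.0, epoch / warmup_epochs)    # linear schedule from 0 to 1
    # total loss = task loss + lam * regularization term of Equations (4)/(8)
    # loss = criterion(model(x_batch), y_batch) + lam * regularizer(model)
```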
The performance of the models was evaluated by applying the criterion described in
Section 2.1.2 to obtain the sparse representation of the original input space.
Figure 4 shows the sparse rate during the training phase of the respective model in each iteration of the cross-validation. As can be observed, concrete dropout obtained a sparse rate of around 50%, and the variational dropout approach obtained a sparse rate of around 60% in most cases. This means that, in general, more than half of the features were considered irrelevant. Additionally, variational dropout started to become sparser at an early epoch, whereas concrete dropout required more training epochs. According to the sparse representation, using the test set in each fold, the average accuracies were 89.1% and 85.7% for concrete and variational dropout, respectively. In addition, we noticed that, using the variational parameter as the feature ranking, the most important features were roughly the same in all the experiments. In comparison, the LASSO approach achieved a sparse rate of 44%, using a lower number of features than the DL approaches, with an approximate accuracy of 90%. These results were not reliable for comparison purposes because the models were fully optimized, including the hyperparameters, and the test set was not large enough to rule out possible overfitting.
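The following sketch shows how the sparse rate and the feature ranking can be derived from the learned per-feature dropout probabilities; the 0.95 pruning threshold and the placeholder probabilities stand in for the criterion of Section 2.1.2 and the trained model.

```python
import numpy as np

# p: learned per-feature dropout probabilities (e.g., from the selector layer above)
p = np.random.rand(188)                  # placeholder values for illustration
threshold = 0.95                         # assumed pruning threshold

sparse_rate = np.mean(p > threshold)     # fraction of features considered irrelevant
feature_ranking = np.argsort(p)          # lower dropout probability = more relevant
print(f"sparse rate: {sparse_rate:.2f}")
```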
3.1. Feature Selection
Following the workflow described in
Section 2.4, the most relevant features, listed in
Table 3, were extracted for all the approaches considered: LASSO, random forest, and concrete and variational dropout. For the LASSO approach, the feature ranking was estimated by the absolute value of its coefficient. In relation to concrete and variational dropout, the variational parameter was used as feature ranking (see
Section 2.1.2).
The 10 first features extracted for each approach were considered the most relevant and are highlighted in bold in
Table 3. Therefore, only a small fraction of the total features extracted was considered relevant. Regarding the distribution of these features by angiosome, MPA and LPA presented the largest numbers of features, with a total of nine and six associated features, respectively. The LCA and MCA angiosomes had three and four associated features, respectively. For the entire foot, only two associated features were found.
Furthermore, the first ten features that appeared in all the implemented approaches are listed by rank in
Table 4. The rank of each of these features changed according to the approach employed; thus, the lowest rank achieved by a feature among the different approaches was assigned as its final rank. The search for coincidence was restricted to the first 30 ranked features provided by each approach. However, as observed in
Table 4, the assigned ranks are listed in intervals of ten, and features ranked up to 50 were considered. If only the first 10 features in coincidence were considered, the angiosomes with the most associated features were LPA and the entire foot, with three associated features each, whereas MCA and LCA had two associated features each. No associated features were found for the MPA angiosome in this case.
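A sketch of this coincidence criterion: keep the features that appear in the top-ranked list of every approach and assign each one the lowest rank it achieves in any of them. The feature names and rankings below are hypothetical placeholders, not the actual rankings of Table 3.

```python
# Coincidence criterion sketch: intersect the top-N rankings and keep the best rank.
from functools import reduce

rankings = {
    "lasso": ["LPA_mean", "HSE_L", "ET_R", "MCA_std"],       # ordered, rank 1 first
    "random_forest": ["HSE_L", "LPA_mean", "MCA_std"],
    "concrete_dropout": ["LPA_mean", "MCA_std", "HSE_L"],
    "variational_dropout": ["MCA_std", "HSE_L", "LPA_mean"],
}
top_n = 30
top_sets = [set(r[:top_n]) for r in rankings.values()]
common = reduce(set.intersection, top_sets)

# final rank = lowest (best) rank achieved by the feature in any approach
final_rank = {f: min(r.index(f) + 1 for r in rankings.values() if f in r)
              for f in common}
coincident = sorted(final_rank, key=final_rank.get)[:10]
print(coincident)
```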
Considering the features in coincidence among the different approaches,
Table A1, in
Appendix A, depicts an extended version of the most promising features distributed per angiosome. As observed, the largest number of features in coincidence, a total of four, was associated with the LPA angiosome.
An SVM [
32] classifier was used with all the features as input to provide a reference for quantifying the performance of the extracted features, their rank, and the selected combinations. The SVM aims to generate a hyperplane in a high-dimensional space, induced by a kernel, that separates the data into classes. Initially, using all the available features as input, the SVM classifier was optimized using a randomized search [
54] to obtain the best parameters. As a result, a Gaussian kernel, also known as the radial basis function (RBF) kernel, was used. The RBF kernel has a hyperparameter, γ, that controls the spread of the Gaussian. In addition, the hyperparameter C in SVM controls the misclassification penalty, i.e., the trade-off between the decision boundary and misclassification errors. The best performance, displayed in
Table 5, was achieved with a γ value of 0.0035 and a C value of 7.743.
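A sketch of this optimization step using scikit-learn's RandomizedSearchCV over an RBF SVM; the search distributions, iteration count, and scoring are assumptions, while the reported optimum (γ ≈ 0.0035, C ≈ 7.743) comes from the text.

```python
from scipy.stats import loguniform
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# X, y: feature matrix and healthy/diabetic labels (placeholders from the sketches above)
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_dist, n_iter=100,
                            cv=5, scoring="f1", random_state=0)
search.fit(X, y)
print(search.best_params_)        # e.g., values close to gamma = 0.0035, C = 7.743
```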
3.2. Evaluation of Features by SVM Classifier
Several experimental settings were considered to evaluate the extracted features for the chosen classification task, which was to distinguish between healthy and diabetic patients. In this case, the SVM classifier was not optimized; that is, standard hyperparameters were chosen to offer a fair comparison between the proposed approaches to rank the features. For the different experiments described in this section, γ was set to 0.1, motivated by the low-dimensional space of the input data. In addition, the hyperparameter C was set to 1. This configuration was the same for the different selected features, to avoid biasing the conclusions with settings well fitted to a particular feature set. The average value resulting from five-fold cross-validation, testing the models five times, was used for the metrics estimation depicted in
Table 6, as previously reported [
18].
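A sketch of this evaluation protocol: the same non-optimized SVM (RBF kernel, γ = 0.1, C = 1) is scored with five-fold cross-validation for each candidate feature subset; `selected` stands for whichever top-10 feature index list is being evaluated, and `X`, `y` are the placeholder matrices from the sketches above.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

svm = SVC(kernel="rbf", gamma=0.1, C=1)                 # fixed, non-optimized settings
scores = cross_validate(svm, X[:, selected], y, cv=5,
                        scoring=("accuracy", "precision", "recall", "f1"))
print({name: values.mean() for name, values in scores.items()
       if name.startswith("test_")})
```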
First, the SVM was fed with the ten first features extracted for each approach, LASSO, random forest, and concrete and variational dropout (features highlighted in bold in
Table 3). Second, the first ten features in coincidence, that is, those that appeared in all the approaches and are listed in
Table 4, were also employed to feed the classifier. Finally, to compare the features extracted and the subsequent classification task with those from a previous study [
18], the following ten ranked features were also considered: TCI, four NTR class features (NTR_C, with the class indices defined in [18]), MPA_mean, LPA_mean, LPA_ET, LCA_mean, and the highest temperature.
Table 6.
Notice that, contrary to the setup employed in the present study, in which all features were extracted per foot (L or R), the foot to which the previously mentioned features were associated was not specified in [
18]. Therefore, the mean value between both feet was calculated in order to match these features and offer a fair comparison. In addition, the NTR class definition considerably differed from the one considered previously; thus, the equivalent class, based on temperature values, was used instead. Two of the NTR classes in the original study corresponded to the ranges 31–32 °C and 30–31 °C, respectively [18,48]. In the present study, the closest approximations were the two NTR classes whose ranges coincided with those mentioned above.
Considering the features extracted for each approach and the subsequent classification task, all approaches provided good metric values. However, the best scores, except for recall, were observed for the concrete dropout approach. When the set of relevant features comprised those common to all the approaches, albeit at different rank positions, this experimental setting provided the best recall. Furthermore, the recall values were lower than the other performance metrics. This may have been due to the imbalance between healthy and diabetic samples in the original dataset, because a low recall score is associated with a high number of false negatives. A considerable number of healthy samples was generated for balancing using SMOTE, which performs a linear interpolation between samples. Therefore, recall was penalized because it depended exclusively on the diabetic samples. In this case, considering the precision–recall tradeoff, a lower recall was preferred due to the associated implications.
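For reference, a minimal sketch of the balancing step mentioned above, assuming the imbalanced-learn implementation of SMOTE, which synthesizes minority-class samples by linear interpolation between a sample and one of its nearest minority-class neighbors; variable names are placeholders.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# X, y: original (imbalanced) feature matrix and labels
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_balanced))           # both classes now have the same count
```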
As shown in
Table 6, the performance of all the models, when using either the corresponding first 10 features or the first 10 features in coincidence, was quite similar to the reference values (shown in
Table 5). However, the classical LASSO approach and DL-based concrete dropout exhibited a slightly better performance with only 10 features.
4. Discussion
Several approaches were considered to select relevant features used for DFU detection based on infrared thermograms. The classical approaches, LASSO and random forest, were tested against two innovative approaches based on DL, concrete and variational dropout. The outputs of these approaches were analyzed to extract a new set of features considered relevant for classifying whether a thermogram corresponded to a healthy or a diabetic person. The proposed approaches exhibited promising results, particularly the concrete dropout approach.
Regarding the performance of the traditional approaches in comparison with that of the DL-based ones, LASSO provided results close to those of concrete dropout according to the F1 score, although the latter exhibited slightly better performance with respect to the established reference values. However, LASSO is limited to linear solutions, while concrete dropout does not suffer from this limitation. No fine-tuning of the models was implemented to increase their respective performance, because a comparison between the extracted input features was intended. Thus, when a few features were used, i.e., 10, both methods produced promising performance. In this particular case, the LASSO approach would be an easy-to-implement and faster alternative to concrete dropout, as comparable performance was achieved. Furthermore, considering the most relevant features of each approach, six of the selected features coincided between these two approaches, as shown in
Table 3. Thus, the similarity in performance may have been due to this coincidence of features.
For further comparison, these two approaches were optimized considering 10 input features. The hyperparameters considered for fine-tuning were the kernel (RBF, linear, or polynomial), the degree of the polynomial when the corresponding kernel was selected, γ, and C (data not shown). For the LASSO approach, the best model used a third-degree polynomial kernel with a γ value of 0.1 and a C value of 2.2. The best concrete dropout settings were achieved with the RBF kernel, a γ value of 0.3, and a C value of 3.8. The F1 scores were approximately 0.89 and 0.90 for LASSO and concrete dropout, respectively. Thus, the performance of the LASSO approach closely matched the reference values in Table 5. Additionally, increasing the number of input features from 10 to 50, in combination with an optimized SVM, slightly increased the performance of the LASSO approach, with an F1 score of roughly 0.91. However, the performance of concrete dropout decreased to an approximate F1 score of 0.87. In this case, both approaches used an RBF kernel, with γ being 0.06 and 0.04 and C being 1.2 and 1.5 for LASSO and concrete dropout, respectively. The improvement observed with LASSO when the number of features was increased might have been produced by the oversampling based on SMOTE, which generated 77 new samples by linear interpolation between samples from the minority class. This process might have added correlation to the dataset, making it more likely that LASSO could find features with a high degree of correlation. Further analysis is planned to confirm this hypothesis. Regardless, concrete dropout is less sensitive to these problems due to its nature. In any case, these results were achieved by cross-validation, testing with around 48 samples per fold. In our previous study [
55], using the INAOE dataset, we showed that the traditional classification metrics were not reliable due to the small amount of data in the test set, which might be a nonrepresentative subset to evaluate the model. On the other hand, the decrease observed in the performance of concrete dropout when the number of features was increased seemed plausible due to the implicit noise added by the extra features.
A previous study [
18] was considered as a reference to quantify the performance of the extracted features for the classification task. This reference study employed a stacking classifier using gradient boosting, XGBoost, and random forest, considering previously ranked features as input: TCI, two NTR class features (NTR_C), MPA_mean, LPA_mean, LPA_ET, LCA_mean, and the highest temperature. The best classification performance achieved was reported as approximately 94% accuracy, precision, sensitivity, and F1 score. Using these proposed features, the values reported in the present study, around 77% in the F1 score, are considerably lower than those reported previously (see
Table 6). However, although the definition of the features was slightly modified and the classifier employed considerably differed, the same input features exhibited a roughly 15% lower performance in comparison with the features extracted in this study. This difference may be explained by their use of an extended dataset as well as their proposed relabeling of the INAOE dataset to distinguish between mild, moderate, and severe cases in the diabetic foot domain, which was not tested in this study. Another state-of-the-art work, recently reported and also employing the INAOE dataset, extracted features by clusters instead of by angiosomes [
56]. In addition, a new feature, the cluster thermal index (CTI), was proposed, which provides a measure of the temperature deviation between a subject and the control group, considering not only the temperature difference between the clusters but also the range of temperatures in the control group. In this case, several models were provided to classify healthy and diabetic subjects. Multiclass classification was employed to refine the stratification of diabetic patients using logistic regression, SVM, and K-nearest neighbors. The results reported for the binary classification with SVM are comparable to those reported here, the accuracy being approximately 86%. Furthermore, using only 50 thermograms from the INAOE database and extracting texture features, the reported accuracy of the SVM classifier was roughly 96% [
57]. In addition, employing a private dataset composed of 24 healthy and 36 diabetic subjects, a binary classification using SVM achieved 95% accuracy [
17]. These values are quite superior to those reported in this paper. However, a true comparison cannot be drawn because the set of features employed considerably differed.
Before feature extraction, our initial study focused on establishing a balanced dataset of diabetic and healthy subjects by fusing a publicly available but unbalanced dataset [
41] with a local dataset composed of healthy subjects. Furthermore, the preprocessing of the thermograms was also carefully considered to extract the features previously reported [
18,
41]. This set of state-of-the-art features included the highest temperature, TCI, HSE, ET, NTR, and several statistical variables, such as the mean value, associated with the entire foot as well as with the defined angiosomes (see
Section 2.3). Notably, in the present study, these features were extracted per foot and considered separately, unlike previous studies in which an average between the R and L feet was assumed due to the lack of specific information on the procedure. For this reason, the number of features extracted is considerably higher than in previous reports [
18], 188 versus 37 features.
Of the most important state-of-the-art features, TCI is especially relevant. The TCI quantifies the thermal change independently of the observed distribution, and a difference of 1 °C is considered enough to indicate a significant difference between the proposed classes [47]. In this study, the reference values used to calculate the TCI were not modified with respect to the original study, despite more healthy subjects being considered in the extended dataset. Regardless, none of the features related to TCI were considered relevant by all the implemented approaches or within the features in coincidence.
Regarding NTR, the number of classes based on the thermogram temperatures was extended to 10 because the range of relevant temperatures considered was increased to 18–37 °C, compared with 25–35 °C in the original study [
48]. These modifications were motivated by two considerations. First, the original study excluded temperature values characteristic of healthy patients to avoid the background, because foot segmentation was not available; however, the private thermograms from healthy volunteers, fused with the INAOE database for balancing purposes, showed temperature distributions below 28 °C for many subjects. Second, these private thermograms had been previously segmented; thus, excluding other heat sources within the background was not required. As a result, we did not discard any NTR class, as was proposed in [
48] for removing the background. By extending the temperature range, a better-performing classifier was expected. However, most of the features related to the NTR were not considered relevant and, similar to what was observed for the TCI, none appeared among the features in coincidence among all the approaches. Furthermore, according to the F1 score, the best-performing approach was concrete dropout, and not a single feature related to the NTR was among those it considered relevant.
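A sketch of how the extended NTR features can be computed by binning the pixel temperatures of a segmented foot into 10 classes spanning 18–37 °C; the equal-width bins and the placeholder temperature array are assumptions made for illustration.

```python
import numpy as np

temps = np.random.uniform(20.0, 36.0, size=5000)   # placeholder: pixel temperatures of one segmented foot
bins = np.linspace(18.0, 37.0, 11)                 # 10 classes covering 18-37 °C (assumed equal width)

ntr_counts, _ = np.histogram(temps, bins=bins)
ntr_fraction = ntr_counts / temps.size             # one NTR feature per temperature class
```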
In contrast to the features described above, in the present study, among the features specially designed for DFU, only HSE and ETD seem to be relevant. Furthermore, as described, the sole of the foot was divided into four different angiosomes, and their individual features were extracted. The extraction demonstrated an unbalanced significance of the angiosomes; therefore, the division of the foot into angiosomes seemed to be a determinant factor for feature extraction and played an important role in the analysis. In particular, the LPA angiosome appeared to be the most predictive, with more associated features than the other angiosomes, followed by LCA (see
Table A1).
Perhaps the extended dataset employed in this study, based on two different population samples, added a varying contribution of pathogenic factors that led to variable outcomes [
58]. The present study can be considered akin to a multicenter study, providing a degree of generalization for the classification task at hand; therefore, the set of relevant features may differ significantly from previous studies. Further studies are required with a larger dataset, composed of a balanced number of diabetic and healthy subjects and preferably drawn from different population samples, in order to continue generalizing the existing approaches. Furthermore, we will continue to study the importance of the different angiosomes and to explore new, interesting features that appear within the state-of-the-art methods. Most importantly, the assessment of their predictive value for classification will also be an area to explore in detail.