As stated above, the first goal of the experiments undertaken was to evaluate and compare the behavior of the five individual classifiers. The second goal was to explore the performance of the six meta-classifiers, including a comparison of their performance with that of the individual classifiers.
5.1. Experiments with Individual Classifiers
The first set of results represents the T1 accuracy and is summarized in Table 4. Here, as defined in Section 2, $\mathcal{Y}^{tr}$ represents the set of training class labels, $\mathcal{Y}^{ts}$ denotes the set of testing class labels, while $H$ represents the harmonic mean of the accuracies obtained on the two (as described in Section 4.3).
The first group of rows reproduces the results reported in [11] (where only the T1 measure was used to assess performance), whereas the second group of rows displays the results obtained during the experiments undertaken (in-house implementation). Moreover, for the in-house implementation of the SAE classifier, the first row of results represents the accuracy obtained from the encoder, whereas the second row represents the accuracy obtained from the decoder. Finally, for each classifier and for each dataset, the "best" results are marked in bold font.
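For reference, with $acc_{\mathcal{Y}^{tr}}$ and $acc_{\mathcal{Y}^{ts}}$ denoting the average per-class accuracies obtained on the training and testing class labels, respectively (the notation follows the convention commonly used in the GZSL literature, e.g., [11], and is used here for illustration), the harmonic mean $H$ is computed as

$$ H = \frac{2 \cdot acc_{\mathcal{Y}^{tr}} \cdot acc_{\mathcal{Y}^{ts}}}{acc_{\mathcal{Y}^{tr}} + acc_{\mathcal{Y}^{ts}}} $$

which rewards classifiers that perform well on both sets of classes simultaneously.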
The following observations can be made based on the results presented in Table 4, where the experiments based on the in-house implementation are compared with the results presented in [11]. Note that, if the difference between two results falls within a small margin, they are treated as indistinguishable. This is stipulated since, as stated earlier, the aim of this investigation was not to fine-tune the hyperparameters, but to fairly compare the performance of different approaches to GZSL.
In-house implementation of:
DeViSE—obtained an increase in accuracy (over the results reported in [11]) for the AWA1 dataset.
ALE—achieved an increase in accuracy for the aPY dataset.
SJE—delivered an increase in accuracy for the aPY dataset.
ESZSL—produced results similar to those in [11] on all five datasets.
SAE—achieved increases in accuracy for the CUB, AWA1, AWA2, aPY and SUN datasets.
In summary, the results for the in-house implementation were slightly better than the results reported in [11]. However, the results were relatively close (within 10%), and the difference probably originated from differences in hyperparameter tuning. Overall, the two sets of results support each other as representing a fair estimate of the current state-of-the-art accuracy of GZSL solvers for the T1 performance measure.
With respect to the results originating from the in-house implementation of the individual classifiers (see Section 4.1), reported in the bottom part of Table 4, it can be seen that: (i) ALE produced the highest accuracy values on the CUB, SUN and aPY datasets, whereas DeViSE achieved the highest accuracy values on both the AWA1 and AWA2 datasets; (ii) none of SJE, ESZSL or SAE outperformed the remaining classifiers on any of the five datasets; (iii) for the harmonic mean results on the CUB dataset, only DeViSE, ALE and SJE reached comparatively high values; (iv) all of the individual classifiers performed considerably worse on the aPY dataset than on the other datasets; this result is consistent with that reported in [12,13] for the ZSL settings, and it also corresponds to the pattern of results reported in [11]; and, finally, (v) there was a significant drop in accuracy across the board compared to the results from the ZSL experiments reported in [12,13]. The reason for this drop was the availability of the seen classes, taken from the source domain, during the testing phase in the target domain; in other words, the fact that the seen class labels form part of the test-time label space. Here, the seen classes acted as a "deterrent" for the individual classifiers, since the classifiers were trained exclusively to recognize the seen classes from the source domain, while no unseen class from the target domain was present during the training phase. This gave the seen classes "an edge" over their unseen counterparts, i.e., an increased bias during the prediction process, which resulted in the seen classes being selected more often than the unseen ones. As a consequence, the overall accuracy of the individual classifiers was reduced.
The remaining experimental results, reported below, could not be compared with prior work since, to the best of our knowledge, no fully comparable results, obtained for the same approaches, the same datasets and the same performance measures, exist in the literature. The first such set of results concerns the T5 accuracy and is reported in Table 5. Here, again, the "best" results are marked in bold font.
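As a reminder, the T5 (top-5) accuracy counts a prediction as correct when the true class appears among the five highest-scoring candidate classes. A minimal sketch of this computation is given below; the function and array names are illustrative and do not refer to the actual implementation used in the experiments.

import numpy as np

def top_k_accuracy(scores, labels, k=5):
    # scores: (n_samples, n_classes) array of class compatibility/confidence scores
    # labels: (n_samples,) array of ground-truth class indices
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k highest-scoring classes
    hits = (top_k == labels[:, None]).any(axis=1)    # is the true class among them?
    return hits.mean()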
The following observations follow from the results reported in Table 5: (A) DeViSE produced the highest performance on the AWA1, AWA2 and aPY datasets; (B) ALE achieved the best overall performance on both the CUB and SUN datasets; (C) all of the reported results remained relatively low; (D) the SUN dataset appeared to be the "hardest" dataset when the T5 performance measure was applied; and, finally, (E) SAE performed the worst. Most of the values obtained by SAE for the T5 measure were relatively similar to one another across all five datasets; this should be taken into account when comparing the T5 values of SAE with its values for the T1 measure.
The next set of results relates to the performance of the five classifiers, measured according to the LogLoss measure. The results are summarized in Table 6 (note that, here, the lowest value is the "best"; the best results are marked in bold font).
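For completeness, LogLoss (cross-entropy) penalizes confident but incorrect predictions, and lower values indicate better performance. A minimal sketch is shown below, assuming that each classifier outputs a per-class probability distribution; the clipping constant is a common numerical safeguard rather than a value taken from the experimental setup.

import numpy as np

def log_loss(probs, labels, eps=1e-15):
    # probs: (n_samples, n_classes) array of predicted class probabilities
    # labels: (n_samples,) array of ground-truth class indices
    probs = np.clip(probs, eps, 1.0)                                # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))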
Comparing the results in Table 6, the following observations can be made: (a) ESZSL achieved the lowest (best) values on all five datasets, i.e., CUB, AWA1, AWA2, aPY and SUN; and (b) when evaluated from the LogLoss perspective, the SUN dataset again appeared to be the most difficult to deal with.
Finally, in
Table 7, the performance of the five individual classifiers is compared in terms of the F1 measure. Here, the “best” (highest) values are marked in bold font.
The following observations can be made on the basis of the results reported in Table 7: (1) the ALE algorithm achieved the highest performance on the CUB, aPY and SUN datasets; (2) DeViSE achieved the highest values on the AWA1 and AWA2 datasets; and (3) the aPY dataset was the hardest to deal with when the F1 measure was applied.
Overall, on the basis of all the experiments performed, conclusions similar to those reported in [12,13] can be drawn: (i) different performance measures promoted different GZSL approaches; (ii) the aPY and SUN datasets were the most difficult to classify, depending on the performance measure being used (this differs from the ZSL setting, where only the aPY dataset was found to be difficult); and (iii) none of the individual classifiers can be considered "the best". Moreover, none of the classifiers delivered particularly good results, regardless of the dataset and the performance measure used to evaluate them.
The question of identifying the best overall approach was addressed in [13] by introducing a competitive scoring scheme. Specifically, each of the five classifiers was assigned a score from 5 to 1 for each dataset and each performance measure, depending on its result (the best performance received 5 points, while the worst received 1 point); the points were then summed. The same approach to representing the "robustness" of the classifiers was applied here; the results are displayed in Table 8. The "best" results for each dataset, and overall, are marked in bold.
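To make the scoring scheme concrete, the sketch below ranks the classifiers within each (dataset, performance measure) category and sums the resulting points per classifier; the function and variable names are illustrative, and the higher_is_better flag would be set to False for measures such as LogLoss.

def competitive_scores(results, higher_is_better=True):
    # results: {classifier_name: [value per (dataset, measure) category]}
    names = list(results)
    totals = {name: 0 for name in names}
    n_categories = len(next(iter(results.values())))
    for i in range(n_categories):
        ranked = sorted(names, key=lambda n: results[n][i],
                        reverse=higher_is_better)            # best classifier first
        for points, name in zip(range(len(names), 0, -1), ranked):
            totals[name] += points                           # best gets N points, worst gets 1
    return totals

With five classifiers, this assigns 5 points to the best and 1 point to the worst result in every category, matching the scheme described above; with six or eleven classifiers (as used later), the top score becomes 6 or 11 points, respectively.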
The observations from
Table 8 can be summarized as follows: (A)
DeViSE performed best for the
AWA1 and
AWA2 datasets and also obtained the best overall result (78 points); (B)
ALE performed the best for the
CUB,
aPY, and
SUN datasets; and (C)
SAE performed the worst for almost all datasets, as well as overall.
These results differ from those reported in [12,13], where the best overall score was achieved by the ESZSL classifier. Overall, for the GZSL problem, if specific characteristics of the dataset are not known beforehand, the DeViSE approach may be the one to try first. However, the results obtained strengthen the view that much more work is needed to develop a deeper understanding of the relationships between datasets, approaches and performance measures with respect to the GZSL problem.
5.2. Performance of Meta-Classifiers
The second part of the investigation concerns the experimental evaluation of the performance of meta-classifiers. Here, the T1 accuracy results are summarized in
Table 9. The “best” results are marked in bold font.
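The simplest of the considered meta-classifiers, MV, combines the individual classifiers' predictions by voting. Assuming plain majority voting over the predicted labels (each individual classifier contributing a single class index per test sample), a minimal sketch is given below; the tie-breaking behavior shown (first-encountered label wins) is illustrative rather than a description of the actual MV implementation.

from collections import Counter

def majority_vote(base_predictions):
    # base_predictions: one list per individual classifier, each holding the
    # predicted class index for every test sample
    fused = []
    for sample_votes in zip(*base_predictions):                   # all votes for one sample
        fused.append(Counter(sample_votes).most_common(1)[0][0])  # most frequent label wins
    return fused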
The following key information can be derived from the results reported in Table 9: (i) GT achieved the highest performance for the CUB, AWA1 and aPY datasets; (ii) Con achieved the highest overall T1 accuracy for the AWA2 dataset; (iii) MV achieved the highest T1 accuracy for the SUN dataset; (iv) these results imply that none of MDT, DNN or Auc achieved the best score for any single dataset; (v) nonetheless, all the results remained relatively low; (vi) all the results obtained were higher than those reported in Table 4; (vii) the largest difference was observed for the ESZSL classifier on the AWA2 dataset (69.07%), while the smallest was observed for the SAE classifier on the CUB dataset (0.12%); and (viii) the SUN dataset was found to be the most difficult when the performance of the meta-classifiers was measured in terms of the T1 accuracy.
It is important to note that the T5 accuracy was not reported for the meta-classifiers, since some of them, e.g., DNN, return the output of one of the individual classifiers. Hence, since the experiments involved five individual classifiers, the T5 measure applied to such a meta-classifier would always reach 100%, regardless of the correctness of the output.
The results of applying the F1 measure to assess the performance of the meta-classifiers are summarized in Table 10 (results marked in bold font are the "best").
The following observations can be derived from Table 10: (a) MV achieved the highest values for the CUB and SUN datasets; (b) Con achieved the highest performance for the AWA1 and AWA2 datasets (0.32 and 0.13, respectively); (c) DNN achieved the highest score for the aPY dataset (0.19); (d) none of MDT, GT or Auc achieved the best score for any dataset; (e) all of the reported results remained low; (f) the obtained results were comparable to, but somewhat worse than, the results reported in Table 7; and (g) the AWA2 dataset was the most difficult when the F1 measure was used.
Comparing the results obtained when applying the F1 accuracy score to the meta-classifiers with those obtained for the individual classifiers, it can be seen that the meta-classifiers performed better than the individual classifiers on the AWA1 and aPY datasets (0.32 and 0.28 compared to 0.26 and 0.1, respectively). At the same time, the individual classifiers obtained better results on the CUB, AWA2, and SUN datasets (0.37, 0.17 and 0.29 compared to 0.33, 0.13 and 0.28, respectively).
Using the competitive point distribution method described above (previously applied to the individual classifiers), the combined performance results calculated for the meta-classifiers are reported in Table 11 ("best" results marked in bold font). Here, the T1 and F1 performance measures were combined, and the top scorer in a given category received six points, since six meta-classifiers were compared.
The following can be noted on the basis of the results reported in Table 11: (A) GT performed the best for the CUB and AWA1 datasets and, together with Con, obtained the best overall result (a total score of 46 points each); Con, in turn, achieved the best score for the AWA2 dataset; (B) GT and DNN performed the best for the aPY dataset; (C) MV performed the best for the SUN dataset; and (D) Auc performed the worst, both on the individual datasets and in terms of combined performance.
Finally, the same competitive score combination method was applied jointly to both the meta-classifiers and the individual classifiers. For obvious reasons, only the T1 and the F1 accuracy measures were taken into account; since 11 classifiers were compared, the top score was 11 points. The results are displayed in
Table 12 (bold font marks the “best” results).
Observations that can be made from the results reported in
Table 12 are detailed below:
For the individual classifiers–ALE obtained the highest score on both the CUB and SUN datasets (18 and 22), as well as the highest overall score (88). ALE was also the only individual classifier to achieve a score of 20 or higher on any single dataset. DeViSE, ALE and SJE obtained overall scores of 60 or higher, whereas ESZSL and SAE scored below this threshold.
For the meta-classifiers–Con obtained the highest score on both the AWA1 and AWA2 datasets (20 and 21); DNN and GT both obtained the highest score for the aPY dataset; and each of MV, DNN, GT and Con reached a score of 20 or higher on at least one dataset. All the meta-classifiers achieved an overall score of 60 or higher, with half achieving overall scores above 70.
When comparing individual and meta-classifiers, half of the meta-classifiers obtained total scores higher than 70, whereas less than half of the individual classifiers did.
None of the meta-classifiers obtained total scores of less than 60, while three of the five individual classifiers did.
Only ALE obtained a score of 20 or above on an individual dataset, whereas four of the six (i.e., two-thirds) of the meta-classifiers reached this level.
Unlike the conclusions presented in [
12,
13] for the ZSL problem setting, where the simpler meta-classifiers (e.g.,
MV) gave better results, in the case of the GZSL problem, the more complex meta-classifiers (e.g., DNN) delivered better results.
Overall, it can be concluded that, in GZSL settings, the selection of the "best approach" is very much context-dependent. With "inside knowledge" of the characteristics of the dataset, and/or with the aim of obtaining fine-tuned results for a given performance measure, the best results can be achieved using one of the individual classifiers. On the other hand, when the goal is to solve the problem while avoiding the "worst-case scenario", the use of meta-classifiers is preferable.