0% found this document useful (0 votes)

13 views64 pages

Robust and Explainable Framework To Address Data Scarcity in Diagnostic Imaging

Uploaded by

avleenkaur

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views64 pages

Robust and Explainable Framework To Address Data Scarcity in Diagnostic Imaging

Uploaded by

avleenkaur

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 64

Robust and Explainable Framework to Address

Data Scarcity in Diagnostic Imaging

Zehui Zhao1 , Laith Alzubaidi2*, Jinglan Zhang1 , Ye Duan3 ,
Usman Naseem4 , Yuantong Gu2
1 School of Computer Science, Queensland University of Technology, 2
George Street, Brisbane, 4000, Queensland, Australia.
2* School of Mechanical, Medical, and Process Engineering, Queensland
arXiv:2407.06566v1 [cs.CV] 9 Jul 2024

University of Technology, 2 George Street, Brisbane, 4000, Queensland,

Australia.
3 School of Computing, Clemson University, Clemson, 29631, South

Carolina, USA.
4 School of Computing, Macquarie University, Sydney, 2109, New South

Wales, Australia.

*Corresponding author(s). E-mail(s): [email protected];

Contributing authors: [email protected];
[email protected]; [email protected];
[email protected]; [email protected];

Abstract
Deep learning has significantly advanced automatic medical diagnostics and
released the occupation of human resources to reduce clinical pressure, yet the
persistent challenge of data scarcity in this area hampers its further improve-
ments and applications. To address this gap, we introduce a novel ensemble
framework called ‘Efficient Transfer and Self-supervised Learning based Ensem-
ble Framework’ (ETSEF). ETSEF leverages features from multiple pre-trained
deep learning models to efficiently learn powerful representations from a lim-
ited number of data samples. To the best of our knowledge, ETSEF is the
first strategy that combines two pre-training methodologies (Transfer Learning
and Self-supervised Learning) with ensemble learning approaches. Various data
enhancement techniques, including data augmentation, feature fusion, feature
selection, and decision fusion, have also been deployed to maximise the efficiency
and robustness of the ETSEF model. Five independent medical imaging tasks,
including endoscopy, breast cancer, monkeypox, brain tumour, and glaucoma

1
detection, were tested to demonstrate ETSEF’s effectiveness and robustness. Fac-
ing limited sample numbers and challenging medical tasks, ETSEF has proved
its effectiveness by improving diagnostics accuracies from 10% to 13.3% when
compared to strong ensemble baseline models and up to 14.4% improvements
compared with published state-of-the-art methods. Moreover, we emphasise the
robustness and trustworthiness of the ETSEF method through various vision-
explainable artificial intelligence techniques, including Grad-CAM, SHAP, and
t-SNE. Compared to those large-scale deep learning models, ETSEF can be
deployed flexibly and maintain superior performance for challenging medical
imaging tasks, showing the potential to be applied to more areas that lack training
data.

Keywords: Transfer learning, Self-supervised learning, Ensemble learning, XAI

1 Introduction
Deep Learning (DL), a cutting-edge branch of Machine Learning (ML) technology,
has demonstrated remarkable success in various medical imaging modalities, includ-
ing radiology images [1–3], dermatology and ophthalmology images [4–6], and 3D
images derived from multiple modalities [7, 8]. These state-of-the-art DL models have
showcased their ability to perform tasks that expert clinicians traditionally perform,
including disease classification, segmentation, localisation, and diagnosis. The integra-
tion of DL into medical systems holds the promise of enhancing diagnosing efficiency
and releasing the pressure of healthcare services, thus improving accessibility and
coverage in public healthcare [9].
However, despite the significant strides made in DL, the persistent challenge of
data scarcity impedes its widespread application in challenging medical settings. DL
algorithms rely on complex multi-layer architectures with millions of parameters to
capture intricate data distributions, which are difficult to acquire in many medical
domains [10]. The scarcity of training data in this field can be attributed to several fac-
tors: (i). The high cost associated with medical imaging tests and the expensive process
of annotating medical data through specialised expertise [11]. (ii). concerns regarding
patient privacy, which may hinder the sharing and reuse of medical datasets [12, 13].
(iii). The diverse modalities and high-resolution nature of medical images demand
meticulous data collection and reduce data generalisation [14]. (iv). The infrequent
occurrence of rare diseases limits the availability of the corresponding data samples
[15]. Consequently, these factors lead to data scarcity issues and pose a significant
barrier to the further advancement and application of DL techniques in the medical
field, increasing the risk of overfitting and unexpected performance degradation during
model deployment [16].
In response to the challenge of data scarcity, one simple approach is to invest more
resources in obtaining expert annotations and clinical samples while developing tai-
lored models from scratch to better align with the target data distribution [17, 18].

2
However, these solutions entail significant financial and temporal burdens, poten-
tially limiting the practicality and lifespan of DL models in real-world scenarios [19].
Alternatively, researchers have explored practical solutions to enhance model perfor-
mance, which can be broadly categorised as data-level and model-level approaches.
For instance, data augmentation serves as a popular data-level solution and addresses
data scarcity by generating synthetic data to enlarge the training data size [20, 21],
while pre-trained learning methods represent another popular direction that aims at
enhancing the model’s knowledge base and feature extraction capabilities [22].
In this study, we focus on pre-trained learning and ensemble learning methods.
Transfer Learning (TL) stands out as a widely adopted approach among all the pre-
trained learning methodologies. The normal TL model aims to improve the learning of
target domain data by transferring knowledge learnt from different but related source
domains [23]. This method allows users to reuse the available trained models and
existing datasets to enhance the model training process instead of spending time and
money to design a new model from scratch [24]. So far, much research has proved the
TL method’s value in improving the target model’s training efficiency, performance,
generalisation ability, and data reuse rate [25–27]. Despite its successes, the TL knowl-
edge transfer process faces a significant risk called negative transfer when transferring
knowledge between dissimilar domains [28]. The dissimilarity between domains can be
represented as a difference in feature distribution or representative symptoms, and this
difference may cause heavy degradation in the model’s performance and robustness.
Furthermore, existing TL methods are mainly based on supervised learning and rely
heavily on human instructions, which require additional support from massive source
data and may limit their applicability and scalability in areas lacking source domain
data [29].
In contrast, self-supervised learning (SSL) has recently emerged as a promis-
ing unsupervised pre-training approach and has attracted the attention of many
researchers. Unlike TL methods, SSL methods utilise unlabelled data to extract rich
features through well-designed pretext task training, thus reducing data preparation
and annotation efforts [30]. Furthermore, by using augmented training data as pseudo-
labels, SSL facilitates simulated supervised training to help the model achieve better
performance behaviour [31]. In recent years, the SSL method has achieved a series of
successes in the medical field and can even compete with or overpass the TL-based
algorithms in some scenarios [32]. However, although SSL has shown promising poten-
tial in the medical domain, it is also beset by challenges, including negative transfer
risk such as TL methods, high computational consumption compared to other super-
vised learning paradigms [33], the unreliability of lacking strong supervision signals,
and the weak ability to capture changeable characteristic information through pretext
task training [34].
To address the inefficiencies and risks associated with existing pre-trained learn-
ing methods, researchers attempted to develop multiple publicly available large-scale
datasets as source domains to support pre-trained learning across various fields. A
typical sample is ImageNet [35], a publicly available dataset containing 14 million
well-annotated image data collected from general areas, including animals, humans,
cars, plants, etc. Despite the success of ImageNet-based pre-trained models in many

3
application areas, there have been concerns regarding the suitability of ImageNet
and other generic source datasets in medical applications due to the differences in
feature distributions between general images and professional medical images. As a
response, Tan and Niu’s team [36, 37] has proposed a multi-level pre-training strat-
egy by pre-training the TL model in generic datasets first and retraining it with an
intermediate source domain that is close to the target domain, bridging the distribu-
tion gap between the large-scale generic source domain and professional target domain
to avoid negative transfer. Furthermore, Azizi et al. [38, 39] developed a unified pre-
training strategy to combine TL-based weights and SSL pre-training paradigms to
enhance model capability against unseen data. Taking advantage of a large-scale non-
medical knowledge base and retraining the model on an intermediate medical source
domain, their proposed method has shown superior performance and generalisation
against unseen medical data. However, their work lacks adequate visual analysis and
explainable evidence to support the decision-making process, risking biased predic-
tions based on unrelated representations. To address this issue, we employed ensemble
learning techniques combined with DL and pre-trained algorithms, as previous research
shows it can effectively enhance model performance and robustness [40–42]. Utilis-
ing multiple base models trained on the same tasks and complementary information
from each model, the ensemble learning strategy is able to expand the weak models’
generalisation ability and strengthen them into a more powerful integrated model.
Inspired by the pressing need for a robust and data-efficient approach to medi-
cal DL, we build upon recent advancements in this field to introduce the ETSEF, a
framework that proposes an ensemble strategy based on multiple pre-training meth-
ods to craft powerful models for medical imaging tasks (see Fig. 1 for an overview
of ETSEF). Through extensive experimentation, we demonstrated the effectiveness
and robustness of ETSEF across various clinical scenarios. The primary objective of
this framework is to learn reusable and transferable representations and enhance task
learning in data-scarced clinical environments. Our approach leverages intermediate
medical source domains to support downstream task learning and constructs robust
base feature extractors using limited target samples. Subsequently, the ensemble strat-
egy selects and merges task-relevant features from each base model and employs
higher-level ML classifiers such as Random Forest, eXtreme Gradient Boosting, and
Naive Bayes as feature classifiers to refine the understanding of the target feature
distribution and facilitate informed predictions. Finally, a voting classifier aggregates
the weighted predictions of the preceding feature classifiers to arrive at a consensual
decision.
In contrast to previous pre-training methods that focus on individual medical
imaging domains such as chest-X-ray [43, 44], brain Computed Tomography (CT)
scan [45–47], Magnetic Resonance Imaging (MRI) interpretation [48] etc, our method
emphasises generalisation ability across various medical imaging domains. To our
knowledge, ETSEF stands as the first instance of an ensemble learning framework that
combines two independent pre-training methodologies. This approach harnesses mul-
tiple DL architectures and pre-training techniques to learn comprehensive symptom
representations from the target sample, bridging the gap between disparate domains
and minimising the need for customised model designs on different data domains.

4
Notably, our strategy follows Azzi and Niu’s work [37, 39], leveraging generic source
pre-trained weights as a knowledge base and further pre-trained the base models
with intermediate medical source data to enhance the base model’s performance and
avoid potential negative transfer risks. Furthermore, our experiment also encompasses
various ensemble learning techniques to offer more insight and recommendations for
developing a powerful and efficient ensemble model.
Additionally, we prioritise the explainability of our framework, recognising the crit-
ical role of model security in medical artificial intelligence model learning. To enhance
transparency, we employed Gradient-weighted Class Activation Mapping (Grad-CAM)
[49] and SHapley Additive exPlanation (SHAP) [50] techniques to our pre-trained base
models to provide visual explanations and supportive evaluation. With the support
of these explainable artificial intelligence techniques, we highlight the model’s atten-
tion and focus areas, offering visual insights into the model’s decision-making process.
Furthermore, we utilised t-SNE analysis [51] to assess the feature classification per-
formance. With the help of the visualised feature maps, we are able to evaluate the
extracted feature quality and gain more understanding of final predictions.

Fig. 1 Overview of the ETSEF approach workflow ETSEF initialises the base model by lever-
aging pre-trained knowledge from ImageNet using supervised learning. Next, two separate pre-trained
models utilise TL and SSL methods to learn representations from the same intermediate data. These
models are then fine-tuned for specific downstream medical imaging tasks. Subsequently, features are
extracted from the models for concatenation and normalization. We use five ML classifiers to learn
the feature distribution and make predictions. Finally, a voting classifier aggregates predictions and
makes a final decision based on the most consistent outcomes.

In this section, we introduce our motivation for conducting this study and pro-
vide an overview of the ETSEF method. The second section presents our strategy

5
design, clinical environments, performance comparison with baselines and state-of-
the-art models, and visual analysis. The third section discusses our method design,
while the fourth section details the implementation and literature background of the
ETSEF. The final section concludes our findings and outlines future directions.

2 Results
In addition to outlining the overall workflow, this section offers valuable insights into
the ETSEF method, including the methodology and component design, the clinical
data environment setup, the model performance analysis, and the model predictions’
visual analysis. Both ensemble learning and pre-trained models are essential compo-
nents of our framework, and our detailed exploration of these topics serves to illuminate
the framework’s design while providing a comprehensive guide for leveraging our
findings.

2.1 An Ensemble Framework that Combines Two Pre-trained

Learning Methods
The primary objective of our method is to develop a model that achieves high per-
formance across various medical imaging tasks with limited training data. To reduce
the negative influence of data scarcity, we utilised multiple pre-trained models as base
models and combined them into a robust ensemble model. These base models are
derived from two popular pre-training methods: transfer learning and self-supervised
learning. In addition, we incorporated feature fusion, feature selection, and weighted
voting techniques in the ensemble learning process to ensure the performance and
stability of the final model.
Supervised transfer learning. Pre-trained supervised TL models have demon-
strated their effectiveness with various medical imaging tasks [52, 53], here we follow
the theoretical groundwork established by Azizi et al. [39] to leverage the publicly
available pre-trained representations as knowledge foundation and subsequently use
the double fine-tuning technique to train our predictors to map related labelled medi-
cal data before finally fine-tuning on the target domain. Notably, one major advantage
of double fine-tuning is the flexibility in selecting intermediate source data, which
need not strictly adhere to the same disease types or large dataset sizes. Addition-
ally, in scenarios where labelled source data is unavailable, our method can provide
an alternative option to utilise related unlabeled medical data as the source domain
and pre-train solely with the SSL algorithm. The experiment results demonstrate the
efficacy of using source domain data from similar medical imaging sources, even with
a relatively small dataset size that contains only hundreds of samples. Compared to
using large-scale datasets for pre-training, our approach proves effective in improving
target task performance without adding much computational cost.
In the representation learning stage, we initialise predictors h(·) by employing mul-
tiple pre-trained DL models with ImageNet weight. With most model layers unfrozen,
we retrain them for related medical representations by minimising cross-entropy loss
on the intermediate source domain data. Specifically, we selected Xception [54], Incep-
tionResNetV2 [55], and MobileNetV3 [56] as backbone encoder networks for their

6
ease of deployment, superior performance, and suitable depth for limited medical data
environments. In the fine-tuning stage, each backbone network f (·) is augmented with
a modified classification head g(·) to create the final predictor h = g ◦ f and learn
mapping representations to the target-specific label space. The trained medical knowl-
edge from the previous stage is partially frozen during this step for the convenience of
knowledge reuse. Given labelled training examples (x1 , y1 ), ..., (xn , yn ) from the down-
stream medical dataset, the predictor h(·) is fine-tuned to minimise cross-entropy and
learn domain-specific representations to support subsequent ensemble training. This
process is repeated until each backbone network has acquired the necessary knowledge
for the target samples.
Contrastive Self-supervised Learning. Model training using the SSL method
allows the base model to effectively learn visual representations from unlabeled medi-
cal images, providing a unique perspective distinct from supervised transfer learning.
To harness the advantages of SSL and complement the supervised TL method, we
adopted SimCLR [57, 58], an SSL algorithm based on contrastive learning to train
another group of base models. SimCLR learns representations by maximising agree-
ment between pseudo-labels augmented from the same training samples with different
views and original training examples.
In the representation learning stage, the initial predictors fθ (·) employ DL architec-
tures similar to the pre-trained backbone encoder networks used in the TL setting, and
a two-layer projection head is used to project the representation to a 128-dimensional
latent space. However, to maximise the diversity of base models, we select a dis-
parate architecture ResNet101 [59], along with MobileNetV3 and EfficientNetB0 [60]
as backbone encoders. Given the majority of predictor layers keep frozen, the SimCLR
algorithm learns representations z2k−1 and z2k from augmented views of the train-
ing sample and computes contrastive loss objectives. With a mini-batch of encoded
examples, the contrastive loss between a positive pair of examples i, j (different
augmentations of the same image) is given as follows:
exp(sim(zi , zj )/τ )
li,j = − log P2N
k=1 l[k̸=i] exp(sim(zi , zk )/τ )
where cosine similarity computes distances between two vectors and τ is a scalar that
denotes the temperature.
In the fine-tuning stage, the encoder network fθ (·) is replaced with a new classi-
fication head gθ (·) to map the predictions to the target class labels and compose the
final predictor hθ = gθ ◦ fθ . This predictor is trained to minimise cross-entropy loss
using the same labelled training examples (x1 , y1 ), ..., (xn , yn ) providing target domain
representations for subsequent ensemble training. Both TL and SSL training processes
are influenced by multiple hyperparameters, including optimiser type, learning rate,
training epochs, batch size, and weight decay, and we provided more details in the
Methods section.
Feature Fusion methods. In the feature fusion stage, we combine the extracted
features from pre-trained TL and SSL base models using feature fusion techniques.
Feature fusion is a process of fusing features from different layers or branches of differ-
ent sources or models to create a more informative feature representation [61]. Facing
the challenging medical imaging tasks, although recent state-of-the-art pre-trained
learning models using TL or SSL methods are able to capture useful features from

7
the target data samples, it is hard for them to handle the wide variability and scal-
ability of the data in a domain where extremely limited data are available. Thus,
the feature fusion method plays a crucial role in our framework to expand the diver-
sity and complementarity of the individual models, and promote the robustness and
accuracy of the overall decision-making process [62]. Under this situation, three main
feature fusion methods are widely applied: Concatenation, Element-wise addition and
multiplication, and Linear and nonlinear transformation.
• Concatenation simply combines the feature vectors of multiple sources by appending
them together. This method maximises the retention of original information but
may cause high-dimensional feature representations and raise feature complexity.
• Element-wise addition and multiplication method first computes the element-wise
sum or product of the feature vectors separately before combining them to reduce
the dimensionality of the fused feature. However, this method may cause a loss of
original information when calculating and selecting them.
• Linear and nonlinear transformation method works by introducing additional linear
or nonlinear functions to project the feature vectors into a common feature space
to select the most relevant features. Examples of these methods include Principal
Component Analysis (PCA), Independent Principal Component Analysis (ICA),
and Linear Discriminant Analysis (LDA).
Using feature fusion, we aim to preserve the most valuable features for the target
domain tasks and optimise fusion performance across the base pre-trained models.
Consequently, we explore the effectiveness of concatenation, PCA, ICA, and LDA
methods in enhancing feature quality. Among these methods, concatenation served
as the baseline method since it retains all the information about the characteris-
tics and allows comparison with other normalisation methods. PCA, an unsupervised
feature normalisation technique, captures maximum data variance and maintains lin-
early uncorrelated variables known as principal components [63]. In contrast, LDA is
a supervised feature normalisation method that is similar to PCA, LDA incorporates
class labels and aims to maximise class separation while minimising class variance [64].
ICA is also an unsupervised feature normalisation technique that aims to separate
a multivariate signal into additive independent components [65]. It assumes that the
observed data are mixtures of independent sources, and it tries to recover the original
sources.
Due to content constraints and research focus, we present experimental observa-
tions rather than exhaustive details. Our findings indicate superior performance of
unsupervised transformation methods such as PCA and ICA compared to the base-
line concatenation method, while supervised LDA performs comparatively worse. We
attribute this performance gap in supervised methods to their retention of only a lim-
ited number of features matching class label numbers, leading to a significant loss of
feature information and poorer performance. Additionally, concatenation is deemed
unsuitable as it may increase feature dimensions and complexity due to coincidences
in extracted features from multiple base models. Consequently, we utilised the con-
catenation method to keep all the feature information first and then applied the ICA

8
method for its ability to maximise independent feature representations and preserve
the unique attention views from diverse base models.
Unified Ensemble Learning using Pre-trained Deep Learning Archi-
tectures. In the ensemble learning stage, previously pre-trained base models are
integrated using ensemble learning technology that combines multiple base models to
create a more robust and powerful model [66]. This approach aims to leverage the
strengths of diverse models while mitigating their weaknesses, resulting in an improved
overall model [67].
In recent years, ensemble-based proposals have maintained steady growth and serve
as a state-of-the-art method in many fields [68–70]. However, traditional ensemble
learning methods that use weak base models to form a more powerful model have
been found to be computationally inefficient and less reliable. The effectiveness of
ensemble learning depends heavily on the performance of its base models. Errors in a
single base model can be amplified within the ensemble learning process, leading to
instability issues in the overall model [71]. In this circumstance, the combination of
powerful DL models with a team technique has shown potential to address this issue
and achieved great success in the medical field [72–74]. Our method uses double-fine-
tuned pre-training models to reinforce the performance of the base models, and we
referred to three main ensemble learning techniques that are commonly applied:
• Bagging [75] trains multiple base models independently on different subsets of the
same target data using random sampling with replacement. The final model’s predic-
tions are determined by aggregating the individual base models’ predictions through
voting.
• Boosting [76] employs a sequence of base models to iteratively correct the errors of
its predecessors and update them dynamically. Final predictions are also obtained
through a weighted vote of the base models.
• Stacking [77] combines the outputs of multiple base models using a higher-
level model, often referred to as a meta-model. The meta-model is trained on a
new dataset comprising the base models’ predictions, subsequently making new
predictions as output.
We employed a modified method based on the stacking method to learn represen-
tations from pre-trained base models effectively. This involved training fused features
at a higher model level with five ML models, including K-Nearest Neighbours (KNN),
eXtreme Gradient Boosting (XGBoost) [78], Support Vector Machine (SVM), Naive
Bayes (NB), and Random Forest (RF). Subsequently, a weighted vote of the ML
models’ predictions was conducted, with each model contributing equally to the final
decision and taking the majority voted predictions. The training process of the ML
model is influenced by independent parameters such as max-depth, number of neigh-
bours, random state, etc., which is based on the specific requirement of the ML model.
The following method section will discuss these parameters and their impact in further
detail.

9
2.2 Clinical Evaluation Settings
In this study, our ETSEF method was mainly trained and evaluated for four domain-
specific tasks that belong to independent medical imaging modalities (dermatology
photography, gastrointestinal tract, breast ultrasound and chest radiography), span-
ning from colourful domains to grey-scale domains. Each base model has been trained
through an initial source domain (ImageNet), a related medical dataset Din as the
intermediate source domain, and a target domain dataset Dtar . To facilitate the final
training and testing, all data sets from the target domain were randomly split into
train and test sets with a ratio of 8: 2, and we further enhanced the data of the train
sets to reduce the potential influence of data imbalance problems. Moreover, we also
included an out-of-distribution task to test the generalisability of the ETSEF strategy.
Here, we provide an overview of each task and their corresponding datasets.
Task 1: Dermatology MonkeyPox Classification. Recently, the World Health
Organization highlighted the rapid spread of MonkeyPox across Europe and Amer-
ica. During the initial peak outbreak phase of MonkeyPox, a significant challenge
emerged in distinguishing MonkeyPox patients from those with other pox or skin
diseases. This task aims to train an efficient model to detect MonkeyPox patients
T1
quickly. The intermediate source dataset Din is a publicly available dataset called the
Monkeypox Skin Images Dataset (MSID) [79], which compromised 770 unique cases
T1
collected from internet-based sources. Din includes skin and body images of patients
and healthy individuals of various sexes, ages, and races. This dataset covers three
diseases—MonkeyPox, Chickenpox, and Measles, along with images of healthy indi-
T1
viduals. The target dataset Dtar for the task is the MonkeyPox Skin Lesion Dataset
T1
Version 2.0 (MSLDv2) [80]. Specifically, Dtar focuses on various pox diseases and
T1
contains 755 original skin lesion images from 541 distinct patients. Dtar includes six
distinct classes, namely Monkypox (284 images), Chickenpox (75 images), Measles (55
images), Cowpox (66 images), Hand-foot-mouth disease or HFMD (161 images), and
Healthy (114 images). To address the data imbalance between the Monkeypox class
and the other classes, we applied data augmentation to the five smaller classes, balanc-
ing their sample numbers to mitigate the influence of the imbalance. The augmented
training set comprises 1,414 samples, with approximately 240 samples per class. Com-
T1 T1
pared to the intermediate source dataset Din , the target dataset Dtar share the same
disease types such as MonkeyPox, ChickenPox and Measles. Moreover, the target
T1
Dtar is more complex and covers additional cases, including CowPox and Hand, foot,
and mouth disease; the increased disease classes make the target classification task
more challenging and suitable for testing the performance of the ETSEF in complex
scenarios.
Task2: Gastrointestinal Endoscopy Classification. Gastrointestinal (GI)
endoscopy is an active field of research driven by the high incidence of lethal GI
cancers. Early GI cancer precursors are often missed during endoscopic surveillance,
necessitating the integration of experienced endoscopists’ knowledge to improve dis-
ease classification accuracy. The objective of this task is to train a robust model to
classify anatomical landmarks and clinically significant findings from GI tract images.
T2
For this purpose, we utilise an intermediate source dataset Din from the Medico Mul-
timedia Task at MediaEval 2018 (Medico 2018) [81] to provide a useful knowledge

10
T2
base. Din comprises 5,293 GI images, including anatomical landmarks, pathological
findings, polyp removal cases, and normal cases, categorised into 16 classes. The tar-
T2
get dataset Dtar we use in this task is Kvasirv2 [82], which contains 8,000 GI tract
T2
images of anatomical landmarks and pathological findings. Dtar is a balanced dataset
across eight classes, with 1,000 samples per class. Data augmentation was also applied
to this dataset to evaluate the effectiveness of this technique within our framework,
doubling each class sample while maintaining balance, leaving a train set with 12,800
samples and a test set with 1,600 samples. Notably, the Medico 2018 dataset also
T2
includes some samples from the Kvasir version 1.0 dataset, and we compared Din and
the target test set to ensure there are no data duplication issues.
Task3: Breast Ultrasound Classification. Breast cancer is one of the most
common causes of death among women worldwide. Early detection is crucial in reduc-
ing mortality rates. This task aims to classify whether a patient has breast cancer
and determine if the cancer is benign or malignant. Due to the lack of related source
datasets, we selected a publicly available ultrasound dataset from Kaggle, the Poly-
T3
cystic Ovary Syndrome (PCOS) detection dataset [83], as our source dataset Din .
T3
Specifically, Din is a binary dataset with 1,924 images categorised into two classes:
infected and not infected by PCOS. We chose this dataset because it uses ultrasound
T3
imaging, providing similar grey-scale medical representations. The target dataset Dtar
for this task is the Breast Ultrasound Dataset (BUSI) [84]. The BUSI dataset com-
prises 780 images collected from 600 female patients, with ground truth labels provided
by expert annotation. The samples are categorised into three classes: Normal (133
samples), Benign (437 samples), and Malignant (210 samples). To address the class
imbalance, we performed data augmentation on the normal and malignant samples,
resulting in 279 images for the normal class and 340 images for the malignant class
in the augmented training set. Additionally, the original dataset includes 798 masked
ground truth images for segmentation training, which we excluded as they are not
relevant to our target task.
Task4: Brain Tumour Detection Brain tumours are among the most aggressive
diseases, and MRI is the best technique for detecting them. This task aims to train
models on two MRI datasets to diagnose different brain tumours and determine if a
T4
patient has the disease. The source domain dataset Din is a publicly available Kaggle
dataset called Br35H [85]. Br35H contains 3,060 MRI image samples divided into two
T4
classes: with and without brain tumours. Similarly, the target dataset Dtar for this task
is also a tumour image dataset called Brain Tumor Classification (BTC) dataset [86],
which includes 3,264 image samples categorised into three types of tumours (glioma,
meningioma, and pituitary) and non-tumour cases. The sample number of each class
is generally balanced among the three tumour classes, but the non-tumour class is
underrepresented. To address this issue, we performed data augmentation on the non-
tumour samples, resulting in a more balanced sample distribution with approximately
T4 T4
750 images for each class in the training set. The Din and Dtar share a similar feature
T4
distribution and data size, while Dtar contains more classes and has a more imbalanced
class distribution, making it a challenging task to face.

11
Task5: Out-of-distribution Disease Detection. To test the generalisation
ability of ETSEF, we conduct data-efficient performance testing under realistic out-
of-distribution (OOD) settings, where OOD refers to newly appeared and previously
unseen clinical environments. Leveraging the ensemble learning structure, our ETSEF
strategy can utilise any trained base models as feature extractors to process unseen
data without requiring additional retraining or fine-tuning steps. The ML classifiers in
the ensemble learning stage build correlations between these extracted feature vectors
and map them to predicted labels. In addition to model performance, the generalisa-
tion ability of the model is particularly valuable in the medical field, where varying
imaging techniques often cause significant feature distribution shifts between different
datasets. For this purpose, we selected a glaucoma dataset called EyePACS-AIROGS-
light-V2 (EyePacs) [87] as our target dataset. We utilised pre-trained base models from
the previous task T2 as feature extractors for this task, as they were also trained on
colourful medical image datasets, and the representation knowledge they learned from
the gastrointestinal endoscopy classification task is markedly different from those for
glaucoma images. Moreover, no further source domain datasets or fine-tuning steps
were used since we want to keep the feature difference between pre-trained weights
with the target dataset to test the trained ETSEF model’s generalisation ability under
extreme environments. The EyePacs dataset can be used to conduct a binary classi-
fication task with 6,000 samples. Following the overall protocol, we split the target
dataset into training and testing sets, using 1,000 samples in the test set to perform
the performance evaluation.
Experimental setup. To test the clinical performance and analyse the effective-
ness of ETSEF under different clinical settings, standard supervised baseline models
were trained for T 1 − 4 without using any pre-trained weights. These baseline models,
which utilised the same base DL architectures, were then subjected to feature fusion
and ensemble learning techniques to create a scratch baseline ensemble model. Addi-
tionally, we built another group of TL-based ensemble models and SSL-based ensemble
models using single pre-trained method-based models for comparison. The rationale
for building separate pre-trained ensemble models is to establish a baseline with one
pre-training-method ensemble model, thereby demonstrating the performance of fusing
multiple pre-training methods.
Fig. 2 gives an overview of the datasets used for each task. The first four tasks rep-
resent common clinical scenarios involving training models for specific modalities and
tasks, while the final setting simulates an extreme situation where a domain-specific
pre-trained model for a rare disease or an unseen task is unavailable. To examine the
robustness of ETSEF when faced with various clinical data challenges, all selected
target data sets are limited in data size: the colourful training sets contain up to
6,000 images, and the greyscale training sets have up to 2,800 samples. Another chal-
lenge is data imbalance, as most of the target datasets have imbalanced data samples
between classes. Other challenges, such as low resolution, proportional distortion, and
variations in image characteristics, further complicate the task of learning powerful
representations from these data. This created a very challenging environment for us
to test the ETSEF model.

12
Fig. 2 Overview of datasets for each task The trained ETSEF model is evaluated on five
tasks that belong to different domains, as they cover a wide range of medical imaging techniques
and distribution shifts. For the first four domains, we utilised a source dataset of ImageNet, an
intermediate source dataset from the medical area. The last domain utilised an out-of-distribution
target dataset to test the ETSEF model’s generalisation ability towards unseen data, and there is no
source data in this task.

2.3 Performance of ETSEF

This section evaluates the ETSEF model using several metrics, including accuracy,
Recall, Precision, and F1-score. Additionally, we compare our method with strong
supervised baseline models, along with several state-of-the-art methods to demonstrate
its performance.
Evaluation Metrics. After the training phase, all models were evaluated using
test sets derived from the target datasets, employing the accuracy, recall, precision, and
F1 score evaluation metrics to assess their performance on each task. These metrics
are derived directly from the confusion matrix, which includes true positives (TP),
true negatives (TN), false positives (FP), and false negatives (FN). A TP represents a
positive event correctly predicted as positive, while a TN indicates a negative outcome
correctly predicted. FP occurs when a negative outcome is incorrectly predicted as
positive, and FN occurs when a positive outcome is incorrectly predicted as negative.
The evaluation metrics of accuracy, recall, precision, and F1 score are all based on
these four measurements and the equations for these metrics are listed below.

13
Accuracy = (T P + T N )/(T P + T N + F N + F P )
Recall = T P/(T P + F N )
Precision = T P/(T P + F P )
P recision ∗ Recall
F1 Score = 2 ∗
P recision + Recall
Among these evaluation metrics, accuracy indicates the proportion of correctly
predicted samples among all test samples. Recall (sensitivity) measures the proportion
of actual positive samples correctly identified. Precision represents the proportion of
true positive predictions among all positive predictions made by the model. The F1
score is a harmonic mean of recall and precision which provides a balanced metric that
considers both precision and recall.

Fig. 3 Performance of Pre-trained Base Models for Each Task: The performance of the base
models is evaluated by accuracy performance. Models pre-trained using TL methods are highlighted
in green, while those pre-trained using SSL methods are marked in blue. Overview results show that
pre-trained models using supervised TL methods perform better in target datasets with colourful
images, and the models pre-trained with self-supervised methods are generally stronger in grey-scale
image datasets.

Performance on Domain-specific Tasks. Fig. 3 provides an overview of the

pre-trained base models’ accuracy performance on Task 1-4. Based on the figure,

14
TL-based pre-trained models showed up to 7% higher accuracy in Dermatology
MonkeyPox classification(T1 ) and Gastrointestinal Endoscopy classification (T2 ). In
contrast, SSL-based models outperformed TL-based models in Breast Ultrasound Clas-
sification (T3 ) and in Brain Tumour classification (T4 ). The primary reason that causes
the performance difference between the target groups T1 , T2 and T3 , T4 lies in the
colour channels: T1 and T2 consist of colourful medical images, while T3 and T4 are
grey-scale medical images. Notably, we minimised the influence of colourful ImageNet
pre-trained weights by using similar-scale intermediate medical datasets with the same
colour channels to retrain all base models during the representation learning stage. The
results indicated that models using TL methods are more effective at learning from
colourful images compared to SSL methods. Meanwhile, SSL methods demonstrated
superior performance when learning from grey-scale images, which has been noticed
by previous research but did not raise enough attention [88, 89]. These observations
suggest that both supervised and unsupervised pre-training methods have distinct
advantages depending on the data environment. To achieve better performance, it is
beneficial to consider both methods in combination rather than relying on a single
approach.
Fig. 4 shows the performance of each training stage during the ensemble learning
process for the two baseline pre-training models and our ETSEF model. The perfor-
mance comparison between the first and second stages indicated that using pre-trained
learning techniques can generally enhance the model’s overall performance. Among
all the tasks, base models showed the most significant improvement in T2 during this
stage, likely due to its larger intermediate source dataset compared to the other tasks.
However, the performance of the TL and SSL baseline models declined in T3 after
the feature fusion stage. This issue did not occur in other tasks, possibly because
the features extracted from ultrasound medical images in T3 can hardly distinguish
each other and have a high chance of being redundant overlays. In the subsequent
feature selection step, the ICA technique has also brought benefits to the ensemble
model’s performance, underscoring the importance of feature selection in maintaining
the independence and generalisation ability of features. The final weighted voting stage
further improved overall decision-making performance and ensured the robustness of
the entire ensemble learning process.
Unlike the two baseline models, we hid the ETSEF model’s early-stage performance
and began comparison at the feature fusion stage, as it utilised the same TL and
SSL pre-trained base models as the baseline models shown in the first two training
stages. Initially, ETSEF’s performance at the feature fusion stage was lower than
that of the baseline ensemble models, possibly due to the same feature duplication
issue. However, ETSEF outperformed both baseline models in each task after the
feature selection and weighted voting stages. The results indicated that combining
features from multiple pre-training methods and base model architectures leads to a
statistically significant improvement compared to using a single pre-training method.
The ETSEF model benefits from enhanced generalisation ability and a better capacity
to capture useful representations. Notably, the performance of ETSEF showed greater
improvement compared to the baselines in T1 and T3 , which have smaller target dataset

15
Fig. 4 Ensemble models’ five-stage performance on each task: This figure compares the per-
formance of the baseline models and the ETSEF model across four main training stages: pre-training,
feature fusion, feature selection, and majority voting. The results indicate a trend of performance
improvement at each stage, benefiting both the baseline and ETSEF models. Notably, the ETSEF
model outperforms the baseline models after the feature selection stage in every task.

sizes, demonstrating the ETSEF method’s effectiveness in scenarios with limited data
samples.
Table. 1 presents a performance comparison between ETSEF and the three base-
line models. While extended Fig. A2, Fig. A3, and Fig. A4 provide a corresponding
confusion matrix based on the ETSEF model and two pre-trained ensemble base-
lines. According to the test performance, the ETSEF method showed substantial
improvements compared to all baseline models, whether trained from scratch or with
pre-trained weights. In particular, for dermatology, endoscopy, and breast cancer tasks,
ETSEF’s performance increased by 7.5% to 8.5% across all evaluation metrics when
compared to the scratch baseline model. The improvements were even more pro-
nounced in the brain tumour task, with an 11.3% increase in accuracy and recall.
Furthermore, ETSEF showed a significant improvement of up to 2.9% in accuracy
for each task compared to the two strong pre-trained ensemble baseline models. The
experimental results demonstrated the effectiveness of combining multiple pre-training
methods, as the combination method enhanced overall performance by expanding the
base model architectures and pre-training method variety, thereby improving their

16
Accuracy Recall Precision F1score
Tasks Method
(%) (%) (%) (%)
Supervised Ensemble (Scratch) 89.2 (88.7,89.7) 89.2 (87.8,90.7) 89.5 (89.3,89.7) 88.9
Supervised Ensemble(TL) 94.2 (94.0,94.4) 94.2 (94.0,94.4) 95.2 (94.4,96.0) 94.3
Task 1
Unsupervised Ensemble (SSL) 91.7 (91.2,92.2) 90.8 (90.5,91.1) 92.3 (91.9,92.7) 91.1
ETSEF (Proposed) 96.7 (96.3,97.1) 96.7 (96.2,97.2) 96.9 (96.7,97.1) 96.7
Supervised Ensemble (Scratch) 89.3 (89.0,89.7) 89.3 (89.1,89.5) 89.2 (89.0,89.4) 89
Supervised Ensemble(TL) 97.1 (96.9,97.2) 97.4 (97.2,97.5) 97.4 (97.3,97.5) 97.4
Task 2
Unsupervised Ensemble (SSL) 93.4 (93.1,93.6) 93.2 (93.1,93.3) 93.2 (93.0,93.3) 93.1
ETSEF (Proposed) 97.8 (97.5,98.0) 97.8 (97.8,97.9) 97.8 (97.8,97.8) 97.8
Supervised Ensemble (Scratch) 87.5 (85.8,89.2) 87.5 (87.2,87.8) 87.5 (87.5,87.6) 87.4
Supervised Ensemble(TL) 93.3 (93.0,93.6) 90.8 (90.6,91.1) 91.5 (91.0,92.0) 90.9
Task 3
Unsupervised Ensemble (SSL) 94.1 (94.0,94.2) 93.3 (93.1,93.5) 93.6 (93.2,94.0) 93.3
ETSEF (Proposed) 95.0 (94.8,95.2) 95.0 (95.0,95.0) 95.0 (94.9,95.1) 95.0
Supervised Ensemble (Scratch) 95.7 (95.2,96.3) 95.4 (93.7,97.1) 95.7 (92.8,97.6) 95.5
Supervised Ensemble(TL) 98.2 (97.9,98.5) 98.1 (98.0,98.2) 98.3 (98.1,98.5) 98.2
Task 4
Unsupervised Ensemble (SSL) 98.7 (98.3,99.1) 98.9 (98.7,99.1) 98.4 (97.7,99.1) 98.6
ETSEF (Proposed) 99.4 (99.2,99.6) 99.4 (99.3,99.5) 99.4 (99.1,99.5) 99.4
Table 1 Comparison of ETSEF with baselines The baseline performance consists of three
ensemble scenarios: base models trained for target tasks without using pre-trained weights and
models trained with TL or SSL pre-trained knowledge. The results are presented in terms of
accuracy, recall, precision, and F1-score, with average performance reported as floating-point values.
From the result comparison, ETSEF outperformed all baseline scenarios in every task, regardless of
whether TL or SSL pre-trained methods were used.

generalisation ability on target datasets without requiring large amounts of train-

ing data. Compared with Fig. A4 with Fig. A1, the confusion matrix of the baseline
ensemble model trained without data augmentation showed less balanced performance
across different classes. This experimental finding suggests that the data imbalance
issue observed in T1 , T3 , and T4 has been well addressed and did not affect ETSEF’s
final performance. Highlighting the robustness of our approach when dealing with
imbalanced medical datasets, which is particularly valuable in medical imaging tasks.
Performance on Out-of-distribution Domain Task. The experimental results
of the ETSEF model and baseline models in the OOD data detection task are presented
in Table. 2. Specifically, we compared the performance of scratch-trained and pre-
trained baseline models to assess their effectiveness in unseen data and provide a
contrast to the ETSEF model. These baseline models are built from corresponding
base models of T2 . Compared to previous domain-specific tasks, the performance on
the OOD task showed a more significant rise in accuracy with an average of 9.5% when
applying pre-trained weights to base models. This finding suggests that pre-trained
methods can significantly enhance a model’s capability to capture useful features when
encountering new data with different feature distributions. Additionally, we noticed
that the SSL baseline model outperformed the TL baseline model in this task, which
contrasts with the previous observation that the TL method has better performance
on colourful datasets in domain-specific tasks. This suggests that the features captured
from the SSL method may cover a broader space, thus supporting better generalisation
ability.
Similar to the domain-specific tasks, ETSEF achieved the highest performance
compared to all baselines. There was a substantial improvement of 12.4% from the
scratch baseline model to our ETSEF model and a 2% improvement in accuracy com-
pared to the pre-trained baseline models. The performance improvement demonstrated

17
Accuracy Recall Precision F1score
Tasks Method
(%) (%) (%) (%)
Supervised Ensemble (Scratch) 63.2 (61.8,64.5) 63.2 (63.0,63.4) 63.2 (62.8,63.6) 63.2
Supervised Ensemble(TL) 71.8 (71.2,72.4) 71.8 (71.3,72.3) 71.8 (71.4,72.2) 71.8
Task 5
Unsupervised Ensemble (SSL) 73.6 (73.2,73.9) 73.6 (73.6,73.6) 73.7 (73.4,74.0) 73.6
ETSEF (Proposed) 75.6 (74.7,76.3) 75.6 (74.7,76.4) 75.7 (95.3,96.0) 75.6
Table 2 OOD task performance comparison The performance table indicates that the
baseline ensemble model without any pre-training methods significantly underperforms compared to
models utilising pre-training techniques. In contrast, the ETSEF model, which combines multiple
pre-training methods has demonstrates superior performance when handling unseen data compared
to the two baseline models that use a single pre-training method.

that ETSEF not only excels in domain-specific tasks but also handles unseen data more
efficiently. Based on all these observations and analyses, we learned that pre-training
methods are necessary for ensemble learning, which brings significant benefits to the
final model. The superior performance of ETSEF further suggests that expanding the
range of pre-trained knowledge can yield even greater advantages.
Performance Comparison with State-Of-The-Art Methods. To further
demonstrate the performance and efficiency of ETSEF, we compare it with recent
state-of-the-art models. Table 3 lists the performance of selected methods alongside
our ETSEF method. We selected reference methods based on their publication year
and citation count, focusing on research papers published within the recent three years
to ensure their relevance. Pre-print methods without citations were excluded to main-
tain the trustworthiness of the comparisons. Notably, this comparison is conducted
only within domain-specific tasks, as the main focus of the OOD task is to test the
generalisation ability of the ETSEF model instead of achieving higher performance.

Accuracy Recall Precision F1score

Tasks Method Model
(%) (%) (%) (%)
Ali et al. (2023) [80] DenseNet121 82.3 80.0 85.0 83.0
Naysk et al. (2024) [90] Mpox Classifier 94.0 93.0 93.5 93.0
Task 1
Biswas et al. (2024) [91] BinaryDNet53 95.0 94.4 91.5 92.9
ETSEF (Proposed) CNN Ensemble 96.7 96.7 96.7 96.7
Dong et al. (2023) [92] Transformer Ensemble 90.6 - - 90.7
Mukhtorov et al. (2023) [93] ResNet-152 93.5 - - -
Wang et al. (2023) [94] VIT 95.4 - - 95.4
Task 2
Patel et al. (2024) [95] EfficientNetB5 92.6 93.0 93.0 93.0
Khan et al. (2024) [96] LG-CNN 95.0 95.0 95.0 95.0
ETSEF (Proposed) CNN Ensemble 97.8 97.8 97.8 97.8
Gheglati et al. (2022) [97] VIT 85.3 - - -
Deb et al. (2023) [98] CNN Ensemble 85.2 - - -
Sahu et al. (2023) [99] Hybrid CNN 93.7 94.6 93.2 -
Task 3
Yue et al. (2024) [100] MedMamba 87.2 - - -
Zhou et al. (2024) [101] VIT-Ensemble 93.8 91.7 95.0 -
ETSEF (Proposed) CNN Ensemble 95.0 95.0 95.0 95.0
Kadam et al. (2021)[102] CNN 94.0 - - 94.0
Ravinder et al. (2023) [103] GCN 95.0 - - -
Task 4 Sahu et al. (2024) [104] CNN 94.2 93.5 93.5 93.5
Sachdeva et al. (2024) [105] Xception 97.4 - - 94.9
ETSEF (Proposed) CNN Ensemble 99.4 99.4 99.4 99.4
Table 3 Performance comparison with state-of-the-art models We compared the ETSEF
model with recent research using the same target datasets. The results show that the proposed
ETSEF framework demonstrates advanced performance across various medical imaging domains
and superior capability in handling challenging multi-classification tasks.

18
ETSEF show superior performance when compared to state-of-the-art
methods. For T1 , the proposed ETSEF method outperformed all reference methods,
showing an accuracy advantage ranging from 1.7% to 14.5%. Although Nayak et al. [90]
and Biswas et al. [91] achieved a 10% performance advantage over our pre-trained base
models with specifically designed architectures for the MSLDv2 dataset, the ETSEF
method leverages an ensemble framework to enhance the base models’ performance
significantly, and finally surpassing these specialised methods.
We observed numerous related research efforts in the gastrointestinal endoscopy
classification task (T2 ). The listed methods include popular Convolutional Neural Net-
work (CNN) architectures like ResNet and EfficientNet, as well as transformer-based
models. Compared to the works of Mukhtorov [93] and Patel [95], our base models
which utilised similar CNN architectures with fewer layers have achieved higher per-
formance through the double fine-tuning steps. Combining these powerful base models
with the ensemble learning technique resulted in a 2.5% improvement over the highest
reference method, which used the advanced vision transformer model [94].
For the breast cancer classification task, we noticed that more research is being
done to employ ensemble or hybrid methods to enhance model performance. This is
possibly due to the limited sample size of the target BusI dataset that has constrained
the single DL model’s performance. We observed that methods with ensemble learning
generally performed better than single-model approaches. The key difference between
our method and these referenced ensemble methods is that we combined multiple base
model architectures and utilised various pre-training methods simultaneously to learn
more generalised representations from the source domain. The results highlight the
efficiency of our method in learning useful knowledge.
In the brain tumour classification task, the reference methods that used different
CNN models achieved a high average performance, with the highest accuracy being
97.4%. In contrast, our ETSEF method achieved a perfect accuracy of 99.4% in test
cases, demonstrating that the ETSEF method can correctly classify almost all sam-
ples across four classes, offering a significant performance advantage over advanced
methods.

2.4 Ablation Study

To understand the importance and contributions of different base models in the ensem-
ble learning stage, we conducted an ablation study to analyse their impact on the
higher-level ML classifiers.
We set up the ablation study by excluding one base model at a time and using
the remaining models to conduct ensemble learning. We compared the performance of
these modified models with the ETSEF model that included all base models to assess
the influence of each base model. Specifically, we extracted features from each base
model, then concatenated and selected them using the same methods as in the ETSEF
model. We trained these processed features with the same higher-level ML classifiers
(SVM, XGBoost, RF, NB, and KNN) as in the previous experiment. Finally, we aver-
aged the performance of each ML classifier to obtain reliable results and compared
these averages to determine the impact of excluding each base model on the ETSEF
model’s performance.

19
Fig. 5 Ablation study of pre-trained base models: An ablation study was conducted to under-
stand the influence of each base model in the ensemble learning stage, providing evidence of their
value to the overall model performance. The results indicate that TL base models generally have a
positive influence on tasks 1 and 2, while SSL base models are more effective in supporting ensemble
learning on tasks 3 and 4.

Fig. 5 illustrates the performance changes observed in each task, highlighting the
impact of each base model on the ML classifiers’ performance. Notably, if excluding
a base model decreases the ML classifiers’ performance, it indicates that the chosen
base model is necessary for the ensemble learning process and improved integrated
feature quality. Conversely, if excluding a base model enhances the ML classifiers’
performance, it suggests that including the base model negatively impacts the ML
classifiers’ input feature quality.
The figure also reveals that the influence of different pre-training methods varies
across different tasks. For example, the MobileNet model pre-trained with TL posi-
tively influences tasks 1, 2, and 4 but negatively impacts task 3. In contrast, the same
MobileNet model pre-trained with SSL positively influences tasks 3 and 4 but neg-
atively affects tasks 1 and 2. Generally, TL-based models positively influence tasks
involving colourful medical image datasets (tasks 1-2), while SSL-based models posi-
tively influence tasks with grey-scale datasets (tasks 3-4). This observation aligns with
the performance trend from Fig. 3: TL-based models perform better with colourful
datasets, providing higher-quality features, and SSL-based models perform better with
grey-scale datasets.

20
However, this general observation does not cover all base models, as the Xception
model pre-trained with TL consistently shows a positive influence, while the SSL-
based ResNet model can also positively impact colourful datasets (task 2), and the
TL-based InceptionResNet model has a negative influence on task 1. These findings
suggest that the final performance of base models depends not only on the pre-training
method and architecture but also on the characteristics of the target dataset and
many other factors. The influence between different architectures, training methods,
and data environments remains unresearched, waiting for more exploration in the
future. At this stage, our ablation study mainly indicates that choosing suitable pre-
training methods for different target datasets may maximise the positive impact on
ensemble learning. Additionally, incorporating diverse base model architectures and
pre-training methods can continue to enhance the ensemble model’s ability to learn
and extract high-quality representation features and achieve more stable performance
across varying environments.

2.5 Visual Analysis

Medical image analysis involves high-stakes decision-making that is traditionally
reliant on human domain experts, where the decision-making process can be described
and understood. However, neural networks, consisting of numerous layers contributing
to the final predictions, present a challenge in fully comprehending their decision-
making process [106]. To test the robustness of the trained ETSEF model and
understand how the model makes predictions, we employed several explainable arti-
ficial intelligence (XAI) techniques, including Grad-CAM, SHAP, and t-SNE. These
XAI techniques allow us to visualise the attention areas of the trained models that
support their decision-making and examine the extracted feature distributions to
understand how the model classifies the features. By comparing the attention areas
of the base models with the ground truth areas that indicate disease, we can evaluate
how well the models have learned to recognise the true disease symptoms. Addition-
ally, this helps us determine whether the model’s predictions are based on reliable
judgments or influenced by unrelated elements.
Deep SHAP map analysis. The SHAP method [50] calculates the marginal
contribution of each feature by comparing the model’s prediction with and without the
feature. This process involves masking the feature, observing the change in the model’s
output, and aggregating the contributions using Shapley values. Specifically, a feature
point in the SHAP map with a higher SHAP value indicates a greater contribution
to the model’s final prediction, while a lower SHAP value means it detracts from the
model’s decision for the corresponding label.
As mentioned previously in Fig. 3, SSL pre-trained base models generally outper-
form TL pre-trained models in grey-scale medical imaging tasks, whereas TL models
excel with colourful medical images. However, it is premature to conclude that one
pre-training method is superior to the other under specific conditions, as performance
metrics alone do not fully reflect how well a model has learned from a dataset. There-
fore, we compare the decision-making processes of each base model using deep SHAP
maps to assess their robustness with the target dataset. We selected the EfficientNet
model, which performed best among the SSL base models, and compared it with the

21
Fig. 6 Explainable visualisation of base models using SHAP maps: The SHAP explainable
AI technique is used to visualise the base model’s attention areas and captured representations in the
target image samples. The maps demonstrate that when pre-trained SSL base models occasionally
fail to capture useful disease representations from the target sample, pre-trained TL base models can
provide additional support.

22
three TL base models in the grey-scale imaging environment using the BUSI dataset
from T3 . Fig. 6 provides a sample group of SHAP maps generated from four pre-trained
base models, each containing four target image sub-figures. The first column shows
the original test sample image, along with the true label and the predicted label made
by each base model. The target image is a sample of malignant breast cancer using
ultrasound imaging, where the symptom area appears as a shadow hole in the mid-
dle of the image. The remaining three columns represent all the classes in the target
dataset, with the percentage number above each sub-figure indicating the probability
that each model assigns to classifying the test image into the corresponding label.
From the SHAP maps, we observed that the EfficientNet model incorrectly pre-
dicted the target image as benign, focusing its attention on the top left of the image.
Similarly, the InceptionResNet model, which pre-trained using the TL method, made
the same incorrect prediction based on the middle part of the shadow hole and the top
left area. In contrast, the TL-based Xception and MobileNet models correctly clas-
sified the target sample as malignant. The Xception model distributed its attention
between the middle of the ground truth shadow area and the top left of the image,
while the MobileNet model concentrated more on the shadow hole itself, covering the
left half of the ground truth disease area.
Fig. 7 provides a visualisation sample from T4 . The ground truth label for the
sample image is a pituitary tumour caused by unusual growths in the pituitary gland.
In the original image, the pituitary tumour symptoms are marked with a red circle
located behind the nose at the base of the brain. For this case, we compared two
heatmaps generated from the TL base models (InceptionResNet and Xception) with
two heatmaps from SSL base models (MobileNet and EfficientNet). The SHAP maps
show that all four pre-trained base models made correct predictions. However, the
feature map shows that the TL-based models made their predictions by capturing the
ground truth symptoms within the pituitary gland area. In contrast, the SSL-based
models focused on the right bottom and meninges areas, making their predictions less
convincing and reliable compared to the TL-based models.
To summarise, we observed samples suggesting that SSL pre-trained base mod-
els do not always learn better symptoms than TL-based models under a grey-scale
image environment, and TL models can provide useful insights when SSL models fail.
Also, a similar situation happened when pre-trained base models faced the colourful
image environment as the extended Fig. A5 and Fig. A6 present. Importantly, from
all these sample test cases, we noticed that none of the single pre-trained base models
successfully captured the entire ground truth area, likely due to the limited size of the
target dataset constraining the model’s ability to learn high-quality representations
and make reliable predictions.
These observations have raised our attention as they pose a potential risk in real
clinical settings, and they suggest that using a single deep learning architecture or
pre-training method may be insufficient in limited and challenging medical environ-
ments. Fusing features from various base models shows promise in bridging this gap,
as they combine the captured representation areas from each base model. Moreover,
we observed more samples demonstrate that by combining different models’ attention
and extending the coverage, an ensemble model can make more reliable predictions.

23
Fig. 7 Explainable visualisation of base models using SHAP maps: Here we provide another
example where SSL base models fail to capture the representative disease symptoms. In this case,
TL-based models offer a better understanding of the target disease symptoms, compensating for the
SSL models’ lack of knowledge.

24
In the following subsection, we use Grad-CAM and t-SNE methods to provide fur-
ther evidence of the power of feature fusion and ensemble learning in improving model
robustness and reliability in the decision-making process.
Grad-CAM heatmap analysis. The Grad-CAM method used here is an
advancement of the Class Activation Mapping (CAM) technique, which was originally
developed by Zhou et al. [107]. CAM integrates a global average pooling layer into a
standard CNN, linking critical feature contributions to the CNN’s specific predictions
and providing insight into the model’s decision-making process. Furthermore, Grad-
CAM builds on this by using gradients from the network’s final convolutional layer to
produce a coarse localisation map. This map highlights the influential regions within
an image that contribute to the prediction of specific concepts or classes. We used the
Grad-CAM technique to generate heatmaps to pinpoint the critical area affecting the
model’s prediction.

Fig. 8 Explainable visualisation using Grad-CAM maps: To provide a comprehensive insight

into the transformation of the model’s decision-making process after ensemble training, we utilised
Grad-CAM to map the models’ attention regions. The sample heatmaps are from each pre-trained base
model, compared with the ETSEF model that fused their features, and have clearly illustrated how
the ensemble learning method improved the model’s capability to capture target disease symptoms
accurately.

25
Fig. 8 shows a heatmap sample from T4 of the BTC dataset. The target image
is of a meningioma tumour, which is a common type of tumour that grows from
the membranes surrounding the brain and spinal cord. The symptoms of this type
of tumour include abnormal swelling that presses on the nearby brain, nerves, and
vessels.
From Fig. 8, all pre-trained base models correctly predicted the ground truth label.
However, TL-based models did not accurately capture the symptomatic area. Specifi-
cally, the Xception model focused on the left side of the brain, while the MobileNet and
InceptionResNet models focused outside the brain. Conversely, the SSL-based models
performed better, with all three identifying the abnormal area and concentrating on
the ground truth region. The EfficientNet model partially covered the disease area,
but also focused on two regions outside it. The MobileNet and ResNet models also
partially captured the symptomatic area but did not fully reveal it.
The ETSEF model demonstrated a superior understanding of the target disease.
Its heatmap perfectly captured the disease area and minimised the negative influence
of the TL-based models, avoiding attention outside the brain.
Fig. 9 compares Grad-CAM heatmaps between pre-trained base models and
the ETSEF model using a sample image from the Kvasirv2 dataset of T2 . This
image belongs to the normal Z-line class, where classification performance is often
confused with esophagitis. The Z-line is the squamocolumnar junction where the
esophageal squamous epithelium transitions to the stomach’s columnar epithelium,
while esophagitis indicates inflammation of the esophagus, characterised by an
irregular, thickened Z-line with erythema, visible erosions, and ulcers.
In the figure, the first column shows the ground truth area marked on the origi-
nal sample image, where the esophageal squamous epithelium is located in the middle
of the image. The second column displays heatmaps from three TL-based pre-trained
models. The Xception and InceptionResnet models only partially focused on the
ground truth area and failed to make the correct prediction. In contrast, the MobileNet
heatmap successfully covered most of the ground truth area and made the correct
prediction. Self-supervised base models also struggled, with only the Efficient model
successfully capturing the upper side of the ground truth area, while MobileNet and
ResNet misfocused on areas outside the target.
The final column presents the heatmap of the ETSEF model after feature fusion.
This model combined features from six base models, successfully capturing the target
ground truth area and making an accurate prediction based on the enhanced features.
The heatmap samples demonstrated that the ETSEF method extended the attention
area of the target image and improved the feature quality to cover the whole ground
truth area and make reasonable predictions.
Extended Fig. A7 presents additional samples from target tasks. These heatmaps
demonstrate that the ETSEF model has enhanced the overall model’s generalisation
ability and reliability. It broadens the attention area when individual base models only
capture parts of it or by correcting the focus when base models incorrectly focus on
areas outside the disease symptoms.

26
Fig. 9 Explainable visualisation using Grad-CAM maps: Another representative sample
shows that both TL and SSL-based models are unable to capture the target representation region
fully. In contrast, the ETSEF model improves overall understanding and perfectly captures the disease
symptom area.

t-SNE feature distribution analysis. The t-SNE is a technique that visualises

high-dimensional data by giving each data point a location in a two or three-
dimensional map [51]. Our experiment used the t-SNE technique to map each feature
point at a two-dimensional map based on the feature point vectors. Fig. 10 and Fig.
11 provide a visualisation map that compares t-SNE maps between the ETSEF model
and baseline ensemble models, where the baseline model’s features have also undergone
fusion and selection processes.
For task 1, the baseline model trained from scratch shows a biased feature distri-
bution between classes. The yellow points representing the monkeypox class are less
centralised than other classes, with some features mixed with the healthy class. The
TL baseline model group features better than the SSL baseline, as the SSL base-
line misclassified chickenpox features into monkeypox and healthy classes. In contrast,
the ETSEF model’s feature map is clearly classified, with distinct boundaries and
similar-sized feature groups.

27
Fig. 10 Explainable visualisation using t-SNE maps: To analyse the differences in feature
quality after the feature fusion and selection steps, we used the t-SNE technique to visualise the
processed feature distribution. We compared the feature classification performance of all baseline
models with the ETSEF model. The feature map shows improved classification behaviour of the
ETSEF method in T1 and T2 .

28
Fig. 11 Explainable visualisation using t-SNE maps: This figure compares the baseline models
with the ETSEF model using the t-SNE visualisation technique. The ETSEF method’s feature map
shows clearer boundaries in T3 and T4 when compared to the baselines that do not use multiple
pre-training methods.

29
For Task 2, the baseline model trained from scratch produces a map with no clear
boundaries, mixing features from all groups, making them hard to distinguish. The
SSL baseline’s feature quality is worse than the TL baseline, as the SSL feature map
mixes features from esophagitis with normal-z-line class and dyed-resection-margins
with dyed-lifted-polyps class. In contrast, the TL baseline classifies each class more
distinctly, but the ETSEF model provides a more centralised and well-classified feature
group, with clear boundaries and distances between each class.
For Task 3, the baseline model from scratch poorly groups features, splitting malig-
nant and normal class features into two subgroups with many features decentralised
from the group centre. The TL baseline misclassified benign and malignant features,
mixing some features between the two groups. At the same time, the SSL baseline has
similar issues with features from benign and malignant classes mixed. The ETSEF
model produces the best-classified feature map, with clear boundaries for all three
groups and only a few feature points outside their groups.
For Task 4, all baseline and ETSEF models clearly define and classify feature
groups according to their classes. These maps align with our observations from the
performance table 1, as all models achieved extremely high performance in this task.
Still, the ETSEF method has a more centralized feature group with fewer outlier
feature points compared to the TL and SSL baselines.
In summary, comparing the ETSEF model with three baseline models shows that
our ETSEF strategy not only provides better model performance but also yields
higher feature quality. This is a crucial finding because we have learned that a sin-
gle pre-trained base model cannot fully understand and learn a satisfactory target
representation with limited target data. While the ETSEF method can enhance the
base model’s generalisation and provide more robust predictions. The analysis of t-
SNE feature maps demonstrates that the ETSEF method, which combines multiple
DL architectures and pre-training methods, has better feature quality than standard
ensemble methods using a single pre-training method and a few architectures. The
combination of various pre-training methods and architectures makes the ETSEF
method more robust and reliable in real-world clinical environments.

3 Discussion
The challenge of data scarcity has long constrained the application of DL tech-
niques in domains requiring specialised knowledge, such as medical, satellite, and
industrial fields. Recently, advancements in computer hardware have provided large-
scale computational capabilities, enabling the creation of more robust and powerful
models using extensive data volumes. Additionally, the rise of self-supervised and semi-
supervised approaches, which leverage unlabelled data at scale, has further driven
this trend [108, 109]. Researchers are increasingly focusing on modified large language
models, such as vision transformers and other large-scale transformer-based architec-
tures, in conjunction with self-supervised methods to address these issues [110, 111].
However, the high computational costs associated with these approaches often make
them inaccessible to individual users, limiting their use within large companies and
laboratories.

30
Previous studies to compare TL and SSL methods [32] reveal that TL methods offer
advantages such as superior performance on colourful images, faster training speeds,
and lower computational costs. These benefits make TL methods difficult to replace
entirely with SSL approaches. Moreover, our ablation study suggests that the final
performance of the base model varies with different DL architectures, making it hard
to select the best choice under different circumstances. Especially when the ensemble
learning process introduces more feature relationships and factors that may influence
the final prediction.
Based on these insights, we developed this generalist ensemble framework ETSEF,
that leverages both TL and SSL methods to create a more efficient and robust model
capable of performing well in challenging clinical environments with limited data sam-
ples. Experimental results presented in Fig. 1 and Fig. 3 demonstrate the superior
performance of our ETSEF method, which promotes a significant improvement in
each evaluation setting when compared to the powerful baselines and state-of-the-art
methods. With the support of various powerful pre-trained base models, the ETSEF
framework shows promising efficiency in representation learning from limited data
samples.
In this work, we focused on supervised transfer learning and contrastive self-
supervised learning due to their ease of implementation, domain-agnostic nature, and
proven success in natural image classification tasks [112, 113]. To minimise model
implementation difficulty and computational cost, we utilised ImageNet pre-trained
weights, which are readily available from TensorFlow and PyTorch libraries to reduce
the pre-training cost. The subsequent double fine-tuning process for each base model
took an average of 20-40 minutes using TL and 2-3 hours using SSL on personal desk-
top equipment. We also conducted feature fusion and selection steps to reduce the
dimensionality and complexity of extracted features, ensuring an efficient ensemble
learning process. Training higher-level ML classifiers with fused features took only
10-40 seconds, making the final deployment quick and efficient.
Beyond improving domain-specific representation learning and the ensemble learn-
ing process, our ETSEF method demonstrated effectiveness in out-of-distribution
settings. Experimental results (Fig. 2) show that ETSEF outperforms all three base-
line ensemble models when facing unseen medical domains, highlighting the significant
enhancement in model generalisation ability achieved by combining various pre-
training methods. This improvement is crucial for reducing feature domain-specific
designs and enhancing model reusability. The following visual analyses of pre-trained
base models, powerful baseline models, and the ETSEF model reveal that combining
different network architectures with pre-training methods can offset their individual
limitations and improve overall model robustness.
Given the trend towards developing more complex and deeper domain-specific
models, we emphasise the importance of using existing TL and SSL approaches to
minimise structural modifications and simplify the deployment process. Our focus is on
enhancing the extracted feature quality of base pre-trained models through fusion and
selection before further ensemble learning steps. By leveraging standard pre-training
paradigms, we aim to build an efficient and robust model capable of handling various
medical tasks and domains. Compared to large-scale deep models, our method offers

31
reduced computational costs and time while delivering state-of-the-art performance
and robust prediction in real-world scenarios.

3.1 Advantages of ETSEF

Based on the observed performance and visual analysis of ETSEF, we have summarised
the advantages of our ETSEF method as follows:
• Superior Performance in Medical Imaging Tasks: The primary advantage of
ETSEF is its exceptional performance in challenging medical imaging tasks, which
is the main objective of this study. Our experiments went through various medical
domains, which typically lack large-scale training data, and demonstrated that our
ETSEF model achieves state-of-the-art performance. This advantage is crucial in
medical diagnostics, as the main requirement in this area is to provide more accurate
predictions and reduce the risk of misdiagnoses.
• Enhanced Robustness: The enhanced robustness of ETSEF is another significant
advantage, largely due to the utilisation of ensemble learning techniques. The feature
fusion and selection steps effectively leverage each base model’s strengths, leading
to more generalised and reliable predictions. Additionally, the majority voting step
integrates predictions from multiple ML classifiers, providing a final prediction with
the most agreement, thereby reducing the bias risks associated with using a single
classifier.
• Flexible Deployment Environment: ETSEF offers flexibility in the deployment
environment, benefiting from the double-fine-tuning technique. This approach allows
users to pre-train base models with medium-sized or limited-scale source domain
datasets, reducing the risk of negative transfer. Our method enables users to pre-
train base models with unlabeled data using the SSL method, followed by ensemble
learning steps to achieve better performance in scenarios with extremely limited
labelled source data.
• Modular Structure: The modular structure of ETSEF enhances its adaptability
to different domains and tasks. Each pre-trained base model acts as a feature extrac-
tor, providing useful features and representations to support higher-level classifier
ensemble learning. This reusability of pre-trained models reduces waste of compu-
tational costs and accelerates implementation in new environments, making ETSEF
versatile and efficient for various applications.

3.2 Limitations of ETSEF

Meanwhile, we outline the limitations of ETSEF that require attention and future
resolution. Also, we provide suggestions for addressing these issues as directions for
future research.
• High computational cost of SSL pre-trained models: As mentioned ear-
lier, SSL pre-trained base models generally require 3-4 times more training time
compared to TL methods using the same source domain datasets. This issue is exac-
erbated when multiple base models are pre-trained for a single downstream task,
wasting computational resources and raising deployment costs. To address this,

32
future work could focus on utilising lightweight model structures to build more effi-
cient base models, thereby reducing training costs. Another approach is to leverage
the SSL method’s advantage in pre-training with multi-modality data and con-
duct multi-task training to enhance each base model’s reusability and generalisation
capability, thus minimising the pre-trained base models required.
• Lack of explainable tools for higher-level ML classifiers: The integrated
structure of ETSEF poses a challenge in explaining the decision-making process of
all components. While various XAI techniques like Grad-CAM, SHAP, and LIME
can provide visual support for the pre-trained base models, understanding the
learning process of higher-level ML classifiers remains difficult. This is because the
input feature data undergo extraction, concatenation, and selection through several
techniques, losing their sequential and positional information, which complicates
restoring them back to the sample image for visual analysis. In this study, we con-
ducted an ablation study to draw a contribution map for each ML classifier and
reveal their influence. However, it raises the difficulty of fully understanding the
decision process, and there is a lack of XAI tools to provide more intuitive insights.
A potential solution is to use linear XAI tools to analyse the importance of fea-
tures and their effects on the final prediction. Additionally, developing specifically
designed XAI tools to support the model’s explainability in this scenario could be
a good research direction in the future.

4 Methods
ETSEF involves six main training steps: supervised representation learning on the
large-scale non-medical dataset, supervised and self-supervised representation learning
on an intermediate source domain dataset, supervised fine-tuning on the target domain
dataset, feature extraction and fusion from pre-trained base models, feature selection
and ensemble learning with higher-level ML classifiers, and majority voting of ML
classifiers. The baseline models do not combine as many architectures and pre-training
methods as ETSEF. Extended data Fig. A8 and Fig. A9 outline the workflows of
the three baseline models. The two pre-trained baselines share the same ensemble
structure, differing only in the pre-training methods and base models used during the
second representation learning stage.
In this section, we provide implementation details of the ETSEF method and
baseline models, including the experimental environment setup, the hyperparameter
configuration for each DL architecture and higher-level ML classifier, and the addi-
tional useful works of the technologies we used. Moreover, we want to explain the
design choices made in our study and provide the users with a better understanding
of building an efficient ensemble model.

4.1 Experimental Environment Setup

The implementation environment for model training is based on Python 3.10.12 on a
Windows 11 Pro system under a desktop PC equipped with an Intel (R) Core (TM)
i5-12600K CPU at 4.8 GHz, an NVIDIA Geforce RTX 4080 GPU, and 64 GB of RAM.

33
All pre-trained ImageNet weights are publicly available and were downloaded from
the Keras and Torchvision libraries.
Base network choosing and pre-trained models. In this study, we utilised
multiple DL architectures and pre-training methods to train our base models. Given
the significant impact of base network performance on the final model’s effectiveness,
we aimed to select base models with superior performance and data efficiency. We used
ImageNet classification performance as a reference for choosing base architectures,
as it is a widely accepted benchmark for measuring model performance. Initially, we
considered seven prominent neural network architectures: ResNet, Inception, VGG,
InceptionResNet, MobileNet, NasNet, and EfficientNet. Additionally, we evaluated
popular transformer-based models such as Vision Transformer and Swin Transformer.
Among the CNN architectures, we excluded VGG due to its bad performance
and Inception due to its similar structure but lower performance when compared
with InceptionResNet. We also excluded the transformer-based models as the prelim-
inary experiments indicated that they struggled with extremely limited data samples,
even with pre-training methods. Issues such as overfitting, gradient collapse, and
the inability to update layer parameters were observed due to the conflict between
these models’ large training data requirements and the limited target data available.
Eventually, we selected Xception, InceptionResNetv2, MobileNetv3, ResNet101, and
EfficientNetB0 as our base models. These medium-sized versions of base architec-
tures, such as ResNet101 and EfficientNetB0, were chosen to balance performance and
computational cost.
For initialising the base models, we loaded ImageNet [35] weights into the back-
bone architectures. ImageNet was chosen as it is one of the most widely used source
domains for pre-training methods, providing a generalised knowledge base to prevent
base models from overfitting. Additionally, pre-trained ImageNet weights are easily
accessible from various Python libraries and can be prepared within a few minutes.
The subsequent representation learning steps were designed to learn domain-specific
knowledge from intermediate source domain datasets, selected based on the target
datasets’ feature distribution and imaging techniques. Both the ETSEF model and the
baseline models shared the same pre-trained base models to save computational costs
and experimental time. This setup also allowed us to compare ETSEF with the base-
lines under the same conditions, demonstrating the benefits of the integrated method
over using a single pre-training method.
Higher-level ML classifiers choosing. In addition to the pre-trained base mod-
els, we trained a group of higher-level ML classifiers using the extracted features to
learn correlations from feature distributions and make more robust predictions. For
the ML classifiers, we considered popular and widely approved architectures, including
Logistic Regression (LR), Adaptive Boosting (AdaBoost), Support Vector Machines,
K-nearest Neighbours, XGBoost, Random Forest, and Naive Bayes. These classifiers
were chosen for their simple structures and efficient statistical methods, enabling them
to learn effectively from input data.
Similar to the selection process for the base models, we excluded Logistic Regres-
sion due to its lower performance compared to other methods. We also excluded
Adaptive Boosting because both it and XGBoost are boosting-based methods, and

34
XGBoost offers better data efficiency and training speed. Ultimately, we selected five
ML classifiers: SVM, KNN, XGBoost, RF, and NB. This selection covers various statis-
tical approaches, including boosting, regression, tree-based methods, and probabilistic
models, providing a more generalised measurement and robust prediction capability.
Data preprocess and augmentation details. Considering the varying sizes of
image samples due to different medical imaging techniques and collection strategies
used to form the training datasets, we applied standard data preprocessing to all data
within the source and target domain datasets. We resized all image samples to 224 x
224 pixels to meet the input size requirements of the base models. Additionally, we
performed one-hot encoding on all labels of the target dataset samples to facilitate the
models’ decision-making process and simplify probability prediction. For grey-scale
images, we extended their colour channel to three by repeating the original channel,
ensuring compatibility with the base models’ input requirements.
Data imbalance is a common issue in medical domains, and many of our target
domain datasets also exhibit significant class imbalance. To address this, we imple-
mented a series of data augmentation strategies to reduce the imbalance and improve
the quality of our training data. Specifically, we used rotation, zoom, flip, and blur
techniques. The rotation range was set from -15 to +15 degrees, and the zoom range
was from 0.8 to 1.0 of the original image size. Both horizontal and vertical flips were
applied, and the kernel size for the blur strategy was set to 9.
However, we avoided using complex or specifically designed augmentation strate-
gies as this was not the focus of our study. Instead, we relied on standard augmentation
techniques that have been proven effective in previous research to ensure their relia-
bility. Additionally, we limited the use of data augmentation to the training data by
only augmenting classes with fewer samples to the size of the class with most samples
to maintain the real-world data scale and preserve the authenticity of learned features,
thereby supporting the test of the ETSEF method’s efficiency in limited real-world
clinical scenarios.

4.2 Hyperparameter and Classification Head Configuration

Generally speaking, fine-tuning a base model requires more than minimising the loss
function with labelled data; it also necessitates a deep understanding of the model’s
components. The selection of learning hyperparameters and the configuration of fully
connected layers can significantly impact the final model’s performance. The following
sentences provide details of our hyperparameter choices and classification head design.
Base network’s configuration. The pre-training and fine-tuning procedures of
pre-trained base models are influenced by several hyperparameters, including batch
size, optimiser, weight decay, learning rate, and training epochs. We selected the Adam
optimiser for the TL base models and set the initial learning rate to 0.001 to prevent the
gradient from dropping too quickly and causing overfitting. We set the batch size to 64
to manage memory usage effectively and the training epochs to 50 to ensure sufficient
representation learning. Additionally, we used a weight decay of 10−5 and monitored
the training loss to adjust the learning rate if the lowest loss level did not improve
within three epochs. We used a similar setup for the SSL base models to maintain
consistency and minimise the influence of changing hyperparameters on performance.

35
The Adam optimiser was also used for the SSL base models, with the learning rate
initialised at 0.001. The batch size and training epochs remained the same as for the
TL base models, with the only difference being the weight decay monitoring, which
started directly at 40 epochs instead of supervising training loss.
The fully connected layers, also known as the classification head in classification
tasks, have played a crucial role in influencing a model’s prediction performance. A
typical classification head includes a flatten layer to convert the learned features from
the base models into a one-dimensional array, followed by a pooling layer to down-
sample and highlight significant features. After pooling, a dense layer with numerous
interconnected nodes links the pooling layer to the final output layer, establishing the
relationships between learned features. The final output layer is usually a linear layer
and works by mapping the model predictions to their corresponding labels.
For TL base models, the classification head during both the pre-training and fine-
tuning stages includes a flattened layer and a global average pooling layer to process
the learned features, followed by three dense layers to refine and connect these fea-
tures. We utilised another kind of layer called the dropout layer that drops out nodes
by randomly setting them to 0 to mitigate overfitting and enhance generalisation. A
dropout layer with a dropout rate of 0.3 is used after the dense layers. The final out-
put layer employs the softmax function to select the label with the highest prediction
probability. Conversely, the classification head for SSL base models differs between the
pre-training and fine-tuning stages. During the pre-training stage, which is unsuper-
vised, the classification head includes a flattened layer, a max pooling layer, and two
dense layers but excludes the final linear layer, with performance evaluated directly
by the loss function. In the fine-tuning stage, the classification head retains the same
layer configuration but incorporates an output layer with a softmax function to enable
supervised fine-tuning.
Higher-level ML classifiers’ configuration. Different from the base networks,
our ML classifiers are based on different statistical methods and require tailored set-
tings for each classifier’s hyperparameter. For the SVM classifier, key hyperparameters
included penalty, tolerance, loss function, random state, and max iterations. The
penalty and the tolerance determine the decision boundary of the SVM classifier, as
the penalty is a regularization term that controls the model’s complexity and the
degree of tolerance specifies how much misclassification the SVM can tolerate. We
tested different penalty functions and various tolerance numbers from 0.1 to 0.0001,
and we selected the l2 penalty term and 0.001 as the optimal tolerance, as it provided
the best average performance across all test cases. We kept the other hyperparame-
ters, like the loss function and max iterations, at their default settings to reduce the
model’s variable complexity and maintain performance stability.
The XGBoost classifier uses a booster hyperparameter to decide the statistical
method; several options exist, including gradient boost tree (gbtree), gradient boost
linear (gblinear), and dart. We selected the tree-based booster based on test case
performance for its higher performance behaviour. Additionally, we fine-tuned the
tree booster’s hyperparameters, including eta (the learning rate) and max depth (the
maximum depth of the tree). We tested eta settings ranging from 0 to 1 and chose 0.9

36
to optimise performance. For max depth, we set it to 10 to balance model complexity
and prevent overfitting.
The NB classifier follows a Gaussian distribution and relies more on the data
distribution than the hyperparameters. It has two hyperparameters: priors, which
set prior probabilities for the classes, and var smoothing, which adjusts the largest
variance portion of all features. For this classifier, we used the default settings, as the
test experiment shows that they have a minor influence on the final performance.
The RF classifier is another tree-based classifier, and it involves various hyperpa-
rameters such as the number of estimators, criterion, max depth, and min-sample-split
that affect the classifier’s performance. The estimators, criterion, and max depth are
tree-specific parameters. Estimators determine the number of trees in the classifier,
max depth defines the maximum depth of each tree, and the criterion measures the
quality of a split. Additionally, the Min-Sample-Split parameter specifies the mini-
mum number of samples required to split an internal node. Similar to the XGBoost
classifier setup, we set the max depth of the tree classifier to 10 and kept the number
of estimators at the default of 100 to control the classifier size and computational cost.
We also tested min-sample-split values ranging from 1 to 9 and evaluated criteria such
as Gini, entropy, and log-loss functions. Ultimately, we chose a min-sample-split value
of 3 and the Gini criterion to maximise the classifier’s performance.
The KNN classifier is a classic ML method for classification and regression prob-
lems, using proximity to group similar data points. The main hyperparameters of a
KNN classifier include the number of neighbours, the weight function, and the algo-
rithm. The number of neighbours K determines how many instances are assigned to
the same class as the nearest neighbour, and we set the k = 3 to reduce bias between
feature points and avoid overfitting. The choice of weight function also influences
predictions: the uniform weight function assigns equal weight to all points in each
neighbourhood, while the distance function weights points by the inverse of their dis-
tance, giving closer points more importance. We selected the distance function for our
classifier to prioritise closer feature points and centralise the feature group. We also
chose ”auto” for the algorithm option to allow the classifier to automatically select
the most appropriate algorithm based on the input values.

4.3 Additional Related Work

In addition to the methods used in this work, numerous research efforts have employed
pre-trained and ensemble learning approaches to address data limitations in the
medical field. These approaches can serve as valuable inspiration, broadening our per-
spective and offering additional insights. In this section, we review previous works
related to medical studies to provide a deeper context and inform our model design.
Transfer learning. Despite the scarcity of large-scale labelled medical source
domains, TL has been widely applied to medical imaging tasks using large generic
datasets as source domain [114]. However, research shows that a lightweight model
trained from scratch on medical images performs comparably to a pre-trained model on
ImageNet [115], although the pre-trained method offers extra benefits such as speeding
up convergence and reducing overfitting [38]. At the same time, studies indicate that
pre-training with medium-sized labelled medical datasets from the same domain yields

37
better performance compared to training from scratch, demonstrating the value of TL
in transferring knowledge from in-distribution domains [116–118]. Compared to the
supervised TL approaches in which labelled data are expensive and time-consuming
to obtain, unsupervised TL approaches that use unsupervised pre-training strategies
seem to be a better option to make use of large-scale unlabelled medical datasets
[119]. Nevertheless, traditional unsupervised TL often fails to achieve state-of-the-art
performance, whereas the SSL method as a new branch of the unsupervised TL has
garnered increasing attention [120, 121].
Additionally, multi-source TL has achieved notable success, enhancing the model’s
representation learning by increasing the modalities of patients and datasets [122, 123].
While the double fine-tuning technique used in this work leverages both large-scale
generic datasets and medical domain datasets, extending feature generalisation while
maintaining domain-specific knowledge [36, 37]. Our approach also takes advantage of
this double fine-tuning method to reduce the pre-trained model’s dependency on large-
scale medical datasets, and we found that utilising the large-scale generic knowledge
base allows the use of small medical datasets as intermediate source domains without
causing overfitting.
Self-supervised learning. Previous work has categorised the SSL method into
three groups, including contrastive self-supervised learning, predictive self-supervised
learning, and generative self-supervised learning [30, 124], with all three groups have
been successfully applied to the medical domain [125]. The primary distinction among
these methods lies in their use of different pretext tasks to guide the pre-training pro-
cess, which in turn offers various advantages and limitations for downstream tasks
[126]. For instance, a typical generative SSL method called Generative Adversarial
Network (GAN) [127] employs encoders to learn from original images and generate
synthetic images based on the learned features. Another set of encoders then classifies
the generated images as real or fake, enabling the GAN model to enhance its fea-
ture learning through adversarial competition. Additionally, this approach provides a
potential solution to address data scarcity issues in the medical domain by generating
large-scale synthetic medical images for training purposes. Meanwhile, the predictive
SSL method employs puzzle-based pretext tasks to pre-train the model, enabling it to
learn extensive knowledge of positional representations from the training data, making
this approach particularly suitable for downstream tasks involving medical localisation
and prediction.
Contrastive self-supervised methods offer more advantages in classification tasks
than predictive and generative self-supervised learning. These methods pre-train mod-
els by comparing and distinguishing input data with pseudo labels, enabling them
to learn powerful representations that effectively discriminate the original data [128].
Contrastive self-supervised methods such as RELIC [129], MoCo [130], SimCLR [58]
and SwAV [131], have been applied to medical imaging tasks, achieving significant
improvements in label efficiency [132, 133].
Moreover, some works have attempted to design domain-specific pretext tasks to
achieve better performance [134, 135], while others focus on multi-task and modality
methods [136, 137]. Compared to supervised TL methods, SSL methods using an
unsupervised paradigm have shown better potential in multi-task and multi-modality

38
learning, as they do not rely on data labels, which enlarges the availability of the
training data and avoids increasing the model’s complexity. Currently, there is limited
work considering combining SSL with ensemble learning methods, and we are the first
work to integrate both TL and SSL pre-trained models as the base models to explore
their effectiveness with ensemble learning techniques.
Ensemble learning. Novel and high-performance medical image classification
pipelines increasingly utilise ensemble learning strategies. While many ensemble learn-
ing approaches combined with deep neural networks have been successful in the
medical domain [138, 139], it remains an open question to what extent and which
specific strategies are most beneficial for deep learning-based medical image classifi-
cation [140]. Meanwhile, traditional ensemble strategies using simple ML classifiers
have also proven effective in medical tasks [141, 142]. However, some studies suggest
that ensembling deep neural networks may not be necessary because it significantly
increases total computational costs due to the need to train multiple networks [143].
Additionally, further ensembling with pre-trained models has been viewed as a waste
of computational resources, adding to the training burden.
Despite these concerns, recent work on ensembling pre-trained models has shown
significant performance improvements [144, 145], highlighting the potential of ensem-
ble learning strategies combined with powerful pre-trained models. Based on these
previous works, our study filled this knowledge gap by testing the combination of
ensemble learning with recently appeared SSL pre-training methods. We also intro-
duce popular TL pre-training methods to complement the SSL approach’s limitation.
More importantly, the experiment shows the superior performance and robustness of
combining ensemble learning with multiple pre-training methods and architectures.

5 Conclusion
In this work, our objective is to develop an effective deep learning strategy to address
the challenge of data scarcity in medical imaging tasks. As a result, we present a novel
ensemble framework called ‘ETSEF’ that combines pre-trained base models using TL
and SSL methods. The ETSEF strategy has demonstrated state-of-the-art perfor-
mance and robustness when handling limited training data and challenging medical
tasks. Additionally, it is convenient and easy to deploy, and it does not need additional
modification of the base model architectures.
We evaluated the effectiveness of using an ensemble learning strategy with pre-
training methods by comparing pre-training baseline ensemble models with scratch-
trained baseline ensemble models. The results showed that combining pre-trained base
models significantly improved the whole ensemble model’s final performance, under-
scoring the value and necessity of incorporating pre-trained methods to enhance model
performance. Our findings also highlighted that TL and SSL-based ensemble models
have distinct advantages and limitations towards different data environments, which
depend on the training data’s colour channel, size, background, imaging techniques,
synthetic representations and other factors. Thus, we stress the need for careful consid-
eration of the data environment when combining pre-training methods with ensemble
learning.

39
Furthermore, our experiments explored multiple ensemble techniques, including
feature fusion, feature selection, and weighted voting. The results demonstrated that
feature fusion and selection techniques are crucial for maintaining feature quality and
significantly improving the final performance of the ensemble model. Weighted voting
also improved decision-making accuracy and enhanced the overall reliability of the
ensemble model, and we recommend that followers consider these useful techniques
when building ensemble models.
Due to the limitations of our experiment scale and computational resources, we did
not extensively explore multi-task and modality learning for continuously enhancing
pre-trained base models’ generalisation, nor did we test more pre-training strategies
and powerful architectures to strengthen the base models’ feature learning ability. In
the future, we plan to extend and test ETSEF in other areas that are bothered by
limited training data, and to explore other potential techniques to build powerful and
efficient ensemble models.
Data availability. All the datasets used in this study are based on publicly
available datasets, including source domain datasets and target domain datasets. For
the dermatology monkeypox classification task, the source dataset MSID is under
the CC by 4.0 licence and the target dataset MSLDv2 is under the CC by-NC 4.0
licence, and both of them can be downloaded from Kaggle library. For the gastroin-
testinal endoscopy classification task, the source dataset Medico2018 is based on a
public medical task, and the corresponding data are available to participants and other
multimedia researchers without restriction. The target dataset Kvasirv2 is under the
CC BY-SA 4.0 licence and can be accessed from Kaggle library. For the breast ultra-
sound classification task, the source dataset POCS and target dataset BusI are both
available from Kaggle library. For the brain tumour detection task, the source dataset
Br35H is shared by its author Ahmed Hamada through Kaggle library, while the tar-
get dataset BTC is under the MIT licence and can be accessed from the author’s
Github website. The out-of-distribution test dataset EyePacs is a publicly available
dataset share within Kaggle library. Moreover, the source domain dataset ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) that has been used for each base
model’s pre-training is publicly available via ImageNet Website.
Code availability. To train supervised transfer learning base models, guidelines
and tutorials are available in open-source repositories, including the Keras library and
Tensorflow library. For building self-supervised learning base models, the SimCLR
GitHub Website provides a good example of implementing contrastive self-supervised
pre-training with SimCLR. For more information and implementation samples of
GradCAM in PyTorch, refer to the Pytorch-GradCAM Website. For GradCAM in Ten-
sorFlow, visit the Keras-GradCAM Website. Additionally, information on the SHAP
explainer is available in the Deep SHAP Documentation Website. Detailed source
code and implementation samples of our work can be found on our GitHub Website
facilitating better understanding and utilisation of our ETSEF method.

References
[1] Fernandez-Quilez, A.: Deep learning in radiology: ethics of data and on the value

40
of algorithm transparency, interpretability and explainability. AI and Ethics
3(1), 257–265 (2023) https://doi.org/10.1007/s43681-022-00161-9

[2] Saba, L., Biswas, M., Kuppili, V., Godia, E.C., Suri, H.S., Edla, D.R., Omerzu,
T., Laird, J.R., Khanna, N.N., Mavrogeni, S., et al.: The present and future
of deep learning in radiology. European journal of radiology 114, 14–24 (2019)
https://doi.org/10.1016/j.ejrad.2019.02.038

[3] Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L.,
Tse, D., Etemadi, M., Ye, W., Corrado, G., et al.: End-to-end lung cancer
screening with three-dimensional deep learning on low-dose chest computed
tomography. Nature medicine 25(6), 954–961 (2019) https://doi.org/10.1038/
s41591-019-0447-x

[4] Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun,
S.: Dermatologist-level classification of skin cancer with deep neural networks.
nature 542(7639), 115–118 (2017) https://doi.org/10.1038/nature21056

[5] Mazhar, T., Haq, I., Ditta, A., Mohsan, S.A.H., Rehman, F., Zafar, I., Gansau,
J.A., Goh, L.P.W.: The role of machine learning and deep learning approaches
for the detection of skin cancer. In: Healthcare, vol. 11, p. 415 (2023). https:
//doi.org/10.3390/healthcare11030415 . MDPI

[6] Panahi, A., Askari Moghadam, R., Tarvirdizadeh, B., Madani, K.: Simpli-
fied u-net as a deep learning intelligent medical assistive tool in glaucoma
detection. Evolutionary Intelligence 17(2), 1023–1034 (2024) https://doi.org/
10.1007/s12065-022-00775-2

[7] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Land-
man, B., Roth, H.R., Xu, D.: Unetr: Transformers for 3d medical
image segmentation. In: Proceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision, pp. 574–584 (2022). https:
//openaccess.thecvf.com/content/WACV2022/html/Hatamizadeh UNETR
Transformers for 3D Medical Image Segmentation WACV 2022 paper.html

[8] Khader, F., Müller-Franzes, G., Tayebi Arasteh, S., Han, T., Haarburger, C.,
Schulze-Hagen, M., Schad, P., Engelhardt, S., Baeßler, B., Foersch, S., et al.:
Denoising diffusion probabilistic models for 3d medical image generation. Scien-
tific Reports 13(1), 7303 (2023) https://doi.org/10.1038/s41598-023-34341-2

[9] Chan, H.-P., Samala, R.K., Hadjiiski, L.M., Zhou, C.: Deep learning in med-
ical image analysis. Deep learning in medical image analysis: challenges and
applications, 3–21 (2020) https://doi.org/10.1007/978-3-030-33128-3 1

[10] Alzubaidi, L., Bai, J., Al-Sabaawi, A., Santamarı́a, J., Albahri, A.S., Al-
dabbagh, B.S.N., Fadhel, M.A., Manoufali, M., Zhang, J., Al-Timemy, A.H., et

41
al.: A survey on deep learning tools dealing with data scarcity: definitions, chal-
lenges, solutions, tips, and applications. Journal of Big Data 10(1), 46 (2023)
https://doi.org/10.1186/s40537-023-00727-2

[11] Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z., Ding, X.: Embracing
imperfect datasets: A review of deep learning solutions for medical image seg-
mentation. Medical image analysis 63, 101693 (2020) https://doi.org/10.1016/
j.media.2020.101693

[12] Kaissis, G., Ziller, A., Passerat-Palmbach, J., Ryffel, T., Usynin, D., Trask,
A., Lima Jr, I., Mancuso, J., Jungmann, F., Steinborn, M.-M., et al.: End-to-
end privacy preserving deep learning on multi-institutional medical imaging.
Nature Machine Intelligence 3(6), 473–484 (2021) https://doi.org/10.1038/
s42256-021-00337-8

[13] Dhar, T., Dey, N., Borra, S., Sherratt, R.S.: Challenges of deep learning in
medical image analysis—improving explainability and trust. IEEE Transactions
on Technology and Society 4(1), 68–75 (2023) https://doi.org/10.1109/TTS.
2023.3234203

[14] Zhou, S.K., Greenspan, H., Davatzikos, C., Duncan, J.S., Van Ginneken, B.,
Madabhushi, A., Prince, J.L., Rueckert, D., Summers, R.M.: A review of deep
learning in medical imaging: Imaging traits, technology trends, case studies with
progress highlights, and future promises. Proceedings of the IEEE 109(5), 820–
838 (2021) https://doi.org/10.1109/JPROC.2021.3054390

[15] Banerjee, J., Taroni, J.N., Allaway, R.J., Prasad, D.V., Guinney, J., Greene,
C.: Machine learning in rare disease. Nature Methods 20(6), 803–814 (2023)
https://doi.org/10.1038/s41592-023-01886-z

[16] Bansal, M.A., Sharma, D.R., Kathuria, D.M.: A systematic review on data
scarcity problem in deep learning: solution and applications. ACM Computing
Surveys (CSUR) 54(10s), 1–29 (2022) https://doi.org/10.1145/3502287

[17] Javaid, M., Haleem, A., Singh, R.P., Suman, R., Rab, S.: Significance of machine
learning in healthcare: Features, pillars and applications. International Journal
of Intelligent Networks 3, 58–73 (2022) https://doi.org/10.1016/j.ijin.2022.05.
002

[18] Chen, X., Wang, X., Zhang, K., Fung, K.-M., Thai, T.C., Moore, K., Mannel,
R.S., Liu, H., Zheng, B., Qiu, Y.: Recent advances and clinical applications of
deep learning in medical image analysis. Medical Image Analysis 79, 102444
(2022) https://doi.org/10.1016/j.media.2022.102444

[19] Petersen, E., Potdevin, Y., Mohammadi, E., Zidowitz, S., Breyer, S., Nowotka,
D., Henn, S., Pechmann, L., Leucker, M., Rostalski, P., et al.: Responsible
and regulatory conform machine learning for medicine: a survey of challenges

42
and solutions. IEEE Access 10, 58375–58418 (2022) https://doi.org/10.1109/
ACCESS.2022.3178382

[20] Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for
deep learning. Journal of big data 6(1), 1–48 (2019) https://doi.org/10.1186/
s40537-019-0197-0

[21] Kebaili, A., Lapuyade-Lahorgue, J., Ruan, S.: Deep learning approaches for data
augmentation in medical imaging: a review. Journal of Imaging 9(4), 81 (2023)
https://doi.org/10.3390/jimaging9040081

[22] Laurer, M., Van Atteveldt, W., Casas, A., Welbers, K.: Less annotating, more
classifying: Addressing the data scarcity issue of supervised machine learning
with deep transfer learning and bert-nli. Political Analysis 32(1), 84–100 (2024)
https://doi.org/10.1017/pan.2023.20

[23] Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q.: A
comprehensive survey on transfer learning. Proceedings of the IEEE 109(1),
43–76 (2020) https://doi.org/10.1109/JPROC.2020.3004555

[24] Iman, M., Arabnia, H.R., Rasheed, K.: A review of deep transfer learning and
recent advancements. Technologies 11(2), 40 (2023) https://doi.org/10.3390/
technologies11020040

[25] Malik, H., Farooq, M.S., Khelifi, A., Abid, A., Qureshi, J.N., Hussain, M.: A
comparison of transfer learning performance versus health experts in disease
diagnosis from medical imaging. IEEE Access 8, 139367–139386 (2020) https:
//doi.org/10.1109/ACCESS.2020.3004766

[26] Kathamuthu, N.D., Subramaniam, S., Le, Q.H., Muthusamy, S., Panchal, H.,
Sundararajan, S.C.M., Alrubaie, A.J., Zahra, M.M.A.: A deep transfer learning-
based convolution neural network model for covid-19 detection using computed
tomography scan images for medical applications. Advances in Engineering
Software 175, 103317 (2023) https://doi.org/10.1016/j.advengsoft.2022.103317

[27] Salehi, A.W., Khan, S., Gupta, G., Alabduallah, B.I., Almjally, A., Alsolai,
H., Siddiqui, T., Mellit, A.: A study of cnn and transfer learning in medical
imaging: Advantages, challenges, future scope. Sustainability 15(7), 5930 (2023)
https://doi.org/10.3390/su15075930

[28] Zhang, W., Deng, L., Zhang, L., Wu, D.: A survey on negative transfer. IEEE/-
CAA Journal of Automatica Sinica 10(2), 305–329 (2022) https://doi.org/10.
1109/JAS.2022.106004

[29] Niu, S., Liu, Y., Wang, J., Song, H.: A decade survey of transfer learning (2010–
2020). IEEE Transactions on Artificial Intelligence 1(2), 151–166 (2020) https:
//doi.org/10.1109/TAI.2021.3054609

43
[30] Krishnan, R., Rajpurkar, P., Topol, E.J.: Self-supervised learning in medicine
and healthcare. Nature Biomedical Engineering 6(12), 1346–1352 (2022) https:
//doi.org/10.1038/s41551-022-00914-1

[31] Navarro, F., Watanabe, C., Shit, S., Sekuboyina, A., Peeken, J.C., Combs, S.E.,
Menze, B.H.: Evaluating the robustness of self-supervised learning in medical
imaging. arXiv preprint arXiv:2105.06986 (2021) https://doi.org/https://arxiv.
org/abs/2105.06986

[32] Zhao, Z., Alzubaidi, L., Zhang, J., Duan, Y., Gu, Y.: A comparison review of
transfer learning and self-supervised learning: Definitions, applications, advan-
tages and limitations. Expert Systems with Applications, 122807 (2023) https:
//doi.org/10.1016/j.eswa.2023.122807

[33] Ericsson, L., Gouk, H., Loy, C.C., Hospedales, T.M.: Self-supervised represen-
tation learning: Introduction, advances, and challenges. IEEE Signal Processing
Magazine 39(3), 42–62 (2022) https://doi.org/10.1109/MSP.2021.3134634

[34] Ericsson, L., Gouk, H., Hospedales, T.: Why do self-supervised models transfer?
on the impact of invariance on downstream tasks. In: The 33rd British Machine
Vision Conference, 2022, p. 509 (2022). https://doi.org/https://arxiv.org/abs/
2111.11398 . BMVA Press

[35] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision
and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.
2009.5206848 . Ieee

[36] Tan, B., Zhang, Y., Pan, S., Yang, Q.: Distant domain transfer learning. In:
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017).
https://doi.org/10.1609/aaai.v31i1.10826

[37] Niu, S., Liu, M., Liu, Y., Wang, J., Song, H.: Distant domain transfer learning
for medical imaging. IEEE Journal of Biomedical and Health Informatics 25(10),
3784–3793 (2021) https://doi.org/10.1109/JBHI.2021.3051470

[38] Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., Loh,
A., Karthikesalingam, A., Kornblith, S., Chen, T., et al.: Big self-supervised
models advance medical image classification. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp. 3478–3488 (2021). https:
//openaccess.thecvf.com/content/ICCV2021/html/Azizi Big Self-Supervised
Models Advance Medical Image Classification ICCV 2021 paper.html

[39] Azizi, S., Culp, L., Freyberg, J., Mustafa, B., Baur, S., Kornblith, S., Chen,
T., Tomasev, N., Mitrović, J., Strachan, P., et al.: Robust and data-efficient
generalization of self-supervised machine learning for diagnostic imaging.
Nature Biomedical Engineering 7(6), 756–779 (2023) https://doi.org/10.1038/

44
s41551-023-01049-7

[40] Yang, S., Chen, L.-F., Yan, T., Zhao, Y.-H., Fan, Y.-J.: An ensemble classifi-
cation algorithm for convolutional neural network based on adaboost. In: 2017
IEEE/ACIS 16th International Conference on Computer and Information Sci-
ence (ICIS), pp. 401–406 (2017). https://doi.org/10.1109/ICIS.2017.7960026 .
IEEE

[41] Zhou, K., Yang, Y., Qiao, Y., Xiang, T.: Domain adaptive ensemble learning.
IEEE Transactions on Image Processing 30, 8008–8018 (2021) https://doi.org/
10.1109/TIP.2021.3112012

[42] Mungoli, N.: Adaptive ensemble learning: Boosting model performance

through intelligent feature fusion in deep neural networks. arXiv preprint
arXiv:2304.02653 (2023) https://doi.org/10.48550/arXiv.2304.02653

[43] Reddy, A.S.K., Rao, K.B., Soora, N.R., Shailaja, K., Kumar, N.S., Sridha-
ran, A., Uthayakumar, J.: Multi-modal fusion of deep transfer learning based
covid-19 diagnosis and classification using chest x-ray images. Multimedia
Tools and Applications 82(8), 12653–12677 (2023) https://doi.org/10.1007/
s11042-022-13739-6

[44] Wani, I.M., Arora, S.: Osteoporosis diagnosis in knee x-rays by transfer learning
based on convolution neural network. Multimedia Tools and Applications 82(9),
14193–14217 (2023) https://doi.org/10.1007/s11042-022-13911-y

[45] Tang, Y., Yang, D., Li, W., Roth, H.R., Landman, B., Xu, D., Nath, V.,
Hatamizadeh, A.: Self-supervised pre-training of swin transformers for 3d medi-
cal image analysis. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 20730–20740 (2022). https://openaccess.
thecvf.com/content/CVPR2022/html/Tang Self-Supervised Pre-Training of
Swin Transformers for 3D Medical Image Analysis CVPR 2022 paper.html

[46] Li, C., Yang, Y., Liang, H., Wu, B.: Transfer learning for establishment of recog-
nition of covid-19 on ct imaging using small-sized training datasets. Knowledge-
Based Systems 218, 106849 (2021) https://doi.org/10.1016/j.knosys.2021.
106849

[47] Wolf, D., Payer, T., Lisson, C.S., Lisson, C.G., Beer, M., Götz, M., Ropinski, T.:
Self-supervised pre-training with contrastive and masked autoencoder methods
for dealing with small datasets in deep learning for medical imaging. Scientific
Reports 13(1), 20260 (2023) https://doi.org/10.1038/s41598-023-46433-0

[48] Ghassemi, N., Shoeibi, A., Rouhani, M.: Deep neural network with generative
adversarial networks pre-training for brain tumor classification based on mr
images. Biomedical Signal Processing and Control 57, 101678 (2020) https://
doi.org/10.1016/j.bspc.2019.101678

45
[49] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.:
Grad-cam: Visual explanations from deep networks via gradient-based local-
ization. In: Proceedings of the IEEE International Conference on Computer
Vision, pp. 618–626 (2017). https://openaccess.thecvf.com/content iccv 2017/
html/Selvaraju Grad-CAM Visual Explanations ICCV 2017 paper.html

[50] Ribeiro, M.T., Singh, S., Guestrin, C.: ” why should i trust you?” explaining
the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 1135–
1144 (2016). https://doi.org/10.1145/2939672.2939778

[51] Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning
research 9(11) (2008)

[52] Asif, S., Zhao, M., Tang, F., Zhu, Y.: An enhanced deep learning method for
multi-class brain tumor classification using deep transfer learning. Multimedia
Tools and Applications 82(20), 31709–31736 (2023) https://doi.org/10.1007/
s11042-023-14828-w

[53] Chow, L.S., Tang, G.S., Solihin, M.I., Gowdh, N.M., Ramli, N., Rahmat,
K.: Quantitative and qualitative analysis of 18 deep convolutional neural net-
work (cnn) models with transfer learning to diagnose covid-19 on chest x-ray
(cxr) images. SN Computer Science 4(2), 141 (2023) https://doi.org/10.1007/
s42979-022-01545-8

[54] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp. 1251–1258 (2017). https://openaccess.thecvf.com/content cvpr 2017/
html/Chollet Xception Deep Learning CVPR 2017 paper.html

[55] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, inception-resnet
and the impact of residual connections on learning. In: Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 31 (2017). https://doi.org/10.1609/
aaai.v31i1.11231

[56] Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W.,
Zhu, Y., Pang, R., Vasudevan, V., et al.: Searching for mobilenetv3. In: Proceed-
ings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–
1324 (2019). https://openaccess.thecvf.com/content ICCV 2019/html/Howard
Searching for MobileNetV3 ICCV 2019 paper.html

[57] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A Simple Framework for
Contrastive Learning of Visual Representations, pp. 1597–1607 (2020). PMLR.
https://proceedings.mlr.press/v119/chen20j.html

[58] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-
supervised models are strong semi-supervised learners. Advances in neural

46
information processing systems 33, 22243–22255 (2020)

[59] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pp. 770–778 (2016). https://openaccess.thecvf.com/content cvpr 2016/
html/He Deep Residual Learning CVPR 2016 paper.html

[60] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional
neural networks. In: International Conference on Machine Learning, pp.
6105–6114 (2019). PMLR. https://proceedings.mlr.press/v97/tan19a.html?ref=
jina-ai-gmbh.ghost.io

[61] Mangai, U.G., Samanta, S., Das, S., Chowdhury, P.R.: A survey of decision
fusion and feature fusion strategies for pattern classification. IETE Technical
review 27(4), 293–307 (2010) https://doi.org/10.4103/0256-4602.64604

[62] Zhang, R., Nie, F., Li, X., Wei, X.: Feature selection with multi-view data: A
survey. Information Fusion 50, 158–167 (2019) https://doi.org/10.1016/j.inffus.
2018.11.019

[63] Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent
developments. Philosophical transactions of the royal society A: Mathematical,
Physical and Engineering Sciences 374(2065), 20150202 (2016) https://doi.org/
10.1098/rsta.2015.0202

[64] Choubey, D.K., Kumar, M., Shukla, V., Tripathi, S., Dhandhania, V.K.:
Comparative analysis of classification methods with pca and lda for dia-
betes. Current diabetes reviews 16(8), 833–850 (2020) https://doi.org/10.2174/
1573399816666200123124008

[65] Calhoun, V.D., Liu, J., Adalı, T.: A review of group ica for fmri data and ica
for joint inference of imaging, genetic, and erp data. Neuroimage 45(1), 163–172
(2009) https://doi.org/10.1016/j.neuroimage.2008.10.057

[66] Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. CRC press, ???
(2012). https://doi.org/10.1201/b12207

[67] Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q.: A survey on ensemble learn-
ing. Frontiers of Computer Science 14, 241–258 (2020) https://doi.org/10.1007/
s11704-019-8208-z

[68] González, S., Garcı́a, S., Del Ser, J., Rokach, L., Herrera, F.: A practical tuto-
rial on bagging and boosting based ensembles for machine learning: Algorithms,
software tools, performance study, practical perspectives and opportunities.
Information Fusion 64, 205–237 (2020) https://doi.org/10.1016/j.inffus.2020.07.
007

47
[69] Wu, X., Wang, J.: Application of bagging, boosting and stacking ensemble and
easyensemble methods for landslide susceptibility mapping in the three gorges
reservoir area of china. International Journal of Environmental Research and
Public Health 20(6), 4977 (2023) https://doi.org/10.3390/ijerph20064977

[70] Louk, M.H.L., Tama, B.A.: Dual-ids: A bagging-based gradient boosting decision
tree model for network anomaly intrusion detection system. Expert Systems with
Applications 213, 119030 (2023) https://doi.org/10.1016/j.eswa.2022.119030

[71] Rokach, L.: Decision forest: Twenty years of research. Information Fusion 27,
111–125 (2016) https://doi.org/10.1016/j.inffus.2015.06.005

[72] Khozeimeh, F., Sharifrazi, D., Izadi, N.H., Joloudari, J.H., Shoeibi, A.,
Alizadehsani, R., Tartibi, M., Hussain, S., Sani, Z.A., Khodatars, M., et al.: Rf-
cnn-f: random forest with convolutional neural network features for coronary
artery disease diagnosis based on cardiac magnetic resonance. Scientific Reports
12(1), 11178 (2022) https://doi.org/10.1038/s41598-022-15374-5

[73] Bhuiyan, M., Islam, M.S.: A new ensemble learning approach to detect malaria
from microscopic red blood cell images. Sensors International 4, 100209 (2023)
https://doi.org/10.1016/j.sintl.2022.100209

[74] Kaya, M.: Feature fusion-based ensemble cnn learning optimization for auto-
mated detection of pediatric pneumonia. Biomedical Signal Processing and
Control 87, 105472 (2024) https://doi.org/10.1016/j.bspc.2023.105472

[75] Breiman, L.: Bagging predictors. Machine learning 24, 123–140 (1996) https:
//doi.org/10.1007/BF00058655

[76] Schapire, R.E., Freund, Y.: Boosting: Foundations and algorithms. Kybernetes
42(1), 164–166 (2013) https://doi.org/10.1108/03684921311295547

[77] Wolpert, D.H.: Stacked generalization. Neural networks 5(2), 241–259 (1992)
https://doi.org/10.1016/S0893-6080(05)80023-1

[78] Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: Pro-
ceedings of the 22nd Acm Sigkdd International Conference on Knowledge
Discovery and Data Mining, pp. 785–794 (2016). https://doi.org/10.1145/
2939672.2939785

[79] Bala, D., Hossain, M.S., Hossain, M.A., Abdullah, M.I., Rahman, M.M., Man-
avalan, B., Gu, N., Islam, M.S., Huang, Z.: Monkeynet: A robust deep convolu-
tional neural network for monkeypox disease detection and classification. Neural
Networks 161, 757–775 (2023) https://doi.org/10.1016/j.neunet.2023.02.022

[80] Ali, S.N., Ahmed, M.T., Jahan, T., Paul, J., Sani, S., Noor, N., Asma,

48
A.N., Hasan, T.: A web-based mpox skin lesion detection system using state-
of-the-art deep learning models considering racial diversity. arXiv preprint
arXiv:2306.14169 (2023) https://doi.org/https://arxiv.org/abs/2306.14169

[81] Jha, D., Ali, S., Hicks, S., Thambawita, V., Borgli, H., Smedsrud, P.H., Lange,
T., Pogorelov, K., Wang, X., Harzig, P., et al.: A comprehensive analysis of clas-
sification methods in gastrointestinal endoscopy imaging. Medical image analysis
70, 102007 (2021) https://doi.org/10.1016/j.media.2021.102007

[82] Pogorelov, K., Randel, K.R., Griwodz, C., Eskeland, S.L., Lange, T., Johansen,
D., Spampinato, C., Dang-Nguyen, D.-T., Lux, M., Schmidt, P.T., et al.: Kvasir:
A multi-class image dataset for computer aided gastrointestinal disease detec-
tion. In: Proceedings of the 8th ACM on Multimedia Systems Conference, pp.
164–169 (2017). https://doi.org/10.1145/3083187.3083212

[83] Anagha, C., Aishwarya, K.: PCOS detection using ultrasound images.
Kaggle (2021). https://www.kaggle.com/datasets/anaghachoudhari/
pcos-detection-using-ultrasound-images

[84] Al-Dhabyani, W., Gomaa, M., Khaled, H., Fahmy, A.: Dataset of breast ultra-
sound images. Data in brief 28, 104863 (2020) https://doi.org/10.1016/j.dib.
2019.104863

[85] Hamada, A.: Br35H: Brain Tumor Detection 2020. Kaggle (2020). https://www.
kaggle.com/datasets/ahmedhamada0/brain-tumor-detection

[86] Bhuvaji, S., Kadam, A., Bhumkar, P., Dedge, S., Kanchan, S.: Brain Tumor
Classification (MRI). Kaggle (2020). https://doi.org/10.34740/KAGGLE/DSV/
1183165 . https://www.kaggle.com/dsv/1183165

[87] Kiefer, R., Abid, M., Steen, J., Ardali, M.R., Amjadian, E.: A catalog of public
glaucoma datasets for machine learning applications: A detailed description and
analysis of public glaucoma datasets available to machine learning engineers
tackling glaucoma-related problems using retinal fundus images and oct images.
In: Proceedings of the 2023 7th International Conference on Information System
and Data Mining, pp. 24–31 (2023). https://doi.org/10.1145/3603765.3603779

[88] Xie, Y., Richmond, D.: Pre-training on grayscale imagenet improves medical
image classification. In: Proceedings of the European Conference on Com-
puter Vision (ECCV) Workshops, pp. 0–0 (2018). https://openaccess.thecvf.
com/content eccv 2018 workshops/w33/html/Xie Pre-training on Grayscale
ImageNet Improves Medical Image Classification ECCVW 2018 paper.html

[89] Ericsson, L., Gouk, H., Hospedales, T.M.: How well do self-supervised
models transfer? In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 5414–5423 (2021).
https://openaccess.thecvf.com/content/CVPR2021/html/Ericsson How Well

49
Do Self-Supervised Models Transfer CVPR 2021 paper.html

[90] Nayak, A.K., Bisoyi, S.K., Banerjee, A., Mahanta, D., Swain, A.: Mpox classifier:
A deep transfer learning model for monkeypox disease detection and classifica-
tion. In: 2024 1st International Conference on Cognitive, Green and Ubiquitous
Computing (IC-CGU), pp. 1–6 (2024). https://doi.org/10.1109/IC-CGU58078.
2024.10530778 . IEEE

[91] Biswas, D., Tesic, J.: Large-margin saliency-aware binarized cnn for monkeypox
virus image classification (2024) https://doi.org/10.21203/rs.3.rs-3990337/v1

[92] Dong, Z., Xu, B., Shi, J., Zheng, L.: Local and global feature interaction network
for endoscope image classification. In: International Conference on Image and
Graphics, pp. 412–424 (2023). https://doi.org/10.1007/978-3-031-46314-3 33 .
Springer

[93] Mukhtorov, D., Rakhmonova, M., Muksimova, S., Cho, Y.-I.: Endoscopic image
classification based on explainable deep learning. Sensors 23(6), 3176 (2023)
https://doi.org/10.3390/s23063176

[94] Wang, W., Yang, X., Tang, J.: Vision transformer with hybrid shifted win-
dows for gastrointestinal endoscopy image classification. IEEE Transactions on
Circuits and Systems for Video Technology (2023) https://doi.org/10.1109/
TCSVT.2023.3277462

[95] Patel, V., Patel, K., Goel, P., Shah, M.: Classification of gastrointestinal dis-
eases from endoscopic images using convolutional neural network with transfer
learning. In: 2024 5th International Conference on Intelligent Communica-
tion Technologies and Virtual Mobile Networks (ICICV), pp. 504–508 (2024).
https://doi.org/10.1109/ICICV62344.2024.00085 . IEEE

[96] Khan, S.D., Basalamah, S., Lbath, A.: Multi-module attention-guided deep
learning framework for precise gastrointestinal disease identification in endo-
scopic imagery. Biomedical Signal Processing and Control 95, 106396 (2024)
https://doi.org/10.1016/j.bspc.2024.106396

[97] Gheflati, B., Rivaz, H.: Vision transformers for classification of breast ultrasound
images. In: 2022 44th Annual International Conference of the IEEE Engineering
in Medicine & Biology Society (EMBC), pp. 480–483 (2022). https://doi.org/
10.1109/EMBC48229.2022.9871809 . IEEE

[98] Deb, S.D., Jha, R.K.: Breast ultrasound image classification using fuzzy-rank-
based ensemble network. Biomedical Signal Processing and Control 85, 104871
(2023) https://doi.org/10.1016/j.bspc.2023.104871

[99] Sahu, A., Das, P.K., Meher, S.: High accuracy hybrid cnn classifiers for breast
cancer detection using mammogram and ultrasound datasets. Biomedical Signal

50
Processing and Control 80, 104292 (2023) https://doi.org/10.1016/j.bspc.2022.
104292

[100] Yue, Y., Li, Z.: Medmamba: Vision mamba for medical image classification.
arXiv preprint arXiv:2403.03849 (2024) https://doi.org/https://arxiv.org/abs/
2403.03849

[101] Zhou, G., Mosadegh, B.: Distilling knowledge from an ensemble of vision trans-
formers for improved classification of breast ultrasound. Academic Radiology
31(1), 104–120 (2024) https://doi.org/10.1016/j.acra.2023.08.006

[102] Kadam, A., Bhuvaji, S., Deshpande, S.: Brain tumor classification using deep
learning algorithms. International Journal for Research in Applied Science &
Engineering Technology 9 (2021) https://doi.org/10.22214/ijraset.2021.39280

[103] Ravinder, M., Saluja, G., Allabun, S., Alqahtani, M.S., Abbas, M., Othman, M.,
Soufiene, B.O.: Enhanced brain tumor classification using graph convolutional
neural network architecture. Scientific Reports 13(1), 14938 (2023) https://doi.
org/10.1038/s41598-023-41407-8

[104] Sahu, P.K., Jain, K.: The deep learning model for the examination of brain
tumor. In: 2024 IEEE International Conference on Interdisciplinary Approaches
in Technology and Management for Social Innovation (IATMSI), vol. 2, pp. 1–6
(2024). https://doi.org/10.1109/IATMSI60426.2024.10502750 . IEEE

[105] Sachdeva, J., Sharma, D., Ahuja, C.K.: Comparative analysis of different deep
convolutional neural network architectures for classification of brain tumor on
magnetic resonance images. Archives of Computational Methods in Engineering,
1–20 (2024) https://doi.org/10.1007/s11831-023-10041-y

[106] Velden, B.H., Kuijf, H.J., Gilhuijs, K.G., Viergever, M.A.: Explainable artificial
intelligence (xai) in deep learning-based medical image analysis. Medical Image
Analysis 79, 102470 (2022) https://doi.org/10.1016/j.media.2022.102470

[107] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learn-
ing deep features for discriminative localization. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2921–2929 (2016). https://openaccess.thecvf.com/content cvpr 2016/html/
Zhou Learning Deep Features CVPR 2016 paper.html

[108] Goyal, P., Mahajan, D., Gupta, A., Misra, I.: Scaling and benchmark-
ing self-supervised visual representation learning. In: Proceedings of
the Ieee/cvf International Conference on Computer Vision, pp. 6391–
6400 (2019). https://openaccess.thecvf.com/content ICCV 2019/html/
Goyal Scaling and Benchmarking Self-Supervised Visual Representation
Learning ICCV 2019 paper.html

51
[109] Rong, Y., Bian, Y., Xu, T., Xie, W., Wei, Y., Huang, W., Huang, J.: Self-
supervised graph transformer on large-scale molecular data. Advances in neural
information processing systems 33, 12559–12571 (2020)

[110] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin,
A.: Emerging properties in self-supervised vision transformers. In: Proceed-
ings of the IEEE/CVF International Conference on Computer Vision, pp.
9650–9660 (2021). https://openaccess.thecvf.com/content/ICCV2021/html/
Caron Emerging Properties in Self-Supervised Vision Transformers ICCV
2021 paper

[111] Chen, X., Xie, S., He, K.: An empirical study of training self-supervised
vision transformers. In: Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pp. 9640–9649 (2021). https://openaccess.
thecvf.com/content/ICCV2021/html/Chen An Empirical Study of Training
Self-Supervised Vision Transformers ICCV 2021 paper.html

[112] Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., Makedon, F.: A survey on
contrastive self-supervised learning. Technologies 9(1), 2 (2020) https://doi.org/
10.3390/technologies9010002

[113] Abou Baker, N., Zengeler, N., Handmann, U.: A transfer learning evaluation of
deep neural networks for image classification. Machine Learning and Knowledge
Extraction 4(1), 22–41 (2022) https://doi.org/10.3390/make4010002

[114] Kora, P., Ooi, C.P., Faust, O., Raghavendra, U., Gudigar, A., Chan, W.Y.,
Meenakshi, K., Swaraja, K., Plawiak, P., Acharya, U.R.: Transfer learning tech-
niques for medical image analysis: A review. Biocybernetics and Biomedical
Engineering 42(1), 79–107 (2022) https://doi.org/10.1016/j.bbe.2021.11.004

[115] Raghu, M., Zhang, C., Kleinberg, J., Bengio, S.: Transfusion: Understanding
transfer learning for medical imaging. Advances in neural information processing
systems 32 (2019)

[116] Alzubaidi, L., Al-Shamma, O., Fadhel, M.A., Farhan, L., Zhang, J., Duan, Y.:
Optimizing the performance of breast cancer classification by employing the
same domain transfer learning from hybrid deep convolutional neural network
model. Electronics 9(3), 445 (2020) https://doi.org/10.3390/electronics9030445

[117] Zoetmulder, R., Gavves, E., Caan, M., Marquering, H.: Domain-and task-specific
transfer learning for medical segmentation tasks. Computer Methods and Pro-
grams in Biomedicine 214, 106539 (2022) https://doi.org/10.1016/j.cmpb.2021.
106539

[118] Juan Ramon, A., Parmar, C., Carrasco-Zevallos, O.M., Csiszer, C., Yip, S.S.,
Raciti, P., Stone, N.L., Triantos, S., Quiroz, M.M., Crowley, P., et al.: Devel-
opment and deployment of a histopathology-based deep learning algorithm for

52
patient prescreening in a clinical trial. Nature Communications 15(1), 4690
(2024) https://doi.org/10.1038/s41467-024-49153-9

[119] Chang, H., Han, J., Zhong, C., Snijders, A.M., Mao, J.-H.: Unsupervised transfer
learning via multi-scale convolutional sparse coding for biomedical applications.
IEEE transactions on pattern analysis and machine intelligence 40(5), 1182–
1194 (2017) https://doi.org/10.1109/TPAMI.2017.2656884

[120] Huang, S.-C., Pareek, A., Jensen, M., Lungren, M.P., Yeung, S., Chaudhari, A.S.:
Self-supervised learning for medical image classification: a systematic review
and implementation guidelines. NPJ Digital Medicine 6(1), 74 (2023) https:
//doi.org/10.1038/s41746-023-00811-0

[121] Theodoris, C.V., Xiao, L., Chopra, A., Chaffin, M.D., Al Sayed, Z.R., Hill, M.C.,
Mantineo, H., Brydon, E.M., Zeng, Z., Liu, X.S., et al.: Transfer learning enables
predictions in network biology. Nature 618(7965), 616–624 (2023) https://doi.
org/10.1038/s41586-023-06139-9

[122] Li, J., Qiu, S., Shen, Y.-Y., Liu, C.-L., He, H.: Multisource transfer learning for
cross-subject eeg emotion recognition. IEEE transactions on cybernetics 50(7),
3281–3293 (2019) https://doi.org/10.1109/TCYB.2019.2904052

[123] Yang, Y., Li, X., Wang, P., Xia, Y., Ye, Q.: Multi-source transfer learning via
ensemble approach for initial diagnosis of alzheimer’s disease. IEEE Journal of
Translational Engineering in Health and Medicine 8, 1–10 (2020) https://doi.
org/10.1109/JTEHM.2020.2984601

[124] Xie, Y., Xu, Z., Zhang, J., Wang, Z., Ji, S.: Self-supervised learning of graph
neural networks: A unified review. IEEE Transactions on Pattern Analysis and
Machine Intelligence (2022) https://doi.org/10.1109/TPAMI.2022.3170559

[125] Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D.: Self-
supervised learning for medical image analysis using image context restoration.
Medical image analysis 58, 101539 (2019) https://doi.org/10.1016/j.media.2019.
101539

[126] Albelwi, S.: Survey on self-supervised learning: auxiliary pretext tasks and
contrastive learning methods in imaging. Entropy 24(4), 551 (2022) https:
//doi.org/10.3390/e24040551

[127] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications
of the ACM 63(11), 139–144 (2020) https://doi.org/10.1145/3422622

[128] Wu, L., Lin, H., Tan, C., Gao, Z., Li, S.Z.: Self-supervised learning on graphs:
Contrastive, generative, or predictive. IEEE Transactions on Knowledge and
Data Engineering (2021) https://doi.org/10.1109/TKDE.2021.3131584

53
[129] Mitrovic, J., McWilliams, B., Walker, J., Buesing, L., Blundell, C.: Representa-
tion learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922
(2020) https://doi.org/10.48550/arXiv.2010.07922

[130] Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum
contrastive learning. arXiv preprint arXiv:2003.04297 (2020) https://doi.org/10.
48550/arXiv.2003.04297

[131] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part XI 16, pp. 776–794 (2020). https://doi.org/10.1007/
978-3-030-58621-8 45 . Springer

[132] Liao, W., Xiong, H., Wang, Q., Mo, Y., Li, X., Liu, Y., Chen, Z., Huang, S.,
Dou, D.: Muscle: Multi-task self-supervised continual learning to pre-train deep
models for x-ray images of multiple body parts. In: International Conference
on Medical Image Computing and Computer-Assisted Intervention, pp. 151–161
(2022). https://doi.org/10.1007/978-3-031-16452-1 15 . Springer

[133] Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert-
level detection of pathologies from unannotated chest x-ray images via self-
supervised learning. Nature Biomedical Engineering 6(12), 1399–1406 (2022)
https://doi.org/10.1038/s41551-022-00936-9

[134] Dong, N., Kampffmeyer, M., Voiculescu, I.: Self-supervised multi-task represen-
tation learning for sequential medical images. In: Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pp. 779–794 (2021).
https://doi.org/10.1007/978-3-030-86523-8 47 . Springer

[135] Zhang, T., Wei, D., Zhu, M., Gu, S., Zheng, Y.: Self-supervised learning for med-
ical image data with anatomy-oriented imaging planes. Medical Image Analysis
94, 103151 (2024) https://doi.org/10.1016/j.media.2024.103151

[136] Taleb, A., Lippert, C., Klein, T., Nabi, M.: Multimodal self-supervised learn-
ing for medical image analysis. In: International Conference on Information
Processing in Medical Imaging, pp. 661–673 (2021). https://doi.org/10.1007/
978-3-030-78191-0 51 . Springer

[137] Zhang, X., Wu, C., Zhang, Y., Xie, W., Wang, Y.: Knowledge-enhanced visual-
language pre-training on chest radiology images. Nature Communications 14(1),
4542 (2023) https://doi.org/10.1038/s41467-023-40260-7

[138] An, N., Ding, H., Yang, J., Au, R., Ang, T.F.: Deep ensemble learning for
alzheimer’s disease classification. Journal of biomedical informatics 105, 103411
(2020) https://doi.org/10.1016/j.jbi.2020.103411

[139] Sreelakshmi, S., Malu, G., Sherly, E., Mathew, R.: M-net: An encoder-decoder

54
architecture for medical image analysis using ensemble learning. Results in
Engineering 17, 100927 (2023) https://doi.org/10.1016/j.rineng.2023.100927

[140] Müller, D., Soto-Rey, I., Kramer, F.: An analysis on ensemble learning optimized
medical image classification with deep convolutional neural networks. Ieee Access
10, 66467–66480 (2022) https://doi.org/10.1109/ACCESS.2022.3182399

[141] Raza, K.: Improving the prediction accuracy of heart disease with ensemble
learning and majority voting rule, pp. 179–196. Elsevier (2019). https://doi.org/
10.1016/B978-0-12-815370-3.00008-6

[142] Namamula, L.R., Chaytor, D.: Effective ensemble learning approach for large-
scale medical data analytics. International Journal of System Assurance
Engineering and Management 15(1), 13–20 (2024) https://doi.org/10.1007/
s13198-021-01552-7

[143] Ganaie, M.A., Hu, M., Malik, A.K., Tanveer, M., Suganthan, P.N.: Ensemble
deep learning: A review. Engineering Applications of Artificial Intelligence 115,
105151 (2022) https://doi.org/10.1016/j.engappai.2022.105151

[144] El Gannour, O., Hamida, S., Lamalem, Y., Cherradi, B., Saleh, S., Raihani,
A.: Enhancing skin diseases classification through dual ensemble learning and
pre-trained cnns. International Journal of Advanced Computer Science and
Applications 14(6) (2023)

[145] Remzan, N., Tahiry, K., Farchi, A.: Advancing brain tumor classification accu-
racy through deep learning: harnessing radimagenet pre-trained convolutional
neural networks, ensemble learning, and machine learning classifiers on mri brain
images. Multimedia Tools and Applications, 1–29 (2024) https://doi.org/10.
1007/s11042-024-18780-1
Acknowledgements
We thank all the colleagues who provided us with valuable feedback and sugges-
tions.
Author contribution
Zehui Zhao: Conceptualisation, Methodology, Validation, Writing: original draft,
Laith Alzubaidi: Conceptualisation, Methodology, Resources, Formal analysis, Writ-
ing: review & editing, Validation, Supervision, Jinglan Zhang:Writing: review &
editing, Formal analysis, Supervision Ye Duan: Formal analysis, Writing: review &
editing, Yuantong Gu: Writing: review & editing, Validation, Funding acquisition.
Consent for publication
All authors have given consent for publication.
Conflict of interest/Competing interests
All authors have declared that no conflict of interest exists.
Funding

55
The authors acknowledge the support received through the following funding
schemes of the Australian Government: Australian Research Council (ARC) Indus-
trial Transformation Training Centre (ITTC) for Joint Biomechanics under grant
(IC190100020).

Appendix A Extended data

Fig. A1 Extended Data| Confusion matrix of baseline model without using data aug-
mentation: This confusion matrix reflects the classification performance of the baseline model
on each target task, where all base models were trained directly on the target datasets without
pre-training or data augmentation. The matrix reveals significant performance imbalance between
classes.

56
Fig. A2 Extended Data| Confusion matrix of TL baseline model: This confusion matrix
reflects the classification performance of the TL baseline model, which ensembles pre-trained TL base
models. The baseline model demonstrates strong overall performance across each task.

57
Fig. A3 Extended Data | Confusion matrix of SSL baseline model: This confusion matrix
reflects the classification performance of the SSL baseline model, which ensembles pre-trained SSL
base models. The baseline model demonstrates more balanced performance towards grey-scale image
target tasks (T3 and T4 ) than colourful image target tasks (T1 and T2 ).

58
Fig. A4 Extended Data| Confusion matrix of ETSEF model: This confusion matrix reflects
the classification performance of the ETSEF model, which ensembles pre-trained base models using
both TL and SSL methods. The ETSEF model performs best among each target task compared to
the three baseline models.

59
Fig. A5 Extended Data| Additional visualisation sample of base models using SHAP
maps: This sample image from T1 demonstrates the differences in attention areas between TL and
SSL base models. The TL base models failed to make correct predictions, focusing on unrelated
areas, whereas the SSL base models successfully captured part of the symptom area and made correct
predictions.

60
Fig. A6 Extended Data| Additional visualisation sample of base models using SHAP
maps: This sample image from T2 depicts a normal pylorus image without any disease. The TL base
models wrongly focused on unrelated areas and misclassified the sample as a z-line image. In contrast,
the SSL model correctly identified the symptoms of a normal pylorus and made accurate predictions
based on this understanding.

61
Fig. A7 Extended Data| Additional visualisation samples of base models using Grad-
CAM maps: This figure presents additional samples from each task, providing evidence that the
ensemble learning technique has enhanced the model’s ability to learn and accurately capture relevant
representations from the target data. Compared to individual pre-trained base models, the ensemble
model extends its attention to cover more symptom areas and provides more precise localisation of
disease regions.

62
Fig. A8 Extended Data| Scratch baseline model workflow: The scratch baseline model
employs base models trained directly on the target datasets. Features extracted from each base model
are fused and selected through a pipeline. After ensemble learning with ML classifiers, a weighted
voting mechanism is applied to ensure the stability and performance of the baseline model.

63
Fig. A9 Extended Data| Pre-trained baseline model workflow: The pre-trained baseline
model follows a comparable ensemble pipeline to the ETSEF method, differing only in its base models
pre-trained with a single pre-training method. These baseline models are constructed for comparison
with our ETSEF method and demonstrate the benefits of integrating multiple pre-training methods.