A Multimodal Transfer Learning Approach Using PubMedCLIP For Medical Image Classification
ABSTRACT Medical image data often face the problem of data scarcity and costly annotation
processes. To overcome this, our study introduces a novel transfer learning method for medical image
classification. We present a multimodal learning framework that incorporates the pre-trained PubMedCLIP
model and multimodal feature fusion. Prompts of different complexities are combined with images
as inputs to the proposed model. Our findings demonstrate that this approach significantly enhances
image classification tasks while reducing the burden of annotation costs. Our study underscores the
potential of PubMedCLIP in revolutionizing medical image analysis through its prompt-based approach
and showcases the value of multi-modality for training robust models in healthcare. Code is available
at: https://github.com/HongJapan/MTL_prompt_medical.git.
INDEX TERMS Pre-trained model, medical image, classification task, contrastive language-image
pre-training, feature fusion, multimodal model, prompt engineering.
for the medical domain. Their study revealed that leveraging the pre-trained PubMedCLIP features enhances visual question-answering (VQA) performance, surpassing current state-of-the-art baseline models.

In this work, we propose a model that takes advantage of PubMedCLIP's image and text feature representations. The robust visual-language representations allow our model to handle cases with limited training data. Experimental results demonstrate that the proposed multimodal model achieves excellent results in classifying medical images from different datasets. This paper is an extended version of our previous work [19]. Compared to [19], the main extensions are as follows. First, multiple prompts of different complexities are considered. Interestingly, it is shown that a richer prompt leads to much higher gains in classification accuracy. Second, a better feature fusion method is employed to further improve the performance. Third, two more datasets are used and more experiments are carried out, resulting in many insights into the behaviors of the model and the reference methods.

The remainder of this paper is organized as follows. Section II presents related work on transfer learning and multimodal learning models. Section III describes the proposed approach and experimental setup. Extensive experimental results and discussions are provided in Section IV. Finally, conclusions are given in Section V.

II. RELATED WORK
In this section, we review previous work related to TL in medical image classification, including multimodal models and the applications of pre-trained models.

A. TRANSFER LEARNING IN MEDICAL IMAGE CLASSIFICATION
Transfer learning has been employed in medical image classification to enhance model performance, particularly when training data is limited. This approach enables models to leverage the knowledge of a pre-trained model learned on large datasets to improve the performance on smaller, domain-specific datasets. This saves time and costs, which is crucial in the medical imaging domain, where datasets can be relatively small. Previous work related to TL in medical image classification can be categorized as follows. (i) Feature extraction: A common approach is to use a pre-trained model such as VGG [20], MobileNet [21], DenseNet [22], or EfficientNet [23] as the feature extractor and then train a classifier on top of the extracted features. This approach has been shown to improve classification accuracy in various cases [24], [25], [26]. (ii) Fine-tuning a pre-trained model: This approach involves adapting a pre-trained model specifically for the medical image classification task. The parameters of the pre-trained model are updated by training on the target dataset. Fine-tuning has proven to be effective in medical image classification tasks, such as colonoscopy frame classification [27], [28]. (iii) Multi-task learning: This approach involves training a model simultaneously on multiple related tasks. In medical image classification, multi-task learning has been used to improve the accuracy of models by leveraging the relationship between different medical imaging tasks [29], [30]. (iv) Domain adaptation: Domain adaptation in TL involves adapting a model trained on a source domain to a target domain with different distributions. In medical image classification, this approach has been used to address the problem of data imbalance and improve model performance on specific target domains [31].

TL has shown practicality in improving the performance of medical image classification models. However, these techniques result in high computational costs, as discussed in [13] and [32]. Besides, not all pre-trained models that have been trained on large-scale natural image datasets perform optimally across all medical image modalities. For instance, a review paper by Morid et al. [33] highlighted that Inception models were commonly utilized in analyzing X-rays, endoscopic images, and ultrasound images, while GoogLeNet and AlexNet were frequently employed for MRI analysis. On the other hand, VGGNet models were mostly used in studying skin lesions, fundus images, and OCT (optical coherence tomography) data.

Recently, more advanced pre-trained models have been investigated (see Table 1). In [36], Ohata et al. considered 18 different image encoders in transfer learning for Path images. They showed that the best result of the experiment was provided by DenseNet. In the research of Jimenez et al. [37] on breast tumor classification, DenseNet also demonstrated high accuracy in diagnosing benign and malignant tumors when compared with different pre-trained models. Similarly, Sharma et al. [39] employed a DenseNet model with preprocessing techniques like normalization and data augmentation for Blood images. Meanwhile, Shaban et al. [34] demonstrated that MobileNet exhibits superior performance, achieving the highest average accuracy compared to various classifiers on Path images. Also, Eroglu et al. [35] found that the highest accuracy was obtained with MobileNet features for Breast images. Kallipolitis et al. [38] utilized transfer learning with various pre-trained models on a dataset augmented by the Grad-CAM technique to highlight visual patterns relevant to each class. The experimental results showed that EfficientNet outperformed the other models. In the study of Chola et al. [40], EfficientNet was employed as the backbone for Blood images that were pre-processed with image processing techniques. In a comparison of different deep learning models for mammography breast images, Jafari et al. [41] demonstrated that, among the individual models, EfficientNet consistently outperformed the others. Our study in [26] was the first to employ PubMedCLIP for medical image classification on various image types of the MedMNIST dataset. However, that solution is still unimodal, relying solely on the image modality.
B. MULTIMODAL LEARNING
In recent years, there has been increasing interest in using both text and image data as input for medical image analysis. Combining these two modalities allows for capturing both visual and semantic information, leading to improved accuracy and interpretability of classification results. Several recent studies have utilized medical reports to provide supervision information and learn multimodal representations by maximizing mutual information between the two input modalities [42], [43], [44]. Extracting labels from reports using natural language processing (NLP) has also been explored as a means to leverage information from the text [45], [46]. Transformer-based vision-and-language models are used for learning multimodal representations from images and associated reports, outperforming traditional CNN and RNN methods [47]. Attention mechanisms have also been used to facilitate interactions between visual and semantic information [48].

Contrastive Language-Image Pretraining (CLIP) is an advanced pre-trained model recently developed by OpenAI [17]. It applies contrastive learning to a huge dataset of 400 million image-text pairs obtained from the Internet. As a consequence of this multimodal training, CLIP can be used to find the text snippet that best represents a given image, or the most suitable image given a text query. One of the interesting advantages of CLIP is its ability to perform zero-shot learning [17]. Also, the high performance of CLIP features enables many new exciting applications, for example, pre-training models to address the challenge of limited labeled data [49], art classification [50], and image captioning [51].

In the medical domain, Eslami et al. [18] investigated the effectiveness of the pre-trained CLIP model for the visual question answering (VQA) task. To tailor the CLIP model for applications in the medical field, the authors introduced the PubMedCLIP model by fine-tuning the original CLIP model. This approach employs pairs of medical images and associated text of various anatomical regions from the medical ROCO dataset [52].

In line with the new trend of using large multimodal models (LMMs) in machine learning, our preliminary work [19] introduced the first multimodal transfer learning approach using PubMedCLIP, where text and image features are combined for classifying Breast images. In this paper, we present an extended solution with a new fusion method and prompt engineering. As a result, the proposed method can work with a small number of data samples and has good performance across different datasets.

III. METHODOLOGY
A. THE PROPOSED MULTIMODAL MODEL
As mentioned, our method aims to utilize the powerful multimodal representations of PubMedCLIP. The method takes as input both an image and a description text. First, the image and text are encoded using PubMedCLIP, which produces a vector representation for each modality. These vector representations are then fed into a fusion module to produce a combined feature vector, which is used to predict a similarity score. Finally, the similarity scores are employed for classification.

The proposed model consists of three main stages: feature extraction, feature fusion, and class prediction. As shown in Figure 1, in the first stage, the image feature extraction component takes a medical image v as input and outputs an image feature vector v⃗i. Similarly, the text feature extraction component generates the text feature vector q⃗j for an input text description of image class qj. For each pair (v⃗i, q⃗j), the feature fusion component produces a combined vector Z⃗vi,qj, which is used to compute the similarity score between the image and the text.

FIGURE 1. Overview of our model. We feed the original image and the label templates to the PubMedCLIP text encoder and the PubMedCLIP image encoder. A fusion technique is used to combine the two vectors. Finally, a softmax layer is added for classifying the disease.

We perform image feature extraction with two options, PubMedCLIP-RN50 and PubMedCLIP-ViT32. These two encoders are based on different technologies, namely CNN (PubMedCLIP-RN50) and Vision Transformer (PubMedCLIP-ViT32). This allows us to examine the behaviors of CNN and Vision Transformer encoders over the different medical imaging modes in our study, including microscopic imaging and ultrasound scan imaging.
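For illustration, the following is a minimal sketch of this first stage using the Hugging Face transformers CLIP interface. The checkpoint identifier, the example file name, and the plain label texts are assumptions for the sake of the sketch; the authors' own loading and preprocessing code may differ.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Assumed ViT-B/32 PubMedCLIP checkpoint on the Hugging Face Hub; replace as needed.
    MODEL_ID = "flaviagiammarino/pubmed-clip-vit-base-patch32"
    model = CLIPModel.from_pretrained(MODEL_ID).eval()
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    image = Image.open("breast_ultrasound_example.png")    # hypothetical input image
    class_texts = ["normal", "benign", "malignant"]         # Breast labels; richer prompts are introduced below

    with torch.no_grad():
        img_inputs = processor(images=image, return_tensors="pt")
        txt_inputs = processor(text=class_texts, return_tensors="pt", padding=True)
        v = model.get_image_features(**img_inputs)   # image feature vector, shape (1, 512) for ViT-B/32
        q = model.get_text_features(**txt_inputs)    # text feature vectors, shape (3, 512)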
In our approach, to effectively utilize image labels for model training, we draw inspiration from the methodology described in Radford et al.'s paper [17]. This approach acknowledges the importance of connecting text prompts with image content, a technique that has demonstrated enhanced performance compared to using simple labels alone [17]. In particular, it is shown that adding a simple word like "image" to the prompt can improve the performance. So, in this work, we consider a heuristic approach that gradually increases the contextual information in the prompt templates. The words we select for the prompts are commonly found in electronic health records (EHRs), such as medical, image, disease, illness, symptom, sign, and patient [53].

For each dataset, we have developed three distinct text prompt templates to guide the proposed model in the task of medical image classification. In addition to these prompts, we also include Prompt-0, which is simply the name of the label for each class. Specifically, the prompt templates are as follows.
1) Prompt-0: "{label}"
2) Prompt-1: "This image shows {label} disease".
3) Prompt-2: "In this medical image, there are indications of {label}".
4) Prompt-3: "Based on this medical image, it appears that the patient may be exhibiting signs or symptoms related to the {label} disease or illness".
As can be seen, these prompts offer varying levels of information, allowing the model to capture different aspects of the image. Specifically, Prompt-0 does not provide any additional context about the image, while more information is increasingly added to Prompt-1, Prompt-2, and Prompt-3. To facilitate this process, each dataset has a dictionary with descriptions of all the diseases present. These descriptions are encoded into text vectors, resulting in a set of text vectors specific to each dataset.
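As an illustration, the per-dataset description dictionary can be generated from these templates as in the following sketch. The template strings match Prompt-0 through Prompt-3 above; the helper function and the use of the Breast label set are illustrative only.

    TEMPLATES = {
        "prompt0": "{label}",
        "prompt1": "This image shows {label} disease",
        "prompt2": "In this medical image, there are indications of {label}",
        "prompt3": ("Based on this medical image, it appears that the patient may be "
                    "exhibiting signs or symptoms related to the {label} disease or illness"),
    }
    BREAST_LABELS = ["normal", "benign", "malignant"]

    def build_descriptions(labels, template_key="prompt3"):
        """Return one text description per class, ready for the PubMedCLIP text encoder."""
        template = TEMPLATES[template_key]
        return [template.format(label=label) for label in labels]

    print(build_descriptions(BREAST_LABELS, "prompt1"))
    # ['This image shows normal disease', 'This image shows benign disease',
    #  'This image shows malignant disease']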
In the second stage, we combine the image and text features into a single feature vector using the feature fusion block. A straightforward approach for combining feature vectors is to multiply them element-wise. However, this method has limitations due to the poor interaction of the elements between the two vectors. Various fusion techniques have been developed to combine text and image feature vectors so as to maximize interactions. These approaches usually rely on the idea of making bilinear pooling computationally feasible. In this study, we employ the Multimodal Factorized Bilinear Pooling (MFB) method [54] for multimodal feature fusion because of its simplicity, ease of implementation, and high convergence rate. MFB [54] is a pooling method that combines information from multiple modalities (e.g., image and text) by computing the outer product of their feature vectors and then factorizing the resulting matrix using a low-rank decomposition. This approach allows for efficient modeling of pairwise interactions between different modalities while reducing the feature dimensionality after pooling [55], [56]. A comparison of MFB with other fusion methods is discussed in Section IV.
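The following is a minimal PyTorch sketch of an MFB-style fusion layer in the spirit of [54]: both feature vectors are projected into a factorized joint space, multiplied element-wise, sum-pooled over the factors, and normalized. The projection sizes, dropout rate, and normalization details are assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MFB(nn.Module):
        """Minimal Multimodal Factorized Bilinear pooling (after Yu et al. [54])."""
        def __init__(self, img_dim=512, txt_dim=512, out_dim=1000, factor=5, dropout=0.1):
            super().__init__()
            self.out_dim, self.factor = out_dim, factor
            self.proj_v = nn.Linear(img_dim, out_dim * factor)  # image-side projection
            self.proj_q = nn.Linear(txt_dim, out_dim * factor)  # text-side projection
            self.drop = nn.Dropout(dropout)

        def forward(self, v, q):
            joint = self.drop(self.proj_v(v) * self.proj_q(q))        # element-wise product in the joint space
            z = joint.view(-1, self.out_dim, self.factor).sum(dim=2)  # sum-pool over the k factors
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)      # power normalization
            return F.normalize(z, dim=-1)                             # L2 normalization

    fuse = MFB()
    z = fuse(torch.randn(1, 512), torch.randn(1, 512))  # fused vector, shape (1, 1000)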
In the third stage, class prediction is done based on the combined feature vectors. Given a set of combined vectors {Z⃗vi,qj} for each pair (v⃗i, q⃗j), we employ a set of fully-connected layer blocks, each of which independently transforms Z⃗vi,qj to a scalar. These output scalar values form the similarity scores between the image v⃗i and the text description q⃗j. The blocks are denoted as Similarity Score Extraction modules in Figure 1b. Finally, a softmax layer normalizes the scores, yielding a probability distribution indicating the likelihood of the input image belonging to a description from the dictionary. The prediction is chosen by selecting the highest-probability element from the distribution.
B. DATASETS
To conduct this research, we use three different medical datasets with different classes and imaging modes. The first is the Blood dataset, consisting of 17,092 microscopic peripheral blood cell images [11]. The images of this dataset are categorized into eight classes: neutrophils, eosinophils, basophils, lymphocytes, monocytes, immature granulocytes, erythroblasts, and platelets or thrombocytes. The second is the Path dataset, containing 100,000 images of human colorectal cancer and healthy tissues [57]. The tissue images are organized into nine classes: adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM). The third is the Breast dataset, containing 780 medical images of breast cancer using ultrasound scans [58]. The Breast dataset is organized into three classes: normal, benign, and malignant.

C. REFERENCE MODELS AND IMPLEMENTATION DETAILS
In order to evaluate the improvements of the proposed multimodal model with respect to previous multimodal and unimodal models, the following reference models are employed in our experiments.
• The multimodal model of [19], which is the preliminary version of our work. This model uses PubMedCLIP's image and text encoders without prompt engineering. Note that, in this model, we use only the Transformer-based encoder (PubMedCLIP-ViT32) because, as shown in [19] and [26], it is always better than the ResNet-based encoder. In the following, this model is denoted as PubMedCLIP-Multi.
• The unimodal model of [26] that only uses the image modality of PubMedCLIP. In the following, this model is denoted with two options, PubMedCLIP-ViT32 and PubMedCLIP-RN50. Here, the image encoders of this unimodal model are exactly the same as those of the multimodal models.
• Three unimodal models using a popular pretrained model, namely DenseNet, MobileNet, or EfficientNet. As mentioned above, recent studies (e.g., [35], [36], [40]) focus on only a certain image type (e.g., Blood or Path), so their findings on the best pretrained model vary. In our evaluation, these models will be compared on the three datasets, using the same settings as the above unimodal and multimodal models.
To clearly see the performance differences of the models, our experiments use the same setup for all models. In particular, because we want to see the performances with a small amount of training data, no data augmentation or preprocessing techniques are applied. The workflow of the unimodal models is shown in Figure 2, where the feature vector provided by a pre-trained model is input into a fully-connected layer for classification. For training of both multimodal and unimodal models, the learning rate is set to 1 × 10^-3, and the batch size is 16. All implementations are based on the PyTorch framework [59]. To obtain stable results, we repeat all experiments ten times and report the average scores over all experiment runs.
IV. EXPERIMENTS
In this section, we show the performance comparison between the proposed model and the reference models on different datasets. We also perform extensive experiments with different fusion techniques, prompt templates, and different numbers of training samples.

A. EXPERIMENTAL SETTINGS
A key focus of our research was to examine how our model performs under conditions of limited training data. To achieve this, we gradually increase the number of training samples of each class. Specifically, we start with small numbers of training images per class, namely 10, 50, 100, and so on, until eventually reaching 80% of the dataset. The images not used for training in each case are set aside for testing. We maintain the same setting for all evaluated models. The incremental increase in training data size enables us to explore the models' learning behaviors as they gain access to more training samples. This provides valuable insights into the trade-off between training data volume and performance.

Our experiments evaluate the models' performance using accuracy as the primary metric to assess their ability to distinguish between the various classes. The accuracy metric, represented by Equation (1), provides a comprehensive measure of the overall correctness of a model's predictions. The formula for the accuracy metric is as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.
B. EXPERIMENTAL RESULTS
1) FUSION TECHNIQUE COMPARISON
In the proposed model, to fuse the text and image vectors for prediction, we employed the MFB fusion technique. To show the benefit of this fusion technique, we compared it to two other popular fusion techniques, namely Multimodal Compact Bilinear Pooling (MCB) [60] and Multimodal Tucker Fusion (MUTAN) [61]. For simplicity, template Prompt-1 is used in this evaluation. In Figure 3, the performances of the proposed model using one of the two vision backbones, PubMedCLIP-RN50 and PubMedCLIP-ViT32, together with the three fusion techniques are shown for the three datasets. For the Blood dataset, the results are shown in Figure 3(a), where both PubMedCLIP-RN50 and PubMedCLIP-ViT32 with MFB exhibit increasing accuracy as the number of shots is increased. When the number of shots exceeds 100, the curves reach high accuracy, around 90% for PubMedCLIP-ViT32 and around 85% for PubMedCLIP-RN50. However, when employing the MCB and MUTAN fusion techniques, the curves remain relatively flat, showing minimal improvement even when the number of shots is high. Moreover, the accuracies achieved by the MCB and MUTAN fusion techniques are significantly lower, approximately 70% for PubMedCLIP-ViT32 with MUTAN, 58% for PubMedCLIP-RN50 with MUTAN, 66% for PubMedCLIP-ViT32 with MCB, and 43% for PubMedCLIP-RN50 with MCB. With the Path dataset in Figure 3(b), PubMedCLIP-ViT32 with MFB provides the highest curve among all the combinations. When the number of shots exceeds 200, the accuracy surpasses 90%. Besides, the MCB fusion technique provides highly unstable results. With the Breast dataset (Figure 3(c)), the behavior is similar to that in the Blood dataset. The MFB fusion technique demonstrates favorable results for both PubMedCLIP-RN50 and PubMedCLIP-ViT32, with increasing accuracy as the number of shots increases. However, the other fusion techniques show much lower results; the accuracy of MUTAN with PubMedCLIP-ViT32 (PubMedCLIP-RN50) is consistently around 78% (70%). The MCB fusion technique results in only about 65% for both backbones. Among the three fusion techniques, MUTAN is only better than MFB at very small numbers of shots (e.g., 10 shots in the Path and Breast datasets).

In summary, based on the experimental results, the MFB fusion technique in general shows the best performance across the Blood, Path, and Breast datasets, for both PubMedCLIP-RN50 and PubMedCLIP-ViT32 backbones. In the following evaluations, we will exclusively present the results obtained using the MFB fusion technique.

2) PROMPT TEMPLATE EVALUATION
In this part, our evaluation involves testing each prompt template's performance as the number of training samples is increased from 10 samples per class up to 80% of the dataset. For simplicity, only PubMedCLIP-ViT32 is used as the image encoder. The results presented in Table 2 highlight the different performances of the prompt templates (i.e., Prompt-0, Prompt-1, Prompt-2, and Prompt-3). Furthermore, the results consistently demonstrate that Prompt-3 outperformed Prompt-0, Prompt-1, and Prompt-2 on all datasets. Especially on the Path dataset, the performance of Prompt-3 quickly jumps to a high level after 500 shots. Meanwhile, on the Breast dataset, the performance of Prompt-3 saturates after 100 shots.

TABLE 2. Accuracy values for different prompts.

Additionally, the visualization in Figure 4 confirms Prompt-3's consistent and superior performance. The results show that the performance of Prompt-0 is the lowest. More specifically, in Figure 4, we can see that adding the words "image" and "disease" in Prompt-1 can help improve the performance on the Blood and Breast datasets when the number of shots is high, and on the Path dataset when the number of shots is medium (from 1500 to 5000 shots). Also, in general, Prompt-2 has better performance than Prompt-1 at most numbers of shots. This observation emphasizes the pivotal role of prompt engineering in the model's performance. The success of Prompt-3 can be attributed to its provision of richer contextual information, which better guides the model in associating image content with the corresponding medical condition. In our future work, we will further investigate the potential of leveraging more intricate and informative language constructs to enhance the performance of multimodal models in medical image classification. In the upcoming evaluation experiments, we will exclusively present results using Prompt-3 in the proposed model.

3) MODEL PERFORMANCE ACCURACY RESULTS
In this section, we compare the performances of the proposed model and the reference models on the three datasets. The experimental results are given in Table 3. The performances of the models vary across the datasets. Here, we specifically explore the performances when the number of training samples (shots) gradually increases.

For the 10-shot learning scenario, we trained the models using ten images per class from each dataset and utilized the remaining images for testing. The results indicate that the proposed model (PubMedCLIP-ViT32) achieves the highest or second-highest accuracy across the three datasets. In the Path dataset, our model achieves the highest accuracy score among the models. However, all models perform poorly on the Breast dataset under the ten-shot learning setting. Notably, PubMedCLIP-ViT32 exhibits superior performance compared to PubMedCLIP-RN50. So, in the following, the proposed model that employs PubMedCLIP-ViT32 is mostly referred to in the discussion.

For the 50-shot learning scenario, we increased the training data to 50 images per class. The results show that as the number of training images increases, the overall accuracy of the models improves. Our multimodal model achieves relatively high scores across all three datasets, with accuracy exceeding 80%. Notably, DenseNet and MobileNet perform well on the Blood and Path datasets but poorly on the Breast dataset.

Moving on to the 100-shot learning scenario, we fed 100 images per class into the models for training. The results indicate that our model's accuracy increases more slowly than those of MobileNet and DenseNet when transitioning from 50 to 100 training images per class in the Blood and Path datasets. Specifically, MobileNet achieves an accuracy of approximately 90% in the Blood and Path datasets, while DenseNet achieves a similar accuracy in the Path dataset. Nevertheless, our model performs well across all three datasets, with the accuracy surpassing 88%. Notably, in the Breast dataset, our model achieves an accuracy of over 92%, whereas the other models fall below 80%.

Further increasing the training data to 200 images per class, our model demonstrates outstanding performance across all three datasets. It achieves an accuracy of 92.1% in the Blood dataset, 90.3% in the Path dataset, and 92.8% in the Breast dataset, comparable to those of MobileNet. Compared to DenseNet, our model performs better by approximately 3% in the Blood dataset and 14% in the Breast dataset, and slightly lower by 0.2% in the Path dataset. When we increase the training data to 300 images per class, our model excels across all the datasets.

The dependence of model performances on the number of training samples and datasets can be seen more clearly in Figure 5. With the Blood dataset (Figure 5(a)), our model initially obtained the second-highest accuracy at 200 shots, trailing behind MobileNet. However, from 300 shots onward, the proposed model outperformed all other models. With the Path dataset, initially the proposed model again performs worse than MobileNet. However, at 500 shots, the result of MobileNet falls below that of the proposed model. Especially with the Breast dataset (Figure 5(c)), the proposed model consistently achieved the highest accuracy across all numbers of shots. Meanwhile, all other models, including MobileNet, have much lower performances on this dataset. It can be concluded that the proposed model can consistently achieve good results across different datasets.
Regarding the multimodal model PubMedCLIP-Multi, its performances on the Path and Blood datasets are comparable to those of the unimodal PubMedCLIP-ViT32 (Figure 5(a) and (b)); however, on the Breast dataset, it is much better than PubMedCLIP-ViT32 and the other unimodal models (Figure 5(c)). Among the unimodal models, PubMedCLIP-ViT32 is, in general, the best one over all three datasets, except at some small numbers of shots. Meanwhile, the performances of DenseNet, MobileNet, and EfficientNet vary across the datasets. Moreover, the proposed model's results are consistently the highest over a wide range of the number of shots. This shows the promising capabilities of both multimodal and unimodal solutions based on PubMedCLIP, thanks to its very large scale.

C. ABLATION STUDY
In this part, we investigate the contributions of the two new components of the proposed model, namely the new fusion and the new best prompt (i.e., Prompt-3). So, the comparison includes the following cases:
• Case-1: No new components (i.e., our preliminary model in [19]).
• Case-2: Using the new fusion only.
• Case-3: Using the new prompt only.
• Case-4: Using the new fusion and the new prompt (i.e., the proposed model).
Here, for simplicity, we also employ only PubMedCLIP-ViT32, which is the best encoder for the image modality. The accuracy results of the above four cases when the training data is 80% of a dataset are shown in Table 4. It can be seen that the gains from the new fusion alone are up to 1.7%, while the gains from the new prompt alone are up to 1.3%, lower than the gains from the new fusion. When both the new fusion and the new prompt are used, the gains are 2.4%, 1.8%, and 3% on the Blood, Path, and Breast datasets, respectively. These results mean that each new component can improve the performance, and when they are combined, the joint improvement is higher than the individual improvements. So, the two new components are complementary to each other, and both are beneficial for the high performance of the proposed model.

TABLE 4. Ablation study's settings and results.

D. DISCUSSIONS
The above results demonstrate the capabilities of the proposed model, which outperforms the reference models in two aspects:
• The superior performances are consistent across three different image types, whereas previous studies focus on only a certain type (e.g., either Blood, Path, or Breast).
• The behavior is also consistent over a wide range of the number of shots. It should be noted that existing studies mostly try to enlarge the amount of training data (e.g., by various data augmentation techniques) to improve the performance.
The advantages of the proposed model can be attributed to the robustness (or generalizability) of the large-scale and multimodal nature of the pre-trained PubMedCLIP model, together with prompt engineering and feature fusion.

It should be noted that the image encoder in the proposed model is the same (i.e., unmodified) as those used in the unimodal models (using either PubMedCLIP-RN50 or PubMedCLIP-ViT32). However, thanks to the processing of both the image input and the text input, the proposed multimodal model always outperforms the corresponding unimodal model. This is an interesting benefit of large multimodal models like PubMedCLIP.
In addition, the experiments show that PubMedCLIP-ViT32 always performs better than PubMedCLIP-RN50 in both unimodal and multimodal cases. On the Blood dataset, the unimodal model using PubMedCLIP-ViT32 is only worse than the multimodal model using PubMedCLIP-ViT32, which is even better than all other unimodal and multimodal models. This means the vision transformer technology is more effective than CNN in this classification task.

Our results also emphasize the importance of text prompt engineering to enhance a model's performance. In our study, adding more medical context into the prompt template helps the model understand more about the image that it needs to classify. The improved performance when incorporating such keywords into the prompt can be attributed to the unique capabilities of the PubMedCLIP model, which is a fine-tuned version of CLIP tailored for medical applications. PubMedCLIP has been trained with a huge amount of images and associated text. A text prompt can be considered as a context input into the multimodal model. It seems that when appropriate words are provided in the prompt, the context becomes clearer to the model, and thus the performance at the output is higher. So, it is important to empower the model with a richer context rather than a simple label or short description.

Furthermore, our model's robustness in image classification accuracy is fortified by fusing the feature vectors of the image and text inputs. This fusion of image and text vectors, coupled with an extensive text vector dictionary, equips our model to tackle a broad spectrum of medical conditions, ensuring consistently high accuracy across diverse image classification tasks. This multifaceted solution has been shown to be beneficial for medical image classification with limited training data and to be adaptable across various datasets.

V. CONCLUSION AND FUTURE WORK
In this work, we have investigated the capability of transfer learning based on PubMedCLIP for medical image classification. We proposed a multimodal model that harnesses text prompts and images to achieve high accuracy even with limited training data, surpassing the performance of traditional transfer learning models. The advantages of the proposed model could be attributed to the multimodal pre-trained backbones, prompt engineering, and feature fusion. In particular, the effective use of prompt templates in our model highlights its potential for various image classification domains. For future work, we will extend this approach by developing automated or context-aware prompts, which may improve the model's performance across diverse domains. Additionally, we will further evaluate the adaptability of the proposed model to various medical subfields and explore cross-domain applications.

REFERENCES
[1] A. Janowczyk and A. Madabhushi, "Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases," J. Pathol. Informat., vol. 7, no. 1, p. 29, Jan. 2016.
[2] H. Sharma, N. Zerbe, I. Klempert, O. Hellwich, and P. Hufnagl, "Deep convolutional neural networks for automatic classification of gastric carcinoma using whole slide images in digital histopathology," Computerized Med. Imag. Graph., vol. 61, pp. 2–13, Nov. 2017.
[3] D. S. W. Ting et al., "Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes," JAMA, vol. 318, no. 22, p. 2211, Dec. 2017.
[4] R. Ribani and M. Marengoni, "A survey of transfer learning for convolutional neural networks," in Proc. 32nd SIBGRAPI Conf. Graph., Patterns Images Tuts., Oct. 2019, pp. 47–57.
[5] D. Sarkar, R. Bali, and T. Ghosh, Hands-On Transfer Learning With Python: Implement Advanced Deep Learning and Neural Network Models Using TensorFlow and Keras. Birmingham, U.K.: Packt Publishing Ltd, 2018.
[6] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–18.
[7] O. Hadad, R. Bakalo, R. Ben-Ari, S. Hashoul, and G. Amit, "Classification of breast lesions using cross-modal deep learning," in Proc. IEEE 14th Int. Symp. Biomed. Imag., Apr. 2017, pp. 109–112.
[8] S. Saxena, S. Shukla, and M. Gyanchandani, "Pre-trained convolutional neural networks as feature extractors for diagnosis of breast cancer using histopathology," Int. J. Imag. Syst. Technol., vol. 30, no. 3, pp. 577–591, Sep. 2020.
[9] A. Rakhlin, A. Shvets, V. Iglovikov, and A. A. Kalinin, "Deep convolutional neural networks for breast cancer histology image analysis," in Proc. Int. Conf. Image Anal. Recognit., 2018, pp. 737–744.
[10] A. Mahbod, I. Ellinger, R. Ecker, Ö. Smedby, and C. Wang, "Breast cancer histological image classification using fine-tuned deep network fusion," in Proc. 15th Int. Conf., 2018, pp. 754–762.
[11] A. Acevedo, S. Alférez, A. Merino, L. Puigví, and J. Rodellar, "Recognition of peripheral blood cell images using convolutional neural networks," Comput. Methods Programs Biomed., vol. 180, Oct. 2019, Art. no. 105020.
[12] Y. Wang, E. J. Choi, Y. Choi, H. Zhang, G. Y. Jin, and S.-B. Ko, "Breast cancer classification in automated breast ultrasound using multiview convolutional neural network with transfer learning," Ultrasound Med. Biol., vol. 46, no. 5, pp. 1119–1132, May 2020.
[13] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1285–1298, May 2016.
[14] G. Liang and L. Zheng, "A transfer learning method with deep residual network for pediatric pneumonia diagnosis," Comput. Methods Programs Biomed., vol. 187, Apr. 2020, Art. no. 104964.
[15] A. Tiwari, S. Srivastava, and M. Pant, "Brain tumor segmentation and classification from magnetic resonance images: Review of selected methods from 2014 to 2019," Pattern Recognit. Lett., vol. 131, pp. 244–260, Mar. 2020.
[16] M. Salvi, U. R. Acharya, F. Molinari, and K. M. Meiburger, "The impact of pre- and post-image processing techniques on deep learning frameworks: A comprehensive review for digital pathology image analysis," Comput. Biol. Med., vol. 128, Jan. 2021, Art. no. 104129.
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., vol. 139, 2021, pp. 8748–8763.
[18] S. Eslami, G. de Melo, and C. Meinel, "Does CLIP benefit visual question answering in the medical domain as much as it does in the general domain?" 2021, arXiv:2112.13906.
[19] H. N. Dao, T. Nguyen, C. Mugisha, and I. Paik, "A multimodal transfer learning approach for medical image classification," in Proc. IEEE Int. Conf. Consum. Electronics-Asia (ICCE-Asia), Oct. 2023, pp. 1–18.
[20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[21] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
[23] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[24] D. Varshni, K. Thakral, L. Agarwal, R. Nijhawan, and A. Mittal, "Pneumonia detection using CNN based feature extraction," in Proc. IEEE Int. Conf. Electr., Comput. Commun. Technol. (ICECCT), Feb. 2019, pp. 1–7.
[25] T. Kaur and T. K. Gandhi, "Automated brain image classification based on VGG-16 and transfer learning," in Proc. Int. Conf. Inf. Technol. (ICIT), Dec. 2019, pp. 94–98.
[26] H. N. Dao, T. N. Quang, and I. Paik, "Transfer learning for medical image classification on multiple datasets using PubMedCLIP," in Proc. IEEE Int. Conf. Consum. Electronics-Asia (ICCE-Asia), Oct. 2022, pp. 1–4.
[27] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, "Convolutional neural networks for medical image analysis: Full training or fine tuning?" IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1299–1312, May 2016.
[28] Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang, "Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4761–4772.
[29] J. Li, G. Zhao, Y. Tao, P. Zhai, H. Chen, H. He, and T. Cai, "Multi-task contrastive learning for automatic CT and X-ray diagnosis of COVID-19," Pattern Recognit., vol. 114, Jun. 2021, Art. no. 107848.
[30] J. Gan, L. Xiang, Y. Zhai, C. Mai, G. He, J. Zeng, Z. Bai, R. Donida Labati, V. Piuri, and F. Scotti, "2M BeautyNet: Facial beauty prediction based on multi-task transfer learning," IEEE Access, vol. 8, pp. 20245–20256, 2020.
[31] P. Zhang, J. Li, Y. Wang, and J. Pan, "Domain adaptation for medical image segmentation: A meta-learning method," J. Imag., vol. 7, no. 2, p. 31, Feb. 2021.
[32] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification," Neurocomputing, vol. 321, pp. 321–331, Dec. 2018.
[33] M. A. Morid, A. Borjali, and G. Del Fiol, "A scoping review of transfer learning research on medical image analysis using ImageNet," Comput. Biol. Med., vol. 128, Jan. 2021, Art. no. 104115.
[34] M. Shaban, R. Awan, M. M. Fraz, A. Azam, Y.-W. Tsang, D. Snead, and N. M. Rajpoot, "Context-aware convolutional neural network for grading of colorectal cancer histology images," IEEE Trans. Med. Imag., vol. 39, no. 7, pp. 2395–2405, Jul. 2020.
[35] Y. Eroğlu, M. Yildirim, and A. Çinar, "Convolutional neural networks based classification of breast ultrasonography images by hybrid method with respect to benign, malignant, and normal using mRMR," Comput. Biol. Med., vol. 133, Jun. 2021, Art. no. 104407.
[36] E. F. Ohata, J. V. S. D. Chagas, G. M. Bezerra, M. M. Hassan, V. H. C. de Albuquerque, and P. P. R. Filho, "A novel transfer learning approach for the classification of histological images of colorectal cancer," J. Supercomput., vol. 77, no. 9, pp. 9494–9519, Sep. 2021.
[37] Y. Jiménez Gaona, M. J. Rodriguez-Alvarez, H. Espino-Morato, D. Castillo Malla, and V. Lakshminarayanan, "Densenet for breast tumor classification in mammographic images," in Proc. Int. Conf. Bioeng. Biomed. Signal Image Process., 2021, pp. 166–176.
[38] A. Kallipolitis, K. Revelos, and I. Maglogiannis, "Ensembling EfficientNets for the classification and interpretation of histopathology images," Algorithms, vol. 14, no. 10, p. 278, Sep. 2021.
[39] S. Sharma, S. Gupta, D. Gupta, S. Juneja, P. Gupta, G. Dhiman, and S. Kautish, "Deep learning model for the automatic classification of white blood cells," Comput. Intell. Neurosci., vol. 2022, pp. 1–13, Jan. 2022.
[40] C. Chola, A. Y. Muaad, M. B. Bin Heyat, J. V. B. Benifa, W. R. Naji, K. Hemachandran, N. F. Mahmoud, N. A. Samee, M. A. Al-Antari, Y. M. Kadah, and T.-S. Kim, "BCNet: A deep learning computer-aided diagnosis framework for human peripheral blood cell identification," Diagnostics, vol. 12, no. 11, p. 2815, Nov. 2022.
[41] Z. Jafari and E. Karami, "Breast cancer detection in mammography images: A CNN-based approach with feature selection," Information, vol. 14, no. 7, p. 410, Jul. 2023.
[42] T.-M. Harry Hsu, W.-H. Weng, W. Boag, M. McDermott, and P. Szolovits, "Unsupervised multimodal representation learning across medical images and reports," 2018, arXiv:1811.08615.
[43] G. Chauhan, R. Liao, W. Wells, J. Andreas, X. Wang, S. Berkowitz, S. Horng, P. Szolovits, and P. Golland, "Joint modeling of chest radiographs and radiology reports for pulmonary edema assessment," in Proc. Int. Conf. Med. Image Comput. Comput. Assist. Intervent (MICCAI), 2020, pp. 529–539.
[44] S.-C. Huang, L. Shen, M. P. Lungren, and S. Yeung, "GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3922–3931.
[45] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng, "CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison," in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 1, 2019, pp. 590–597.
[46] A. E. W. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-Y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng, "MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs," 2019, arXiv:1901.07042.
[47] Y. Li, H. Wang, and Y. Luo, "A comparison of pre-trained vision-and-language models for multimodal representation learning across medical images and reports," in Proc. IEEE Int. Conf. Bioinf. Biomed. (BIBM), Dec. 2020, pp. 1999–2004.
[48] Z. Zhang, P. Chen, X. Shi, and L. Yang, "Text-guided neural network training for image recognition in natural scenes and medicine," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1733–1745, May 2021.
[49] N. Mu, A. Kirillov, D. Wagner, and S. Xie, "SLIP: Self-supervision meets language-image pre-training," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2022, pp. 529–544.
[50] M. V. Conde and K. Turgutlu, "CLIP-Art: Contrastive pre-training for fine-grained art classification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2021, pp. 3951–3955.
[51] M. Barraco, M. Cornia, S. Cascianelli, L. Baraldi, and R. Cucchiara, "The unreasonable effectiveness of CLIP features for image captioning: An experimental analysis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp. 4661–4669.
[52] O. Pelka, S. Koitka, J. Rückert, F. Nensa, and C. M. Friedrich, "Radiology objects in context (ROCO): A multimodal image dataset," in Proc. 7th Joint Int. Workshop, Sep. 2018, pp. 180–189.
[53] H. N. Dao and I. Paik, "Patient similarity using electronic health records and self-supervised learning," in Proc. IEEE 16th Int. Symp., Dec. 2023, pp. 1–15.
[54] Z. Yu, J. Yu, J. Fan, and D. Tao, "Multi-modal factorized bilinear pooling with co-attention learning for visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1839–1848.
[55] C. Zhang, Z. Yang, X. He, and L. Deng, "Multimodal intelligence: Representation learning, information fusion, and applications," IEEE J. Sel. Topics Signal Process., vol. 14, no. 3, pp. 478–493, Mar. 2020.
[56] D. Sharma, S. Purushotham, and C. K. Reddy, "MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain," Sci. Rep., vol. 11, no. 1, p. 19826, Oct. 2021.
[57] J. N. Kather, N. Halama, and A. Marx, "100,000 histological images of human colorectal cancer and healthy tissue (v0.1)," Zenodo, 2018, doi: 10.5281/zenodo.1214456.
[58] W. Al-Dhabyani, M. Gomaa, H. Khaled, and A. Fahmy, "Dataset of breast ultrasound images," Data Brief, vol. 28, Feb. 2020, Art. no. 104863.
[59] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–11.
[60] A. Fukui, D. Huk Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, "Multimodal compact bilinear pooling for visual question answering and visual grounding," 2016, arXiv:1606.01847.
[61] H. Ben-younes, R. Cadene, M. Cord, and N. Thome, "MUTAN: Multimodal Tucker fusion for visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2631–2639.
HONG N. DAO received the M.B.A. degree from Chungnam National University, South Korea. She is currently pursuing the Ph.D. degree with The University of Aizu. Her research interests include data analytics for economics and big data for smart cities.

TUYEN NGUYEN received the bachelor's degree in computer science from The University of Aizu, in 2022. He is currently pursuing the Ph.D. degree with the University of Technology Sydney, Australia. His research interests include image processing, computer vision, and quantum computing.

CHERUBIN MUGISHA (Member, IEEE) received the bachelor's degree in computer science from Université Lumiére de Bujumbura, Burundi, in 2013, and the M.Sc. degree in computer science from The University of Aizu, Japan, in 2020, where he is currently pursuing the Ph.D. degree. His research interests include algorithms for multimodal machine learning methods integrating NLP, structured and unstructured data, and their application to medical data.

INCHEON PAIK (Senior Member, IEEE) received the M.E. and Ph.D. degrees in electronic engineering from Korea University in 1987 and 1992, respectively. He is currently a Professor with the University of Aizu, Japan. His research interests include deep learning applications, ethical LLMs, machine learning, big data science, and semantic web services. He is a member of the IEICE, IEIE, and IPSJ.