1. INTRODUCTION
This research was undertaken to develop a model for predicting diseases from chest X-ray images that can quickly adapt to new classes with just a few training examples. The goal was to find an efficient and accurate method for detecting chest diseases, in particular those for which only a small number of training images exist. In Few-Shot Learning, a learner is trained on several related tasks during the meta-training phase so that it can generalise well to unseen (but related) tasks with just a few examples during the meta-testing phase. In an N-way K-shot task, the model must classify new examples among N classes given only K labelled examples per class.
There are four main reasons why few-shot learning is an ideal choice for the medical field:
● Sources of medical images are limited and not readily available in the public domain.
● Manual labelling of data is time consuming, not always practical, and needs medical experts.
● Some diseases are rare and simply do not have enough data available.
● A few-shot model is more robust when asked to predict a new disease: it can quickly adapt with just a few training examples.
We use a Siamese Network combined with Transfer Learning. This suits few-shot learning because during the training phase we are not interested in learning class labels; instead, we teach the model to decide whether two images are similar. So if we have a rare disease, say X, for which we only have a few training examples, and there are other similar disease classes for which we do have a large amount of data, we can use those classes as base classes. When the model is given a query example of the rare disease X, which it may never have seen before, it can still compare it with the classes in our support set and select the class with the smallest Euclidean distance to the query image.
Siamese Networks have become popular in recent years, especially for tasks related to image similarity or text similarity. The
network architecture consists of two identical neural networks that take in two inputs, producing embeddings that are
compared using a distance metric. However, one issue with Siamese Networks is the identification of "hard pairs" or pairs
that are difficult to distinguish. The identification of hard pairs is crucial for improving the performance of Siamese Networks.
Hard pairs represent cases where the network is struggling to learn and may be a source of error. By identifying these pairs,
we can retrain the network to improve its performance. The main challenge in identifying hard pairs is determining which
pairs are truly hard. Simply selecting pairs with large distances may not be effective, as some pairs may have large distances
due to noise in the data rather than true differences in similarity.
Several approaches have been proposed for identifying hard pairs in Siamese Networks. These include distance-based
sampling and margin-based sampling. However, the effectiveness of these techniques has not been thoroughly evaluated. In
this research report, we investigate various CNN architectures for Siamese Networks and compare their performance and
computation time. We propose a method for identifying hard pairs based on Euclidean distance differences and evaluate its
effectiveness.
We used the COVID-19 Radiography dataset, which contains three classes: COVID-19, Normal and Pneumonia. We used NIH as an auxiliary dataset, consisting of 8 classes, to get more variety in X-ray classes. We see only a very small improvement in the 3-way setting, as the base model accuracy was already quite high, but the improvement over the base model grows as N increases: around a 2% increase in accuracy for 10-way and about 3% for 20-way. We also compared some popular CNN architectures in terms of accuracy and computational efficiency. MobileNet and DenseNet are more efficient in terms of time taken per epoch, making them suitable for real-time or resource-constrained applications. On the other hand, VGG-16 and ResNet achieved better accuracy, comparable to each other, but were relatively slower than MobileNet and DenseNet.
Major contributions:
● This is the first work that uses a Siamese Network with hard-pair sampling in the X-ray domain.
● Our research provides insights into the effectiveness of various CNN models for few-shot learning via Siamese Networks.
● We propose an effective method for identifying hard pairs. Our findings can help improve the performance of Siamese Networks on similarity tasks.
2. RELATED WORK
There are four main categories of few-shot learning methods: transfer-learning based, meta-learning based, data-augmentation based, and multimodal based. Transfer-learning-based methods transfer the knowledge learned from training on a source domain and then fine-tune it on the required target tasks. Meta-learning-based methods employ past prior knowledge to guide the learning of new tasks. Data augmentation is used when the amount of data is small: we augment the data by rotation, cropping, etc. to generate new examples. Multimodal methods use auxiliary information such as text, audio and video to make up for the lack of data. The advantages and disadvantages of choosing one method over another are discussed in Table 1.
Method             | Advantages                                                  | Disadvantages
Transfer Learning  | Transfer of useful prior knowledge; alleviates overfitting  | Negative transfer
Table 1. Advantages and disadvantages of few-shot learning methods
3.1 Dataset
The COVID-19 Radiography dataset consists of chest X-ray images for COVID-19 positive cases along with Normal and Viral Pneumonia images. It was created by a team of researchers from Qatar University, Doha, Qatar and the University of Dhaka, Bangladesh, along with their collaborators from Pakistan and Malaysia, in collaboration with medical doctors.
In the current release (as of June 6, 2020), there are 219 COVID-19 positive images, 1341 normal images and 1345 viral pneumonia images. The dataset maintainers continue to update this database as new X-ray images of COVID-19 pneumonia patients become available.
3.2 Pre-processing
For pre-processing we resized all images to 100 x 100 x 3 and applied histogram equalization to normalize the contrast, which makes the image clearer in darker areas. We use no images of the COVID-19 class during training, as it is our novel class, and we use 30 images each from the remaining classes in the radiography dataset. To make up for the limited data we used NIH as an auxiliary dataset, serving as base classes. Although it has a significant number of images per class, we only took around 30-50 from each class. Figures 1, 2 and 3 show samples of the chest X-ray images. A minimal sketch of this preprocessing step is given below.
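The following is a minimal sketch of the preprocessing described above, assuming OpenCV and NumPy; the function name and the grayscale-to-3-channel replication are our illustrative choices rather than details from the paper.

import cv2
import numpy as np

def preprocess_xray(path):
    # Load as a single channel: chest X-rays carry no colour information.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Resize to the 100 x 100 target size used in the paper.
    img = cv2.resize(img, (100, 100))
    # Histogram equalization normalizes contrast, clarifying darker regions.
    img = cv2.equalizeHist(img)
    # Replicate the channel to obtain the 100 x 100 x 3 input shape
    # (an assumption: ImageNet-pretrained backbones expect 3 channels).
    img = np.stack([img] * 3, axis=-1)
    return img.astype(np.float32) / 255.0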
We use a Siamese Network combined with Transfer Learning. We chose a metric-based approach over the other alternatives in meta-learning, such as optimization-based and model-based methods, because it works very well when combined with transfer learning: if two images are similar, their feature vectors should also be similar, so a good feature extractor directly improves the Siamese network's performance, and we can obtain a good feature extractor through transfer learning or pretraining on a similar task.
So, by combining transfer learning with the Siamese Network we can increase performance and save a lot of time by using ImageNet or NIH weights as initializers, as sketched below.
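As an illustration, the feature extractor could be initialized from ImageNet weights as follows in Keras; the pooling layer and the 128-dimensional embedding head are our assumptions, since the paper does not specify them.

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_embedding_model(input_shape=(100, 100, 3), embed_dim=128):
    # Initialize the backbone from ImageNet weights instead of random weights.
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    # Pool the convolutional features and project to a d-dimensional embedding;
    # the pooling choice and embed_dim=128 are illustrative assumptions.
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(embed_dim)(x)
    return Model(base.input, x, name="embedding")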
An important choice we had to make was the combination of base and novel classes. The main goal is to transfer the knowledge of the base classes to the novel classes: the feature extractor is trained on the base classes, and we then make few-shot predictions on the novel classes.
Figure 6 shows the architecture of our model. We do not have to choose hard pairs for every epoch. For most epochs we still pick pairs randomly, but after every 10th epoch we use the model trained so far to find and generate hard pairs equal to half our batch size; the other half is selected randomly. To select hard pairs from the same class, we randomly choose a class label, iterate over all the pairs from that class, and use the model trained so far to predict which of them have the largest margin between them. Note that we do not always select the pair with the largest margin, as it might just be an exception or noise. Always selecting the maximum- or minimum-distance pairs could overfit the model to outliers, so we defined a threshold and selected randomly among the pairs that crossed it.
To find hard pairs from different classes, it is not optimal to generate all possible pairs from the dataset, because that would require a lot of time and memory. Instead, we select two random class labels, form all possible pairs across those two classes, and use the model trained up to the current epoch to find the pairs whose feature vectors are closer than our threshold value. A sketch of this mining procedure follows.
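Below is a sketch of the mining procedure under these assumptions: embed_fn stands in for the model trained so far, and the threshold and sampling sizes are illustrative.

import itertools
import random
import numpy as np

def mine_hard_pairs(embed_fn, images, labels, threshold, n_pairs):
    # embed_fn: the model trained so far, mapping a batch of images to
    # embedding vectors of shape (N, d).
    emb = np.asarray(embed_fn(images))

    # Hard same-class pairs: distance ABOVE the threshold within one
    # randomly chosen class.
    c = random.choice(sorted(set(labels)))
    idx = [i for i, y in enumerate(labels) if y == c]
    hard_pos = [(a, b) for a, b in itertools.combinations(idx, 2)
                if np.linalg.norm(emb[a] - emb[b]) > threshold]

    # Hard different-class pairs: distance BELOW the threshold across two
    # randomly chosen classes (cheaper than enumerating all cross-class pairs).
    c1, c2 = random.sample(sorted(set(labels)), 2)
    idx1 = [i for i, y in enumerate(labels) if y == c1]
    idx2 = [i for i, y in enumerate(labels) if y == c2]
    hard_neg = [(a, b) for a in idx1 for b in idx2
                if np.linalg.norm(emb[a] - emb[b]) < threshold]

    # Sample randomly among threshold-crossing pairs rather than always taking
    # the extremes, so the model does not overfit to outliers.
    random.shuffle(hard_pos)
    random.shuffle(hard_neg)
    return hard_pos[:n_pairs // 2], hard_neg[:n_pairs // 2]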
Figure 6. Overview of the model
We are not using the classical N-way K-shot setup, where there are N classes with K examples each. Because we have only 3 query classes, a 20-way few-shot task means that if the query example is a COVID-19 image, we generate 20 pairs such that only one pair is (COVID-19, COVID-19), while in the remaining 19 pairs the query is paired with some other class such as Normal or Pneumonia. We do this to make the model's task harder, as there are now 20 pairs with only one possible correct answer for the model to choose from. A sketch of this pair-generation scheme follows.
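The following sketch illustrates this modified N-way pair generation; the helper name and arguments are our own.

import random

def make_n_way_pairs(query_img, query_label, support_imgs, support_labels, n_way=20):
    # Exactly one positive pair: the query with an image of its own class.
    same = [img for img, y in zip(support_imgs, support_labels) if y == query_label]
    other = [img for img, y in zip(support_imgs, support_labels) if y != query_label]
    pairs = [(query_img, random.choice(same))]
    # n_way - 1 negative pairs with other classes (assumes enough such images).
    for img in random.sample(other, n_way - 1):
        pairs.append((query_img, img))
    random.shuffle(pairs)
    return pairs  # n_way pairs, only one of which is the correct match

For a COVID-19 query in the 20-way setting, this yields one (COVID-19, COVID-19) pair and 19 mismatched pairs, as described above.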
The Siamese network is a neural network architecture that is often used for few-shot learning tasks, where there are only a
few labelled examples available for each class. The goal of few-shot learning is to train a system that can accurately classify
new examples, even if there are only a few labelled examples available for each class. The Siamese network consists of two
identical sub-networks that share weights. Each sub-network takes as input an example from the dataset and produces a fixed-
length embedding vector. During training, the network is trained to produce similar embeddings for examples that belong to
the same class and dissimilar embeddings for examples that belong to different classes.
To classify a new example with few labelled examples available, the Siamese network is used to compute the distance
between the query example and each example in the support set. The distance metric used is typically the Euclidean distance
between the embedding vectors produced by the Siamese network. The class label of the query example is then predicted
based on the class that has the smallest mean distance to the query example.
Below is the algorithm we propose for few-shot learning via a Siamese Network, using 'imagenet' weights as the starting point.
1. For each example in the dataset, let x_i be the input image or data, and let y_i be the label.
2. The Siamese network consists of two identical sub-networks that share the weights θ; since the weights are shared, both can be represented by the same function f_θ(x).
3. Given a pair of examples (x_i, x_j) and their corresponding labels (y_i, y_j), the output of the Siamese
network is a pair of embedding vectors h_i = f_θ(x_i) and h_j = f_θ(x_j), where h_i and h_j are d-dimensional
vectors.
4. The distance between two embedding vectors h_i and h_j can be computed using a distance metric such as
Euclidean distance:
d(h_i, h_j) = ||h_i - h_j||
where ||.|| is the L2-norm.
5. The loss function for training the Siamese network is typically a contrastive loss, which encourages the
embeddings for the same class to be closer together than embeddings for different classes. The contrastive loss for a
pair of examples (x_i, x_j) and their labels (y_i, y_j) can be defined as:
L(h_i, h_j, y_i, y_j) = (1 - y_i) * d(h_i, h_j)^2 + y_i * max(0, m - d(h_i, h_j))^2
where m is a margin parameter that controls the distance threshold between same-class and different-class pairs, y_i = 0 indicates a same-class pair, and y_i = 1 indicates a different-class pair. If y_i = 0, the loss encourages the distance between h_i and h_j to be small; if y_i = 1, it encourages the distance to be at least m.
6. To classify a new example q, we first compute its embedding vector h_q = f_θ(q). We then form a pair between q and each example x in the support set, giving n pairs, where n is the number of examples in the support set.
7. For each class c in the support set, we compute the mean of the distances between the query example q and
the examples x in the support set that belong to class c:
μ_c = mean(d(h_q, h_x) for all examples x in the support set that belong to class c)
where h_x = f_θ(x) is the embedding vector for example x.
8. We assign the query example q to the class c with the smallest mean distance μ_c:
y = argmin_c μ_c
9. The accuracy of the few-shot learning system on a test set can be computed as:
accuracy = (number of correct predictions) / (total number of predictions)
Once training is complete, the Siamese Network can be used to predict the similarity between new pairs of inputs by passing them through the sub-networks and computing the distance between their feature vectors. A minimal sketch of the contrastive loss and the classification step is given below.
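The following sketch implements the contrastive loss from step 5 and the mean-distance classification rule from steps 6-8, assuming TensorFlow and NumPy; embed_fn stands in for f_θ and the margin value is an assumption.

import numpy as np
import tensorflow as tf

def contrastive_loss(y, d, margin=1.0):
    # Step 5 above, with y = 0 for same-class pairs and y = 1 for
    # different-class pairs; margin=1.0 is an assumed value.
    return (1.0 - y) * tf.square(d) + y * tf.square(tf.maximum(0.0, margin - d))

def classify_query(embed_fn, query, support_imgs, support_labels):
    # Steps 6-8: assign the query to the class whose support examples have the
    # smallest mean Euclidean distance in embedding space.
    h_q = np.asarray(embed_fn(query[None, ...]))[0]
    dists = {}
    for img, y in zip(support_imgs, support_labels):
        h_x = np.asarray(embed_fn(img[None, ...]))[0]
        dists.setdefault(y, []).append(float(np.linalg.norm(h_q - h_x)))
    means = {c: float(np.mean(v)) for c, v in dists.items()}
    return min(means, key=means.get)  # y = argmin_c mu_c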
The following algorithm generates a set of "hard" pairs that are particularly challenging for the Siamese network to distinguish, which can be useful for further fine-tuning the network or for evaluating its performance. Below is the algorithm we propose to sample the hard pairs, followed by a Python sketch.
1. Set batch_size, num_classes, and threshold_distance to desired values.
2. Initialise the Siamese network weights (ImageNet weights in our setup).
3. Repeat for a fixed number of epochs:
   a. If the epoch is a multiple of 10:
      i. For each class c in num_classes:
         1. Create a list of all pairs (a, b) where a and b are examples from class c.
         2. Sort the list in decreasing order of the distance between the Siamese network's embeddings of a and b.
         3. Select the top half of the sorted list as the "hard" pairs and the bottom half as the "random" pairs.
         4. Shuffle the hard and random pairs and combine them to form the mini-batch for this epoch.
   b. Else:
      i. Randomly sample pairs from the entire dataset to form the mini-batch for this epoch.
   c. Train the Siamese network on the mini-batch using the contrastive loss function.
4. After training, use the Siamese network to predict the similarity between all pairs in the dataset.
5. For each class c in num_classes:
   a. Create a list of all pairs (a, b) where a and b are examples from class c.
   b. Sort the list in decreasing order of the distance between the Siamese network's embeddings of a and b.
   c. For each pair (a, b) in the sorted list:
      i. If the distance is greater than threshold_distance, add (a, b) to the list of hard pairs.
      ii. Else, add (a, b) to the list of random pairs.
6. Return the list of hard pairs.
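A sketch of this training schedule follows, reusing the mine_hard_pairs helper sketched earlier; index_pairs is assumed to be the list of all candidate index pairs, and all names are illustrative.

import random

def training_batches(embed_fn, index_pairs, images, labels,
                     batch_size, threshold, num_epochs):
    # index_pairs: all candidate (i, j) pairs over the dataset indices.
    for epoch in range(1, num_epochs + 1):
        if epoch % 10 == 0:
            # Every 10th epoch: half the batch comes from mined hard pairs.
            hard_pos, hard_neg = mine_hard_pairs(
                embed_fn, images, labels, threshold, batch_size // 2)
            hard = (hard_pos + hard_neg)[:batch_size // 2]
            batch = hard + random.sample(index_pairs, batch_size - len(hard))
        else:
            # Otherwise: pairs are sampled uniformly at random.
            batch = random.sample(index_pairs, batch_size)
        random.shuffle(batch)
        yield epoch, batch  # train on this batch with the contrastive loss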
4. RESULTS AND DISCUSSION
Figure 7 compares train loss vs. validation loss and train accuracy vs. validation accuracy. The X-axis shows the number of epochs and the Y-axis shows loss/accuracy. As the gap between train and validation loss/accuracy is quite small, we can infer that the model is not overfitting and generalises well.
Figure 7. Loss and accuracy plots for the train and validation sets
Table 3 compares the base model with our model; both use the VGG-16 architecture. We see only a very small improvement in the 3-way setting, as the base model accuracy was already quite high, but the improvement over the base model grows as N increases: around a 2% increase in accuracy for 10-way and about 3% for 20-way.

N-Value | Siamese + Transfer (Base) | Our Model
3-way   | 93.33%                    | 93.899%
10-way  | 86.667%                   | 88.53%
20-way  | 78.775%                   | 81.615%
Table 3. Accuracy of the base model vs. our model (VGG-16 variant)
We also tried our model with some popular CNN architectures and compare the results in Table 4 below.
The results show that VGG-16, ResNet, MobileNet, and DenseNet achieved competitive accuracy levels for all three
classification tasks. Scores of VGG-16 and ResNet were slightly better than MobileNet and DenseNet. However, MobileNet
and DenseNet are relatively faster in terms of time taken per epoch compared to VGG-16 and ResNet. Specifically,
MobileNet takes only 3 minutes per epoch, while VGG-16 and ResNet take 11 minutes per epoch. DenseNet takes 5 minutes
per epoch, which is faster than VGG-16 and ResNet but slower than MobileNet.
Overall, MobileNet and DenseNet are more efficient in terms of time taken per epoch, making them suitable for real-time or resource-constrained applications. On the other hand, VGG-16 and ResNet achieved somewhat better accuracy than MobileNet and DenseNet but are relatively slower.
Limitations and open challenges:
● 2D images cannot truly reflect the 3D structural information of the human body.
● Adjusting to domain shift, as current few-shot models perform poorly when the domain shifts.
● There is a need for multimodal technologies in the medical field, such as pairing X-rays with a list of symptoms or a doctor's diagnosis report, which would help increase the accuracy and reliability of the model.
● Certain classes of diseases are simply hard to predict from a few examples, no matter how sophisticated the model is.
5. CONCLUSION
In this report we discussed various strategies used for few-shot learning and compared their advantages and disadvantages. We then reviewed recent work in the medical domain. We discussed our hypothesis of utilizing hard pairs to make our model more robust, and we obtained around a 2-3% improvement in accuracy over our base model. We also compared various CNN architectures in terms of effectiveness and training time and found that VGG-16 had the best accuracy while MobileNet had the lowest training time.